Date Location Topic Notes Reading
Week 1
2022-08-30
13:00-15:00
Sal-B Introduction [slides] [printable]
[video 2021]
The NIST Definition of Cloud Computing [pdf]
Above the Clouds: A Berkeley View of Cloud Computing [pdf]
A Comparative Taxonomy and Survey of Public Cloud Infrastructure Vendors [pdf]
2022-08-31
13:00-15:00
Sal-B Storage
(GFS)
[slides] [printable]
[video 2021]
[lab1]
The Google File System [pdf]
Week 2
2022-09-06
13:00-15:00
Sal-B Storage
(Bigtable, Cassandra, Neo4j)
[slides] [printable]
[video 2021]
Bigtable: A Distributed Storage System for Structured Data [pdf]
Cassandra: A Decentralized Structured Storage System [pdf]
Graph Databases (Ch. 3, 6)
Neo4j Documentation [link]
2022-09-08
15:00-17:00
Sal-B Scala [slides] [printable]
[video 2021]
Scala By Example [pdf]
Week 3
2022-09-13
13:00-15:00
Sal-B Parallel Data Processing
(MapReduce, FlumeJava)
[slides] [printable]
[video 2021]
MapReduce: Simplified Data Processing on Large Clusters [pdf]
FlumeJava: Easy, Efficient Data-Parallel Pipelines [pdf]
Data-Intensive Text Processing with MapReduce (Ch. 2-3)
2022-09-15
15:00-17:00
Sal-B Parallel Data Processing
(Spark)
[slides] [printable]
[video 2021]
[lab2] [lab2 src]
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [pdf]
Spark - The Definitive Guide (Ch. 2, 12-14)
Learning Spark (Ch. 1-2)
Spark Documentation [link]
2022-09-16
10:00-12:00
Sal-B Lab 1 [slides]
Week 4
2022-09-20
13:00-15:00
Sal-B Structured Data Processing
(Spark SQL)
[slides] [printable]
[video 2021]
Spark SQL: Relational Data Processing in Spark [pdf]
Spark - The Definitive Guide (Ch. 4-11)
Learning Spark (Ch. 3-6)
Spark SQL Documentation [link]
2022-09-22
15:00-17:00
Sal-B Stream Processing
(Introduction, Kafka)
[slides] [printable]
[video 2021]
[lab3] [lab3 src]
Kafka: a Distributed Messaging System for Log Processing [pdf]
Kafka Documentation [link]
A Survey on the Evolution of Stream Processing Systems [pdf]
2022-09-23
10:00-12:00
Sal-B Lab 2 [slides]
Week 5
2022-09-27
13:00-15:00
Sal-B Stream Processing
(Spark Streaming, Beam)
[slides] [printable] [src]
[video 2021]
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters [pdf]
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark [pdf]
MillWheel: Fault-Tolerant Stream Processing at Internet Scale [pdf]
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing [pdf]
Spark - The Definitive Guide (Ch. 20-23)
Learning Spark (Ch. 8)
Spark Streaming Documentation [link]
Beam Documentation [link]
2022-09-29
15:00-17:00
Sal-B Graph Processing
(Pregel, GraphLab, GraphX)
[slides] [printable]
[video 2021]
[lab4] [lab4 src]
Pregel: A System for Large-Scale Graph Processing [pdf]
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud [pdf]
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs [pdf]
GraphX: Graph Processing in a Distributed Dataflow Framework [pdf]
Spark - The Definitive Guide (Ch. 30)
GraphX Documentation [link]
Week 6
2022-10-04
13:00-15:00
Sal-B Resource Management
(Mesos, YARN, Borg, Kubernetes)
[slides] [printable]
[video 2021]
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center [pdf]
Apache Hadoop YARN: Yet Another Resource Negotiator [pdf]
Large-Scale Cluster Management at Google with Borg [pdf]
2022-10-05
15:00-17:00
Sal-B Cloud Data Lakes [slides] [printable]
[video 2021]
Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores [pdf]
Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics [pdf]
Learning Spark (Ch. 9)
Week 7
2022-10-11
13:00-15:00
Sal-A Mohammadhossein Andjedani
Principal MLOps Engineer at King
[slides]