Week 1: Introduction & Foundations
● The Shift in Analytics: The focus of data analytics has moved from
backward-looking measurement of well-defined transaction data to forward-facing,
predictive analysis using unstructured interaction data from the web and mobile
devices.
● The Four V's of Big Data: Working with Big Data means dealing with Volume
(massive amounts of data), Velocity (data arriving at high speed), Variety
(structured, semi-structured, and unstructured data), and Veracity(varying levels of
trustworthiness).
● Parallelism:
○ Task Parallelism: Executing many independent tasks at once, like an
operating system multi-tasking.
○ Data Parallelism: Executing the same task simultaneously on different slices
of data.
○ Pipeline Parallelism: Breaking a task into a sequence of stages where
results are passed downstream immediately.
● Scalability vs. Performance: Scalability is the ability to handle more work by adding
resources. However, scalable systems are not automatically performant; distributed
systems have heavy network and coordination overheads, meaning a
single-threaded program can sometimes outperform a distributed system.
● Scaling Types: Scale-Up (Vertical) involves replacing a machine with a more
powerful one, while Scale-Out(Horizontal) involves adding more machines of the
same type to a cluster.
Week 2: Relational Data Processing
● Relational Operators: Relational databases use fundamental operators such as
Projection (removing/adding columns), Selection (filtering rows), Aggregation
(summarizing grouped data), and Join (combining tables).
● Distributed Architectures:
○ Shared Memory: Nodes share CPU and disk, typical for scale-up parallel
databases.
○ Shared Disk: Nodes have their own memory but share a disk.
○ Shared Nothing: Data is spread across independent nodes communicating
only via network, typical for scale-out web-scale systems.
● Distributed Query Processing: Queries require data shuffling primitives like
Broadcasting, Range Partitioning, and Hash Partitioning to move data to the right
nodes.
● Distributed Joins: Because matching rows must end up on the same node, systems
use various strategies: Co-Located Join (no reshuffling needed),
Asymmetric/Symmetric Repartition Join (hash-partitioning one or both tables), or
a Broadcast Join (sending a small table to all nodes).
Week 3: MapReduce
● The Shift in Analytics: The focus of data analytics has moved from
backward-looking measurement of well-defined transaction data to forward-facing,
predictive analysis using unstructured interaction data from the web and mobile
devices.
● The Four V's of Big Data: Working with Big Data means dealing with Volume
(massive amounts of data), Velocity (data arriving at high speed), Variety
(structured, semi-structured, and unstructured data), and Veracity(varying levels of
trustworthiness).
● Parallelism:
○ Task Parallelism: Executing many independent tasks at once, like an
operating system multi-tasking.
○ Data Parallelism: Executing the same task simultaneously on different slices
of data.
○ Pipeline Parallelism: Breaking a task into a sequence of stages where
results are passed downstream immediately.
● Scalability vs. Performance: Scalability is the ability to handle more work by adding
resources. However, scalable systems are not automatically performant; distributed
systems have heavy network and coordination overheads, meaning a
single-threaded program can sometimes outperform a distributed system.
● Scaling Types: Scale-Up (Vertical) involves replacing a machine with a more
powerful one, while Scale-Out(Horizontal) involves adding more machines of the
same type to a cluster.
Week 2: Relational Data Processing
● Relational Operators: Relational databases use fundamental operators such as
Projection (removing/adding columns), Selection (filtering rows), Aggregation
(summarizing grouped data), and Join (combining tables).
● Distributed Architectures:
○ Shared Memory: Nodes share CPU and disk, typical for scale-up parallel
databases.
○ Shared Disk: Nodes have their own memory but share a disk.
○ Shared Nothing: Data is spread across independent nodes communicating
only via network, typical for scale-out web-scale systems.
● Distributed Query Processing: Queries require data shuffling primitives like
Broadcasting, Range Partitioning, and Hash Partitioning to move data to the right
nodes.
● Distributed Joins: Because matching rows must end up on the same node, systems
use various strategies: Co-Located Join (no reshuffling needed),
Asymmetric/Symmetric Repartition Join (hash-partitioning one or both tables), or
a Broadcast Join (sending a small table to all nodes).
Week 3: MapReduce