Week 1 – Introduction & Foundations
Storing data – then vs now → Storage is now far cheaper, making it possible to keep much more data.
Reasons for a data shift → Data generation exploded!
Previously:
- Data was traditionally used for measurement
- Descriptive, backwards-facing view of what happened in the past
- Businesses captured well understood, well-defined transaction data (e.g., data about orders
and payments)
- IT department had monopoly on access to data (end users had to go through IT via ticket
systems for data analysis)
Nowadays:
- Data leveraged for strategic analysis, centred on growth
- Data is used in a predictive, forward-facing function
- Advent of the web and mobile phones produces unprecedented amounts of much less
structured, less well-defined interaction data
- Data centrally stored in the cloud, IT department manages the cloud
- End users can directly access and analyse data
The 4 V’s of big data:
1. Volume: we have to process a lot of data
2. Velocity: the data is arriving very fast
3. Variety: we have structured, semi-structured and unstructured data from many different
sources
4. Veracity: we have data of highly varying quality and trustworthiness
Challenges with Volume & Velocity
Can’t we just use lots of computers to process lots of data really fast? → It turns out that
programming distributed systems (i.e., working with lots of computers) is really hard. Think about
coordination, concurrency, and fault tolerance. → We need ways to write simple but efficient
programs that execute in parallel on large datasets.
Challenges with Variety & Veracity
Can’t we just feed all our data into ML models which find the right answer for us? → No: most data
scientists spend most of their time preparing, cleaning, and organizing data instead of analyzing data
and training models. → Many data-driven ML applications have been found to reproduce and amplify
existing bias and discrimination.
Parallelism
Task parallelism (“multi-tasking”) → Execute many independent tasks at once.
- E.g., operating system executing different processes at once on a multi-core machine
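A minimal Python sketch of task parallelism (the two task names and their bodies are made up for illustration): two unrelated functions are handed to a thread pool and run at the same time.

    from concurrent.futures import ThreadPoolExecutor

    def resize_images():
        # First independent task (hypothetical workload)
        return "images resized"

    def send_emails():
        # Second independent task (hypothetical workload)
        return "emails sent"

    with ThreadPoolExecutor() as pool:
        # Submit both tasks; the scheduler runs them concurrently
        futures = [pool.submit(resize_images), pool.submit(send_emails)]
        for f in futures:
            print(f.result())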
Data parallelism → Execute the same task in parallel on different slices of the data
- E.g., query processing in modern cloud databases which store partitions of the data on
different machines
- Think of cars passing through multiple parallel booths at border control
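A minimal sketch of data parallelism in Python, assuming a toy workload (summing a list of numbers): the same function runs on different slices of the data in separate worker processes, and the partial results are combined at the end.

    from multiprocessing import Pool

    def total(chunk):
        # The *same* task, applied to one slice of the data
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]      # split into 4 slices
        with Pool(4) as pool:
            partial_sums = pool.map(total, chunks)   # one worker per slice
        print(sum(partial_sums))                     # combine partial results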
Pipeline parallelism → Break a task into a sequence of processing stages. Each stage takes the result
of the previous stage as input, with results being passed downstream immediately.
- E.g., instruction pipelining in modern CPUs
- Think of an assembly line
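A sketch of pipeline parallelism with two hypothetical stages connected by queues: stage 2 starts consuming results while stage 1 is still producing, just like stations on an assembly line.

    import threading, queue

    raw, parsed = queue.Queue(), queue.Queue()

    def stage1_clean():
        # Stage 1: clean each line and pass it downstream immediately
        while (line := raw.get()) is not None:
            parsed.put(line.strip().upper())
        parsed.put(None)                  # signal end of stream

    def stage2_print():
        # Stage 2: consumes while stage 1 is still producing
        while (line := parsed.get()) is not None:
            print(line)

    t1 = threading.Thread(target=stage1_clean)
    t2 = threading.Thread(target=stage2_print)
    t1.start(); t2.start()
    for line in ["  hello ", " pipeline ", " world "]:
        raw.put(line)
    raw.put(None)                         # end of input
    t1.join(); t2.join()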
Scalability
Scalability → The ability of a system to handle a growing amount of work by adding resources to the
system. We often distinguish by how the resources are added:
- Scale-Up: Replace the machine with a “beefier” machine (e.g., more RAM, more cores)
- Scale-Out: Add more machines of the same type
The desired goal in practice:
- Linear scalability with number of machines / cores in scale-out settings
- “Elastic” scaling in cloud environments
Think before you scale! Scalability != performance
- A common misconception is that scalable systems are also automatically performant.
- Scalability often comes with increased overheads, especially in distributed settings (e.g.,
network communication, coordination overhead)
- This means that as a system scales (i.e., adds more computers or resources to handle larger
workloads), it often faces extra costs or inefficiencies.
- Network communication: more machines mean more data being sent between them,
which can slow things down
- Coordination overhead: managing tasks across multiple machines (e.g., ensuring
they work together correctly) requires extra effort, adding complexity and delays.
McSherry et al.: a single-threaded Rust program on a MacBook outperforms many distributed systems
using hundreds of cores on graph-processing workloads.
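To make the overhead point concrete, here is a purely illustrative toy model (not from the lecture; the 5% per-machine coordination cost is an invented number): even a modest overhead makes the achieved speedup fall far short of linear as machines are added.

    def effective_speedup(machines, overhead=0.05):
        # Toy model: ideal speedup is linear, but every extra machine
        # adds a (made-up) 5% coordination cost
        return machines / (1 + overhead * (machines - 1))

    for n in [1, 10, 100]:
        print(n, "machines ->", round(effective_speedup(n), 1), "x speedup")
    # 1 machines -> 1.0 x speedup
    # 10 machines -> 6.9 x speedup
    # 100 machines -> 16.8 x speedup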
Week 2 – Relational Data Processing
SELECT [columns you want to see]
FROM [table you start with]
JOIN [table you want to combine with]
ON [shared attribute/key]
WHERE [condition that filters rows]
GROUP BY [attribute you want to group by];
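A runnable sketch of the template above, using Python’s built-in sqlite3 module; the Students/ExamResults tables and their contents are made up to mirror the join example later in these notes.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Students (ID INTEGER, Name TEXT);
        CREATE TABLE ExamResults (StudentID INTEGER, Grade INTEGER);
        INSERT INTO Students VALUES (1, 'Ada'), (2, 'Edgar');
        INSERT INTO ExamResults VALUES (1, 8), (1, 9), (2, 7);
    """)

    # SELECT / FROM / JOIN / ON / WHERE / GROUP BY, as in the template:
    rows = con.execute("""
        SELECT Students.Name, AVG(ExamResults.Grade)
        FROM Students
        JOIN ExamResults ON Students.ID = ExamResults.StudentID
        WHERE ExamResults.Grade >= 6
        GROUP BY Students.Name;
    """).fetchall()
    print(rows)   # e.g. [('Ada', 8.5), ('Edgar', 7.0)]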
Relational operators
Projection: This operator modifies each row of a table individually
- Remove columns, add new columns by evaluating expressions
- SELECT Name, YEAR(DateOfBirth) AS YearOfBirth FROM Customers
Selection: This operator removes individual rows from a table. Only rows that match a given
boolean predicate remain
- SELECT * (every column) FROM Years WHERE YearOfBirth >= 1900
(Grouped) aggregation: This operator aggregates information across multiple rows. It computes an
aggregate value (e.g., a sum) across the rows of each group.
- Groups are defined by a grouping key (otherwise, the whole table is aggregated)
- SELECT Country, MIN(YearOfBirth) FROM Years GROUP BY Country
Join: This operator combines information from two tables. Tables are typically joined by a key. If no
key is given, a join produces the Cartesian product (all pairs of rows).
- SELECT Students.Name, ExamResults.Grade FROM Students JOIN ExamResults ON (ID =
StudentID)
Life of a relational database query
1. The SQL query is parsed and translated into a logical representation
2. The logical representation is transformed into an “optimal” query plan
3. The plan is evaluated by calling the sequence of operators, which read in the data and produce
the result
4. The result is returned to the user
Example: Find tasty fish from a table: SELECT breed, qty FROM fishtank WHERE tasty = TRUE
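A small sketch of this life cycle using sqlite3: EXPLAIN QUERY PLAN exposes the plan the engine chose (steps 1–2), and executing the query evaluates that plan and returns the result (steps 3–4). The fishtank table and its rows are invented to match the example.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fishtank (breed TEXT, qty INTEGER, tasty BOOLEAN)")
    con.execute("INSERT INTO fishtank VALUES ('salmon', 3, TRUE), ('pufferfish', 1, FALSE)")

    # Steps 1-2: the engine parses the SQL and picks a plan; we can inspect it
    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT breed, qty FROM fishtank WHERE tasty = TRUE"
    ).fetchall()
    print(plan)   # e.g. [(2, 0, 0, 'SCAN fishtank')] -- a full table scan

    # Steps 3-4: the plan is evaluated and the result returned to the user
    print(con.execute(
        "SELECT breed, qty FROM fishtank WHERE tasty = TRUE"
    ).fetchall())  # [('salmon', 3)]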
Distributed Database Fundamentals
Distributed Database → Simply put, a distributed database is a database that is spread
(“distributed”) across multiple machines.
- Important: For an end-user, interacting with a distributed database should be indistinguishable
from a non-distributed one
Why do we distribute data?
Performance: With data sizes growing exponentially, the need for fast data processing is outgrowing
individual machines
Elasticity: The database can be quickly & flexibly scaled to fit the requirements by adding (or
removing) resources
Fault-tolerance: Running on more than one node (=individual computer or server in a distributed
system) allows the system to better recover from hardware failures
Classifying distributed databases
1. Scalability: Scale-Up vs Scale-Out
Scaling-up: Move the database to a bigger box (faster CPU, more cores, more memory, faster disk,
FPGAs/GPUs)
- Typically better performance, but expensive to buy & inflexible to scale
Scaling-out: Distribute the database across multiple nodes
- Often slower (due to operational overhead), but a lot cheaper, more flexible, and more
fault-tolerant
2. Implementation: Parallel vs Distributed
Parallel database
- Runs on tightly-coupled nodes (e.g., a cluster, or a multi-processor / multi-core system)
- A single database system that runs on multiple CPUs/cores in one location
- Implementation focuses on multi-threading and inter-process communication
- Multi-threading: a technique where a single program (process) runs multiple smaller
tasks (threads) at the same time.
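A minimal sketch of what multi-threading means here: one process, several threads sharing the same memory, coordinating their writes with a lock (the counter workload is made up for illustration).

    import threading

    counter = 0
    lock = threading.Lock()

    def work(n):
        global counter
        for _ in range(n):
            with lock:    # threads share one address space, so writes must be coordinated
                counter += 1

    # One process, several threads operating on shared memory
    threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)        # 40000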