100% satisfaction guarantee Immediately available after payment Both online and in PDF No strings attached 4.6 TrustPilot
logo-home
Summary

Summary - Big Data (5294BIDA6Y)

Rating
-
Sold
-
Pages
21
Uploaded on
06-09-2025
Written in
2024/2025

Summary of big data lectures and literature.

Institution
Course










Whoops! We can’t load your doc right now. Try again or contact support.

Written for

Institution
Study
Course

Document information

Uploaded on
September 6, 2025
Number of pages
21
Written in
2024/2025
Type
Summary

Subjects

Content preview

Week 1 – Introduction & Foundations
Storing data – then vs now → Now way cheaper and possible to store more data.
Reasons for a data shift → Data generation exploded!

Previously:
-​ Data was traditionally used for measurement
-​ Descriptive backwards facing view about what happened in the past
-​ Businesses captured well understood, well-defined transaction data (e.g., data about orders
and payments)
-​ IT department had monopoly on access to data (end users had to go through IT via ticket
systems for data analysis)
Nowadays:
-​ Data leveraged for strategic analysis, centred on growth
-​ Data is used in a predictive forward facing function
-​ Advent of the web and mobile phones produces unprecedented amount of much less
structured and less defined interaction data
-​ Data centrally stored in the cloud, IT department manages the cloud
-​ End users can directly access and analyse data

The 4 V’s of big data:
1.​ Volume: we have to process a lot of data
2.​ Velocity (snelheid): the data is arriving very fast
3.​ Variety: we have structured, semi-structured and unstructured data from many different
sources
4.​ Veracity (waarheidsgetrouw): we have data of highly varying quality and trustworthiness

Challenges with Volume & Velocity

Can’t we just use lots of computers to process lots of data really fast? → Turns out, programming
distributed systems is really hard (=working with lots of computers). Think about coordination,
concurrency, fault tolerance. → We need ways to write simple but efficient programs which execute
in parallel on large datasets.

Challenges with Variety & Veracity

Can’t we just feed all our data into ml models which find the right answer for us? → No, most data
scientists spend most of their time preparing, cleaning and organizing data instead of analyzing data
and training models. → Many data-driven ML applications are found to reproduce and amplify
(versterken) existing bias and discrimination

Parallelism

Task parallelism (“multi-tasking”) → Execute many independent tasks at once.
-​ E.g., operating system executing different processes at once on a multi-core machine
Data parallelism → Execute the same task in parallel on different slices of the data
-​ E.g., query processing in modern cloud databases which store partitions of the data on
different machines
-​ Think about cars going through border control


1

,Pipeline parallelism → Break a task into a sequence of processing stages. Each stage takes result
from the previous stage as input, with results being passed downstream immediately.
-​ E.g., instruction pipelining in modern CPUs
-​ Think about assembly line (lopende band werk)

Scalability

Scalability → Ability of a system to handle a growing amount of work by adding resources to the
system. Often distinguished how resources are added:
-​ Scale-Up: Replace machine with “beefier”machine (eg., more RAM, more cores)
-​ Scale-Out: Add more machines of the same type
The desired goal in practice:
-​ Linear scalability with number of machines / cores in scale-out settings
-​ “Elastic” scaling in cloud environments

Think before you scale! Scalability != performance
-​ A common misconception is that scalable systems are also automatically performant.
-​ Scalability often comes with increased overheads, especially in distributed settings (e.g.,
network communication, coordination overhead)
-​ This means that as a system scales (i.e., adds more computers or resources to handle larger
workloads), it often faces extra costs or inefficiencies.
-​ Network communication: more machines mean more data being sent between them,
which can slow things down
-​ Coordination overhead: managing tasks across multiple machines (eg., ensuring
they work together correctly) requires extra effort, adding complexity and delays.

McSHerry et al. Single-threaded Rust program on Macbook outperforms many distributed systems
using 100s of cores in graph processing workloads

Week 2 – Relational Data Processing

SELECT [columns you want to see]
FROM [table you start with]
JOIN [table you want to combine with]
ON [shared attribute/key]
WHERE [condition that filters rows]
GROUP BY [attribute you want to groupby];

Relational operators
Projection: This operator modifies each row of a table individually
-​ Remove columns, add new columns by evaluating expressions
-​ SELECT Name, YEAR(DateOfBirth) AS YearofBirth, FROM Customers
Selection: This operator removes individual rows from a table. Only rows that match a given
boolean predicate remain
-​ SELECT * (every column) FROM Years WHERE YearOfBirth >= 1900
(Grouped) aggregation: This operator aggregates information across multiple rows. Computes an
aggregate value (eg., sum) across the rows of each group
-​ Groups defined by a grouping key (otherwise, whole table is aggregated


2

, -​ SELECT MIN(YearOfBirth) FROM Years GROUP BY Country
Join: This operator combines information from two tables. Tables are typically joined by a key. If no
key is given, a join produces the Cartesian product (all pairs of rows).
-​ SELECT Students.Name, ExamResults.Grade FROM Students JOIN ExamResults ON (ID =
StudentID)

Life of a relational database query
1.​ The SQL query is parsed and translated into a logical representation
2.​ The logical representation is transformed into an “optimal” query plan
3.​ The plan is evaluated by calling the sequence of operators, which read in the data and produce
the result
4.​ The result is returned to the user

Example: Find tasty fish from a table: SELECT breed, qty FROM fishtank WHERE tasty = TRUE

Distributed Database Fundamental

Distributed Database → Simply spoken, a distributed database is a database that is spread
(“distributed”) across multiple machines.
-​ Important: For an end-user, interacting with a distributed database should be indistinguishable
from a non-distributed one

Why do we distribute data?
Performance: With data sizes growing exponentially, the need for fast data processing is outgrowing
individual machines
Elasticity: The database can be quickly & flexibly scaled to fit the requirements by adding (or
removing) resources
Fault-tolerance: Running on more than one node (=individual computer or server in a distributed
system) allows the system to better recover from hardware failures

Classifying distributed databases

1.​ Scalability: Scale-Up vs Scale-Out
Scaling-up: Move the database to a bigger box (faster CPU, more cores, more memory, faster disk,
FPGAs/GPUs)
-​ Typically better performance, but expensive to buy & inflexible to scale
Scaling-out: Distribute the database across multiple nodes
-​ Often slower (due to operational overhead), but a lot cheaper, more flexible, and more
fault-tolerant

2.​ Implementation: Parallel vs Distributed
Parallel database
-​ Runs on tightly-coupled nodes (eg., a cluster, or a multi-processor / multi-core system)
-​ A single database system that runs on multiple CPUs/cores in one location
-​ Implementation focus on multi-threading, inter-process communication
-​ multi-threading: Technique where a single program (process) runs multiple smaller
tasks (threads) at the same time.



3

Get to know the seller

Seller avatar
Reputation scores are based on the amount of documents a seller has sold for a fee and the reviews they have received for those documents. There are three levels: Bronze, Silver and Gold. The better the reputation, the more your can rely on the quality of the sellers work.
samirahbakker1107 Universiteit van Amsterdam
Follow You need to be logged in order to follow users or courses
Sold
14
Member since
4 months
Number of followers
0
Documents
12
Last sold
1 month ago

3.7

3 reviews

5
2
4
0
3
0
2
0
1
1

Recently viewed by you

Why students choose Stuvia

Created by fellow students, verified by reviews

Quality you can trust: written by students who passed their tests and reviewed by others who've used these notes.

Didn't get what you expected? Choose another document

No worries! You can instantly pick a different document that better fits what you're looking for.

Pay as you like, start learning right away

No subscription, no commitments. Pay the way you're used to via credit card and download your PDF document instantly.

Student with book image

“Bought, downloaded, and aced it. It really can be that simple.”

Alisha Student

Frequently asked questions