Week 1 – Introduction & Foundations
Storing data – then vs now → Storage is now far cheaper, making it possible to keep much more data.
Reasons for a data shift → Data generation exploded!
Previously:
- Data was traditionally used for measurement
- Descriptive, backwards-facing view of what happened in the past
- Businesses captured well understood, well-defined transaction data (e.g., data about orders
and payments)
- IT department had monopoly on access to data (end users had to go through IT via ticket
systems for data analysis)
Nowadays:
- Data leveraged for strategic analysis, centred on growth
- Data is used in a predictive, forward-facing function
- Advent of the web and mobile phones produces unprecedented amounts of much less
structured, less well-defined interaction data
- Data centrally stored in the cloud, IT department manages the cloud
- End users can directly access and analyse data
The 4 V’s of big data:
1. Volume: we have to process a lot of data
2. Velocity: the data is arriving very fast
3. Variety: we have structured, semi-structured and unstructured data from many different
sources
4. Veracity: we have data of highly varying quality and trustworthiness
Challenges with Volume & Velocity
Can’t we just use lots of computers to process lots of data really fast? → It turns out that
programming distributed systems (i.e., working with lots of computers) is really hard. Think about
coordination, concurrency, and fault tolerance. → We need ways to write simple but efficient
programs that execute in parallel on large datasets.
Challenges with Variety & Veracity
Can’t we just feed all our data into ML models which find the right answer for us? → No: most data
scientists spend most of their time preparing, cleaning, and organizing data instead of analyzing data
and training models. → Many data-driven ML applications have been found to reproduce and amplify
existing bias and discrimination.
Parallelism
Task parallelism (“multi-tasking”) → Execute many independent tasks at once.
- E.g., operating system executing different processes at once on a multi-core machine
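A minimal Python sketch of task parallelism (the two task names and their bodies are made up for illustration): two unrelated functions are handed to a thread pool and run at the same time.

    from concurrent.futures import ThreadPoolExecutor

    def resize_images():
        # First independent task (hypothetical workload)
        return "images resized"

    def send_emails():
        # Second independent task (hypothetical workload)
        return "emails sent"

    with ThreadPoolExecutor() as pool:
        # Submit both tasks; the scheduler runs them concurrently
        futures = [pool.submit(resize_images), pool.submit(send_emails)]
        for f in futures:
            print(f.result())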
Data parallelism → Execute the same task in parallel on different slices of the data
- E.g., query processing in modern cloud databases which store partitions of the data on
different machines
- Think of cars passing through multiple parallel booths at border control
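A minimal sketch of data parallelism in Python, assuming a toy workload (summing a list of numbers): the same function runs on different slices of the data in separate worker processes, and the partial results are combined at the end.

    from multiprocessing import Pool

    def total(chunk):
        # The *same* task, applied to one slice of the data
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]      # split into 4 slices
        with Pool(4) as pool:
            partial_sums = pool.map(total, chunks)   # one worker per slice
        print(sum(partial_sums))                     # combine partial results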
Pipeline parallelism → Break a task into a sequence of processing stages. Each stage takes the result
of the previous stage as input, with results being passed downstream immediately.
- E.g., instruction pipelining in modern CPUs
- Think of an assembly line
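A sketch of pipeline parallelism with two hypothetical stages connected by queues: stage 2 starts consuming results while stage 1 is still producing, just like stations on an assembly line.

    import threading, queue

    raw, parsed = queue.Queue(), queue.Queue()

    def stage1_clean():
        # Stage 1: clean each line and pass it downstream immediately
        while (line := raw.get()) is not None:
            parsed.put(line.strip().upper())
        parsed.put(None)                  # signal end of stream

    def stage2_print():
        # Stage 2: consumes while stage 1 is still producing
        while (line := parsed.get()) is not None:
            print(line)

    t1 = threading.Thread(target=stage1_clean)
    t2 = threading.Thread(target=stage2_print)
    t1.start(); t2.start()
    for line in ["  hello ", " pipeline ", " world "]:
        raw.put(line)
    raw.put(None)                         # end of input
    t1.join(); t2.join()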
Scalability
Scalability → The ability of a system to handle a growing amount of work by adding resources to the
system. We often distinguish by how the resources are added:
- Scale-Up: Replace the machine with a “beefier” machine (e.g., more RAM, more cores)
- Scale-Out: Add more machines of the same type
The desired goal in practice:
- Linear scalability with number of machines / cores in scale-out settings
- “Elastic” scaling in cloud environments
Think before you scale! Scalability != performance
- A common misconception is that scalable systems are also automatically performant.
- Scalability often comes with increased overheads, especially in distributed settings (e.g.,
network communication, coordination overhead)
- This means that as a system scales (i.e., adds more computers or resources to handle larger
workloads), it often faces extra costs or inefficiencies.
- Network communication: more machines mean more data being sent between them,
which can slow things down
- Coordination overhead: managing tasks across multiple machines (e.g., ensuring
they work together correctly) requires extra effort, adding complexity and delays.
McSherry et al.: a single-threaded Rust program on a MacBook outperforms many distributed systems
using hundreds of cores on graph-processing workloads.
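To make the overhead point concrete, here is a purely illustrative toy model (not from the lecture; the 5% per-machine coordination cost is an invented number): even a modest overhead makes the achieved speedup fall far short of linear as machines are added.

    def effective_speedup(machines, overhead=0.05):
        # Toy model: ideal speedup is linear, but every extra machine
        # adds a (made-up) 5% coordination cost
        return machines / (1 + overhead * (machines - 1))

    for n in [1, 10, 100]:
        print(n, "machines ->", round(effective_speedup(n), 1), "x speedup")
    # 1 machines -> 1.0 x speedup
    # 10 machines -> 6.9 x speedup
    # 100 machines -> 16.8 x speedup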
Week 2 – Relational Data Processing
SELECT [columns you want to see]
FROM [table you start with]
JOIN [table you want to combine with]
ON [shared attribute/key]
WHERE [condition that filters rows]
GROUP BY [attribute you want to group by];
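A runnable sketch of the template above, using Python’s built-in sqlite3 module; the Students/ExamResults tables and their contents are made up to mirror the join example later in these notes.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Students (ID INTEGER, Name TEXT);
        CREATE TABLE ExamResults (StudentID INTEGER, Grade INTEGER);
        INSERT INTO Students VALUES (1, 'Ada'), (2, 'Edgar');
        INSERT INTO ExamResults VALUES (1, 8), (1, 9), (2, 7);
    """)

    # SELECT / FROM / JOIN / ON / WHERE / GROUP BY, as in the template:
    rows = con.execute("""
        SELECT Students.Name, AVG(ExamResults.Grade)
        FROM Students
        JOIN ExamResults ON Students.ID = ExamResults.StudentID
        WHERE ExamResults.Grade >= 6
        GROUP BY Students.Name;
    """).fetchall()
    print(rows)   # e.g. [('Ada', 8.5), ('Edgar', 7.0)]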
Relational operators
Projection: This operator modifies each row of a table individually
- Remove columns, add new columns by evaluating expressions
- SELECT Name, YEAR(DateOfBirth) AS YearOfBirth FROM Customers
Selection: This operator removes individual rows from a table. Only rows that match a given
boolean predicate remain
- SELECT * (every column) FROM Years WHERE YearOfBirth >= 1900
(Grouped) aggregation: This operator aggregates information across multiple rows. It computes an
aggregate value (e.g., a sum) across the rows of each group.
- Groups are defined by a grouping key (otherwise, the whole table is aggregated)
- SELECT Country, MIN(YearOfBirth) FROM Years GROUP BY Country
Join: This operator combines information from two tables. Tables are typically joined by a key. If no
key is given, a join produces the Cartesian product (all pairs of rows).
- SELECT Students.Name, ExamResults.Grade FROM Students JOIN ExamResults ON (ID =
StudentID)
Life of a relational database query
1. The SQL query is parsed and translated into a logical representation
2. The logical representation is transformed into an “optimal” query plan
3. The plan is evaluated by calling the sequence of operators, which read in the data and produce
the result
4. The result is returned to the user
Example: Find tasty fish from a table: SELECT breed, qty FROM fishtank WHERE tasty = TRUE
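A small sketch of this life cycle using sqlite3: EXPLAIN QUERY PLAN exposes the plan the engine chose (steps 1–2), and executing the query evaluates that plan and returns the result (steps 3–4). The fishtank table and its rows are invented to match the example.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fishtank (breed TEXT, qty INTEGER, tasty BOOLEAN)")
    con.execute("INSERT INTO fishtank VALUES ('salmon', 3, TRUE), ('pufferfish', 1, FALSE)")

    # Steps 1-2: the engine parses the SQL and picks a plan; we can inspect it
    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT breed, qty FROM fishtank WHERE tasty = TRUE"
    ).fetchall()
    print(plan)   # e.g. [(2, 0, 0, 'SCAN fishtank')] -- a full table scan

    # Steps 3-4: the plan is evaluated and the result returned to the user
    print(con.execute(
        "SELECT breed, qty FROM fishtank WHERE tasty = TRUE"
    ).fetchall())  # [('salmon', 3)]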
Distributed Database Fundamentals
Distributed Database → Simply put, a distributed database is a database that is spread
(“distributed”) across multiple machines.
- Important: For an end-user, interacting with a distributed database should be indistinguishable
from a non-distributed one
Why do we distribute data?
Performance: With data sizes growing exponentially, the need for fast data processing is outgrowing
individual machines
Elasticity: The database can be quickly & flexibly scaled to fit the requirements by adding (or
removing) resources
Fault-tolerance: Running on more than one node (=individual computer or server in a distributed
system) allows the system to better recover from hardware failures
Classifying distributed databases
1. Scalability: Scale-Up vs Scale-Out
Scaling-up: Move the database to a bigger box (faster CPU, more cores, more memory, faster disk,
FPGAs/GPUs)
- Typically better performance, but expensive to buy & inflexible to scale
Scaling-out: Distribute the database across multiple nodes
- Often slower (due to operational overhead), but a lot cheaper, more flexible, and more
fault-tolerant
2. Implementation: Parallel vs Distributed
Parallel database
- Runs on tightly-coupled nodes (e.g., a cluster, or a multi-processor / multi-core system)
- A single database system that runs on multiple CPUs/cores in one location
- Implementation focuses on multi-threading and inter-process communication
- Multi-threading: a technique where a single program (process) runs multiple smaller
tasks (threads) at the same time.
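A minimal sketch of what multi-threading means here: one process, several threads sharing the same memory, coordinating their writes with a lock (the counter workload is made up for illustration).

    import threading

    counter = 0
    lock = threading.Lock()

    def work(n):
        global counter
        for _ in range(n):
            with lock:    # threads share one address space, so writes must be coordinated
                counter += 1

    # One process, several threads operating on shared memory
    threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)        # 40000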