Class notes

Extensive summary of big data lecture and practical notes

Pages
119
Uploaded on
28-10-2025
Written in
2025/2026

The document consists of the course summary of the big data course. It includes the theory lecture notes as well as the practical notes; both will be tested on the exam. It is an extensive summary with examples and extra notes; much of what was said in the lectures is written down verbatim or summarized clearly.

Document information

Type
Class notes
Professor(s)
Boris Čule and Stijn Rotman
Contains
All classes


Lecture 1
Concept of big data depends on:
- What kind of data you’re dealing with
- What resources you have
- What you want to do with the data
⇒ No fixed definition, the concept changes over time
- In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies did no more than consult historical data, rather than analyse it

Storing the data:
- Recent trends:
- Storage is (relatively) cheap and easy
- Companies and governments preserve huge amounts of data
- Easier
- There is a lot more data being generated
- Customer information, historical purchases, click logs, search histories,
patient histories, financial transactions, GPS trajectories, usage logs,
images/ audio/ video, sensor data ⇒ more data being collected
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction (flood warnings), fraud detection,
predictive maintenance (sensor data, output of machine), image recognition,
COVID contact tracing
⇒ Issues:
- The quantity of data
- The speed with which you have to process the data, to produce outputs

Making data useful:
- However:
- Data analysis is computationally intensive and expensive
- Examples:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different
items, independent of the number of transactions (e.g., market basket analysis)
- Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample
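
To make the exponential growth in the two examples above concrete, here is a minimal Python sketch. It assumes the usual counting argument: n distinct items give 2^n - 1 non-empty candidate itemsets (and n labels give 2^n possible label combinations). The function names and the tiny item list are illustrative and not taken from the lecture.

```python
from itertools import combinations


def count_candidate_itemsets(n_items: int) -> int:
    """Number of non-empty itemsets over n_items distinct items."""
    return 2 ** n_items - 1


def enumerate_itemsets(items):
    """List every non-empty itemset (only feasible for small inputs)."""
    return [
        set(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Hypothetical market-basket items, chosen only for illustration.
items = ["bread", "milk", "beer"]
itemsets = enumerate_itemsets(items)
print(len(itemsets), count_candidate_itemsets(len(items)))

# The count depends only on the number of distinct items,
# not on how many transactions were recorded.
for n in (10, 20, 40):
    print(n, count_candidate_itemsets(n))
```

With only 40 distinct items there are already more than a trillion candidate itemsets, which is why brute-force enumeration stops being an option long before the data itself looks large.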

So what is big data:
- Dependent on the use case
- Data becomes big data when it becomes too large or too complex to be analyzed with traditional
data analysis software
- Analysis becomes too slow or too unreliable
- Systems become unresponsive (error messages, run out of hard disk space)
- Day-to-day business is impacted

Three aspects of big data:
- Volume:
- The actual quantity of data that is gathered (gigabytes, etc.) ⇒ how much
data do you have?
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/ transaction
- Can be an issue if there’s too much of it
- Variety:
- The different types of data that are gathered
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may come in regular time intervals, accompanying log data
are irregular
- The variety of data makes the analysis more complex and challenging
- Velocity
- The speed at which new data is coming in and the speed at which data must be handled
- The time intervals at which data comes in
- If the data comes in at a higher speed than you can handle, you have a problem
- Two aspects:
- How fast is new data coming in?
- How fast do you need to handle the new data? (how fast do you need to
produce output?)
- May result in irrecoverable bottlenecks
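
The velocity problem can be sketched with a very simple model. The snippet below assumes constant arrival and processing rates and an initially empty queue, which real systems rarely have; the function name and the example rates are made up for illustration.

```python
def backlog_after(seconds: float, arrival_rate: float, processing_rate: float) -> float:
    """Unprocessed records after `seconds`, starting from an empty queue.

    Rates are in records per second and are assumed to be constant.
    """
    return max(0.0, (arrival_rate - processing_rate) * seconds)


# Hypothetical rates: 1200 records/s arrive, 1000 records/s are handled.
one_hour = 3600
print(backlog_after(one_hour, arrival_rate=1200, processing_rate=1000))

# If processing keeps up with arrivals, no backlog builds up.
print(backlog_after(one_hour, arrival_rate=800, processing_rate=1000))
```

In this model, any arrival rate above the processing rate makes the backlog grow without bound, which is one way to read the "irrecoverable bottleneck" mentioned above.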

What can we do about it?
- Invest in hardware
- Store more data ⇒ doesn’t necessarily help with sufficiently speeding up the
computations
- Process the data faster
- Typically only (sub)linearly faster, which doesn’t help much if an algorithm has exponential complexity
- Exponential complexity: processing 100 data items takes on the order of 2^100 time units
- With twice the hardware (2 PCs instead of 1) ⇒ at best 2^99 time units ⇒ still far too many
- Linearly reducing the runtime doesn’t help if the runtime is exponential
- Design intelligent algorithms to speed up the analysis
- Specifically make use of available hardware resources
- Provide good approximate results at a fraction of the cost/ time
- Take longer to build a model that can then be used on-the-fly (recommender systems,
precomputed)
- We focus on the latter: designing intelligent algorithms
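
The arithmetic behind the hardware argument can be checked with a few lines of Python. This is a sketch under an idealised assumption: the algorithm needs exactly 2^n time units for n items, and k machines give a perfect k-fold speedup. Both function names are illustrative.

```python
def exponential_runtime(n_items: int, n_machines: int = 1) -> float:
    """Time units for a 2^n algorithm under an ideal linear speedup."""
    return 2 ** n_items / n_machines


def extra_items_affordable(n_machines: int) -> int:
    """How many more items fit in the same time budget with n_machines.

    Each doubling of hardware pays for exactly one additional item,
    so this is floor(log2(n_machines)).
    """
    return n_machines.bit_length() - 1


print(exponential_runtime(100, 1) == 2 ** 100)  # one machine
print(exponential_runtime(100, 2) == 2 ** 99)   # two machines: still enormous
print(extra_items_affordable(1024))             # 1024 machines buy 10 more items
```

Even a thousandfold increase in hardware only shifts the feasible input size by about ten items, which is why the notes turn to better algorithms instead.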

Parallel computing
Goal: leveraging the full potential of your multicore, multiprocessor, multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores of a CPU
- If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the cloud
to do some computations that take 10 hours, but then what?

Goal of parallel processing is to reduce computation time (not to simplify the problem)
- Split the problem into smaller parts and assign these smaller parts to different processors/
different machines
- Algorithms are typically designed to solve a problem in serial fashion
- To fully leverage the power of your multicore CPU you need to adapt your algorithm: split
your problem into smaller parts that can be executed in parallel
- We can’t always expect to parallelize every part of an algorithm. In some cases, however, it is almost trivial to split the entire problem into smaller parts that can run in parallel; such problems are called embarrassingly parallel
- If an algorithm is embarrassingly parallel, you can split it this way and achieve optimal runtime gains
- In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time
- E.g., task takes 4 hours, give to 4 different machines ⇒ done in 1 hour ⇒ linear
speed up
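
An embarrassingly parallel computation might look like the following Python sketch. The task (a sum of squares), the helper names and the choice of 4 workers are assumptions for illustration. Threads are used only to keep the example short and portable: in standard CPython they typically do not speed up CPU-bound pure-Python code, so a process pool or separate machines would be needed to get close to the linear speedup described above.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Independent unit of work: needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def split(data, n_parts):
    """Cut data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


data = list(range(1_000))
chunks = split(data, 4)

# Each chunk is handed to a worker; no worker waits on another.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

# The only serial step is combining the four partial results.
total = sum(partial_sums)
print(total)
```

The problem is embarrassingly parallel because the chunks share no state: the only coordination is the final, cheap combination of partial sums.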

Example: adding numbers in parallel




- Want to add 8 numbers ⇒ need 7 operations
- Parallel:
- 4 additions in the first step, 2 in the second, 1 in the third
- The 7 additions are spread over 4 parallel processors ⇒ takes 3 units of time instead of 7 ⇒ a speedup of 7/3 ≈ 2.3 with 4 processors, so not a linear speedup
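
The pairwise addition scheme can be simulated serially to count rounds and operations. The sketch below does not run anything in parallel; it only groups the additions into rounds whose pairs are independent of each other. The function name and the sample numbers are illustrative, and a non-empty input is assumed.

```python
def tree_sum(values):
    """Sum values by pairwise rounds; returns (total, rounds, additions)."""
    level = list(values)
    rounds = additions = 0
    while len(level) > 1:
        nxt = []
        # All pairs within one round are independent, so they could run
        # at the same time on separate processors.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
            additions += 1
        if len(level) % 2:  # an odd element is carried to the next round
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds, additions


total, rounds, additions = tree_sum([3, 1, 4, 1, 5, 9, 2, 6])
print(total, rounds, additions)  # 31 3 7
print(additions / rounds)        # speedup over serial addition, about 2.33
```

The number of rounds grows with log2 of the input size, so the later rounds leave most processors idle; that idle time is why the speedup stays below the number of processors.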

Parallel computation:
- Task parallelism: multiple tasks are applied on the same data in parallel
- E.g., you want to run multiple analyses that are independent of each other ⇒ parallelise this: every machine gets the same data set, but each machine performs a different task on it
- Imagine you have a book (the dataset).
- Person A counts how many words are in the book
- Person B finds the most common word.
- Data parallelism: a calculation is performed in parallel on many different data chunks
- E.g., you want to do a single task on your big data set, divide the data into chunks, and
give each machine a part of the data so that this same task can be processed on each
machine on a different part of the data
- Each machine (or core) gets just a portion of the dataset, and does the same kind of
work.
- Imagine you have a big stack of 1,000 books (the dataset).
- Person A reads books 1–250 and counts the words.
- Person B reads books 251–500 and counts the words.
