
Extensive summary of big data lecture and practical notes

Pages: 119
Uploaded on: 28-10-2025
Written in: 2025/2026

This document is a summary of the big data course. It includes both the theory lecture notes and the practical notes, both of which will be tested on the exam. It is an extensive summary with examples and extra notes; much of what was said in the lectures is written down verbatim or summarized clearly.

Document information

Type: Lecture notes
Professor(s): Boris Čule and Stijn Rotman
Contains: All lectures


Lecture 1
Concept of big data depends on:
- What kind of data you’re dealing with
- What resources you have
- What you want to do with the data
⇒ No fixed definition, the concept changes over time
- In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies did no more than consult historical data, rather than analyse it

Storing the data:
- Recent trends:
- Storage is (relatively) cheap and easy
- Companies and governments preserve huge amounts of data
- Easier
- There is a lot more data being generated
- Customer information, historical purchases, click logs, search histories,
patient histories, financial transactions, GPS trajectories, usage logs,
images/audio/video, sensor data ⇒ more data being collected
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction (flood warnings), fraud detection,
predictive maintenance (sensor data, output of machine), image recognition,
COVID contact tracing
⇒ Issues:
- The quantity of data
- The speed with which you have to process the data, to produce outputs

Making data useful:
- However:
- Data analysis is computationally intensive and expensive
- Examples:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different
items, independent of the number of transactions (e.g., market basket analysis)
- Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample
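
To make the exponential growth mentioned above concrete, here is a minimal sketch that enumerates every candidate itemset for a tiny basket. The item names and the helper `candidate_itemsets` are hypothetical and only for illustration; real frequent pattern mining algorithms prune this search space rather than enumerating it in full.

```python
from itertools import combinations

def candidate_itemsets(items):
    """Yield every non-empty subset of the given items."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield subset

basket_items = ["bread", "milk", "eggs"]  # hypothetical item names
candidates = list(candidate_itemsets(basket_items))
print(len(candidates))  # 2**3 - 1 candidate itemsets

# The count depends only on the number of distinct items,
# not on how many transactions the data set contains.
for n in (10, 20, 30):
    print(n, "items ->", 2**n - 1, "candidate itemsets")
```

The same counting argument applies to multi-label classification: with n labels there are 2^n possible label combinations for a new sample.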

So what is big data:
- Dependent on the use case
- Data becomes big data when it becomes too large or too complex to be analyzed with traditional
data analysis software
- Analysis becomes too slow or too unreliable
- Systems become unresponsive (error messages, run out of hard disk space)
- Day-to-day business is impacted

Three aspects of big data:
- Volume:
- The actual quantity of data that is gathered (gigabytes, etc.) ⇒ how much
data do you have?
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/ transaction
- Can be an issue if there’s too much of it
- Variety:
- The different types of data that are gathered
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may come in at regular time intervals, while the accompanying log data
is irregular
- The variety of data makes the analysis more complex and challenging
- Velocity:
- The speed at which new data is coming in and the speed at which data must be handled
- The time intervals at which data comes in
- If the data comes in at a higher speed than you can handle, then you have a
problem
- Two aspects:
- How fast is new data coming in?
- How fast do you need to handle the new data? (how fast do you need to
produce output?)
- May result in irrecoverable bottlenecks
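
The velocity problem can be illustrated with a small simulation. This is a sketch under simplifying assumptions (constant rates, one step per second); the function name `backlog_over_time` and the rates used are made up for illustration.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Return the number of unprocessed records at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate                # new data coming in
        backlog -= min(backlog, service_rate)  # data we manage to handle
        history.append(backlog)
    return history

# Data arrives faster than it can be handled: the backlog keeps growing.
print(backlog_over_time(arrival_rate=120, service_rate=100, seconds=5))

# Data arrives slower than it can be handled: the backlog stays empty.
print(backlog_over_time(arrival_rate=80, service_rate=100, seconds=5))
```

In the first case the backlog grows by 20 records every second and never recovers, which is the bottleneck situation described above.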

What can we do about it?
- Invest in hardware
- Store more data ⇒ doesn’t necessarily help with sufficiently speeding up the
computations
- Process the data faster
- Typically (sub)linearly faster - doesn’t help much if an algorithm has exponential
complexity
- Exponential complexity = if you have to process 100 data items, it takes on the order of
2^100 time units
- With more hardware (2 PCs instead of 1) ⇒ 2^100 / 2 = 2^99 time units ⇒ still far too many
- Linearly reducing the runtime doesn’t help if the runtime is exponential
- Design intelligent algorithms to speed up the analysis
- Specifically make use of available hardware resources
- Provide good approximate results at a fraction of the cost/time
- Spend more time up front building a model that can then be used on the fly (e.g., precomputed
recommender systems)
- We focus on the latter: designing intelligent algorithms
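
The hardware argument above can be checked with a few lines of arithmetic. The sketch below assumes an idealised algorithm that takes exactly 2^n time units for n items and splits perfectly over the available machines; both helper names are hypothetical.

```python
def runtime_units(n_items, machines=1):
    """Idealised runtime of a 2**n algorithm split perfectly over machines."""
    return 2**n_items / machines

def max_items_within(budget_units, machines):
    """Largest n whose idealised runtime still fits in the time budget."""
    n = 0
    while runtime_units(n + 1, machines) <= budget_units:
        n += 1
    return n

# Doubling the hardware only halves the runtime: 2**100 -> 2**99.
print(runtime_units(100, machines=2) == 2**99)

# With a fixed time budget, each doubling of hardware buys one extra item.
budget = 2**30
for machines in (1, 2, 1024):
    print(machines, "machines ->", max_items_within(budget, machines), "items")
```

Under these assumptions, even 1024 machines only raise the feasible problem size from 30 to 40 items, which is why the focus shifts to better algorithms.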

Parallel computing
Goal: leveraging the full potential of your multicore, multiprocessor, multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores of a CPU
- If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the cloud
to do some computations that take 10 hours, but then what?

Goal of parallel processing is to reduce computation time (not to simplify the problem)
- Split the problem into smaller parts and assign these smaller parts to different processors/
different machines
- Algorithms are typically designed to solve a problem in serial fashion
- To fully leverage the power of your multicore CPU you need to adapt your algorithm: split
your problem into smaller parts that can be executed in parallel
- We can’t always expect to parallelize every part of the algorithm; however, in some cases it is
almost trivial to split the entire problem into smaller parts that can run in parallel, i.e. the problem
is embarrassingly parallel
- If an algorithm is embarrassingly parallel, you can split it this way and achieve
optimal runtime gains
- In that case you can expect a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time
- E.g., a task takes 4 hours; split it over 4 machines ⇒ done in 1 hour ⇒ linear
speedup

Example: adding numbers in parallel




- Want to add 8 numbers ⇒ need 7 additions
- Parallel:
- 4 additions in the first step, 2 in the second, 1 in the third
- The 7 additions are spread over 4 parallel processes ⇒ takes 3 units of time instead of 7 ⇒ not a
linear speedup (7/3 ≈ 2.3 times faster on 4 processors, not 4 times)
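
The example can be sketched as a pairwise (tree) reduction. The code below runs sequentially and only counts how many rounds and additions a parallel execution would need; the function name `parallel_sum` and the input numbers are illustrative assumptions, and the input is assumed to be non-empty.

```python
def parallel_sum(values):
    """Pairwise (tree) reduction; returns (total, rounds, additions)."""
    level = list(values)
    rounds = 0
    additions = 0
    while len(level) > 1:
        next_level = []
        # Every pair in this round is independent, so with enough
        # processors all of these additions could run at the same time.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            additions += 1
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element carried to next round
        level = next_level
        rounds += 1
    return level[0], rounds, additions

total, rounds, additions = parallel_sum([3, 1, 4, 1, 5, 9, 2, 6])
print(total, rounds, additions)

# Speedup relative to doing all additions one after another.
print(additions / rounds)
```

For 8 numbers this gives 7 additions in 3 rounds, so 4 processors make the computation about 2.3 times faster rather than 4 times: in the second and third rounds some processors have nothing to do.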

Parallel computation:
- Task parallelism: multiple tasks are applied on the same data in parallel
- E.g., you want to run multiple analyses that are independent
of each other ⇒ parallelise this: every machine gets the same data set, but each
machine performs a different task on it
- Imagine you have a book (the dataset).
- Person A counts how many words are in the book.
- Person B finds the most common word.
- Data parallelism: a calculation is performed in parallel on many different data chunks
- E.g., you want to do a single task on your big data set, divide the data into chunks, and
give each machine a part of the data so that this same task can be processed on each
machine on a different part of the data
- Each machine (or core) gets just a portion of the dataset, and does the same kind of
work.
- Imagine you have a big stack of 1,000 books (the dataset).
- Person A reads books 1–250 and counts the words.
- Person B reads books 251–500 and counts the words.
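
The difference between the two kinds of parallelism can be sketched with Python's standard `concurrent.futures` module. The tiny "books" below are hypothetical stand-ins for the data set, and the helper names are illustrative. Threads are used here only to keep the sketch simple: in CPython they do not speed up CPU-bound pure-Python work, for which a `ProcessPoolExecutor` (same `submit`/`map` interface) would be the usual choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the books in the notes.
books = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
    "the end",
]

def count_words(texts):
    """Task 1: total number of words in the given texts."""
    return sum(len(t.split()) for t in texts)

def most_common_word(texts):
    """Task 2: the most frequent word in the given texts."""
    counts = Counter(word for t in texts for word in t.split())
    return counts.most_common(1)[0][0]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Task parallelism: different tasks, each on the same full data set.
    f_count = pool.submit(count_words, books)
    f_common = pool.submit(most_common_word, books)
    task_results = (f_count.result(), f_common.result())

    # Data parallelism: the same task, each worker on a different chunk.
    chunks = [books[:2], books[2:]]
    partial_counts = list(pool.map(count_words, chunks))
    data_result = sum(partial_counts)  # combine the partial results

print(task_results)
print(partial_counts, data_result)
```

Note that data parallelism needs a final combine step (here a sum of the partial counts), whereas the two tasks in the task-parallel case produce independent answers.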