Concept of big data depends on:
- What kind of data you’re dealing with
- What resources you have
- What you want to do with the data
⇒ No fixed definition, the concept changes over time
- In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies did no more than consult historical data, rather than analyse it
Storing the data:
- Recent trends:
- Storage is (relatively) cheap and easy
- Companies and governments preserve huge amounts of data
- Data is also easier to collect, and a lot more of it is being generated
- Customer information, historical purchases, click logs, search histories,
patient histories, financial transactions, GPS trajectories, usage logs,
images/ audio/ video, sensor data ⇒ more data being collected
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction (flood warnings), fraud detection,
predictive maintenance (sensor data, output of machine), image recognition,
COVID contact tracing
⇒ Issues:
- The quantity of data
- The speed at which you have to process the data in order to produce outputs
Making data useful:
- However:
- Data analysis is computationally intensive and expensive
- Examples:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different
items, independent of the number of transactions (e.g., market basket analysis)
- Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample
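As a rough illustration of why frequent pattern mining blows up, the sketch below brute-forces all candidate itemsets for a handful of made-up items; the item names and counts are purely illustrative, and real miners prune this search space rather than enumerating it exhaustively.

```python
from itertools import combinations

# Hypothetical market-basket items (made-up example data)
items = ["bread", "milk", "eggs", "butter", "cheese"]

# Brute force: every non-empty subset of the items is a candidate itemset.
# There are 2^n - 1 of them, regardless of how many transactions you have.
candidates = [
    set(combo)
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
]

print(len(candidates))   # 2^5 - 1 = 31 candidates for only 5 items
print(2 ** 20 - 1)       # with 20 distinct items: already 1,048,575 candidates
```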
So what is big data:
- Dependent on the use case
- Data becomes big data when it becomes too large or too complex to be analyzed with traditional
data analysis software
- Analysis becomes too slow or too unreliable
- Systems become unresponsive (error messages, run out of hard disk space)
- Day-to-day business is impacted
Three aspects of big data:
- Volume:
- The actual quantity of data that is gathered (gigabytes, etc.) ⇒ how much
data do you have?
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/ transaction
- Can be an issue if there’s too much of it
- Variety:
- The different types of data that are gathered
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may arrive at regular time intervals, while the accompanying log
data arrive irregularly
- The variety of data makes the analysis more complex and challenging
- Velocity:
- The speed at which new data comes in and the speed at which it must be handled
- The time intervals at which the data arrives
- If data comes in faster than you can handle it, you have a problem
- Two aspects:
- How fast is new data coming in?
- How fast do you need to handle the new data? (how fast do you need to
produce output?)
- May result in irrecoverable bottlenecks
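A back-of-the-envelope sketch of such a bottleneck, with made-up arrival and processing rates: as long as data arrives faster than it can be processed, the backlog only grows.

```python
# Assumed, illustrative rates: 1000 records arrive per second,
# but the system can only process 800 per second.
arrival_rate = 1000      # records per second coming in
processing_rate = 800    # records per second we can handle

backlog = 0
for second in range(1, 6):
    backlog += arrival_rate - processing_rate  # unprocessed records pile up
    print(f"after {second}s: backlog = {backlog} records")

# After 5 seconds the backlog is 1000 records and the gap never shrinks:
# the bottleneck is irrecoverable unless processing speeds up or data is dropped.
```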
What can we do about it?
- Invest in hardware
- Store more data ⇒ doesn’t necessarily help with sufficiently speeding up the
computations
- Process the data faster
- Typically (sub)linearly faster - doesn’t help much if an algorithm has exponential
complexity
- Exponential complexity = if you have to process 100 data items, it takes
2^100 time units
- With twice the hardware (2 PCs instead of 1) ⇒ 2^99 time units ⇒ still far too much
- Linearly reducing the runtime doesn’t help if the runtime is exponential (see the
sketch after this list)
- Design intelligent algorithms to speed up the analysis
- Specifically make use of available hardware resources
- Provide good approximate results at a fraction of the cost/time
- Spend more time up front building a model that can then be used on-the-fly (e.g.,
precomputed recommender systems)
- We focus on the latter
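A small numeric sketch of the hardware-vs-algorithm point above, with purely illustrative numbers: doubling the hardware only halves an exponential runtime, while a better algorithm changes the picture entirely.

```python
# Illustrative comparison: exponential algorithm vs. linear hardware scaling.
n = 100                      # number of data items
work = 2 ** n                # time units needed by an exponential algorithm

one_machine = work           # 2^100 time units on a single machine
two_machines = work // 2     # perfect linear speedup: still 2^99 time units

# A (hypothetical) smarter algorithm that is quadratic instead of exponential
smart_work = n ** 2          # 10,000 time units on a single machine

print(f"1 machine : {one_machine:e}")
print(f"2 machines: {two_machines:e}")
print(f"better algorithm on 1 machine: {smart_work}")
```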
Parallel computing
Goal: leveraging the full potential of your multicore, multiprocessor, or multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores of a CPU
- If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the cloud
to do some computations that take 10 hours, but then what?
Goal of parallel processing is to reduce computation time (not to simplify the problem)
- Split the problem into smaller parts and assign these smaller parts to different processors/
different machines
- Algorithms are typically designed to solve a problem in serial fashion
- To fully leverage the power of your multicore CPU you need to adapt your algorithm: split
your problem into smaller parts that can be executed in parallel
- We can’t always expect to parallelize every part of an algorithm; however, in some cases it is
almost trivial to split the entire problem into smaller parts that can run in parallel, i.e. the
problem is embarrassingly parallel
- If an algorithm is embarrassingly parallel, you can split it this way and achieve optimal
runtime gains
- In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time
- E.g., a task takes 4 hours; split it across 4 machines ⇒ done in 1 hour ⇒ linear
speedup
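A minimal sketch of an embarrassingly parallel job using Python's multiprocessing module; `expensive_task` and its inputs are made-up placeholders for independent per-item work, and with 4 workers one would expect a roughly 4x speedup.

```python
from multiprocessing import Pool

def expensive_task(x):
    # Placeholder for some CPU-heavy, independent computation on one item.
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [200_000, 300_000, 400_000, 500_000]

    # Each input is handled completely independently, so the work is
    # embarrassingly parallel: 4 worker processes, roughly 4x faster
    # than processing the inputs one by one.
    with Pool(processes=4) as pool:
        results = pool.map(expensive_task, inputs)

    print(results)
```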
Example: adding numbers in parallel
- Want to add 8 numbers ⇒ need 7 operations
- Parallel:
- 4 additions in the first step, 2 in the second, 1 in the third
- The 7 sequential additions become 3 parallel steps ⇒ takes 3 units of time
instead of 7 ⇒ a speedup of 7/3, so not a linear speedup
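The pairwise scheme above can be simulated directly; the sketch below runs serially but counts the parallel rounds, showing that 8 numbers need 3 rounds instead of 7 sequential additions (it assumes the number of values is a power of two).

```python
numbers = [3, 1, 4, 1, 5, 9, 2, 6]   # 8 numbers to add

values = numbers[:]
steps = 0
while len(values) > 1:
    # One "parallel step": all pairwise additions could run at the same time.
    values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    steps += 1
    print(f"step {steps}: {values}")

# step 1: 4 additions, step 2: 2 additions, step 3: 1 addition
# 3 parallel steps instead of 7 sequential additions -> speedup 7/3, not 8x.
print("total:", values[0], "in", steps, "parallel steps")
```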
Parallel computation:
- Task parallelism: multiple tasks are applied on the same data in parallel
- E.g., you want to run multiple analyses that are independent of each other ⇒
parallelise this: every machine gets the same data set, but each machine performs
a different task on it (a sketch follows after this list)
- Imagine you have a book (the dataset).
- Person A counts how many words are in the book
- Person B finds the most common word.
- Data parallelism: a calculation is performed in parallel on many different data chunks
- E.g., you want to perform a single task on your big data set: divide the data into
chunks and give each machine one chunk, so that the same task runs on every machine
but on a different part of the data
- Each machine (or core) gets just a portion of the dataset, and does the same kind of
work.
- Imagine you have a big stack of 1,000 books (the dataset).
- Person A reads books 1–250 and counts the words.
- Person B reads books 251–500 and counts the words.
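A minimal sketch of task parallelism, assuming the "book" is just a toy string: two different tasks run concurrently on the same data using Python's concurrent.futures.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(text):
    # Task A: how many words are in the book?
    return len(text.split())

def most_common_word(text):
    # Task B: which word occurs most often?
    return Counter(text.lower().split()).most_common(1)[0][0]

if __name__ == "__main__":
    book = "the quick brown fox jumps over the lazy dog " * 1000  # toy "book"

    # Same data, different tasks, each running in its own process.
    with ProcessPoolExecutor(max_workers=2) as executor:
        total = executor.submit(count_words, book)
        top = executor.submit(most_common_word, book)
        print(total.result(), top.result())
```

And a matching sketch of data parallelism: the same word-counting task, but each worker gets its own chunk of a toy dataset and the partial counts are combined at the end.

```python
from multiprocessing import Pool

def count_words(chunk):
    # The same task runs on every chunk of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    # Toy "stack of books": 1,000 identical lines of text.
    dataset = ["the quick brown fox jumps over the lazy dog"] * 1000

    # Split the data into 4 chunks, one per worker.
    chunk_size = len(dataset) // 4
    chunks = [dataset[i:i + chunk_size] for i in range(0, len(dataset), chunk_size)]

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)

    # Combine the partial results from each chunk.
    print(sum(partial_counts))   # 9 words per line * 1000 lines = 9000
```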