Concept of big data depends on:
- What kind of data you’re dealing with
- What resources you have
- What you want to do with the data
⇒ No fixed definition, the concept changes over time
- In the past:
- Storage was expensive
- Only the most crucial data was preserved
- Most companies did no more than consult historical data, rather than analyse it
Storing the data:
- Recent trends:
- Storage is (relatively) cheap and easy
- Companies and governments preserve huge amounts of data
- Data is also easier to collect, and a lot more of it is being generated
- Customer information, historical purchases, click logs, search histories,
patient histories, financial transactions, GPS trajectories, usage logs,
images/ audio/ video, sensor data ⇒ more data being collected
- More and more companies and governments rely on data analysis
- Recommender systems, next event prediction (flood warnings), fraud detection,
predictive maintenance (sensor data, output of machine), image recognition,
COVID contact tracing
⇒ Issues:
- The quantity of data
- The speed at which you have to process the data in order to produce outputs
Making data useful:
- However:
- Data analysis is computationally intensive and expensive
- Examples:
- Online recommender systems: require instant results
- Frequent pattern mining: time complexity exponential in the number of different
items, independent of the number of transactions (e.g., market basket analysis)
- Multi-label classification: exponential number of possible combinations of labels
to be assigned to a new sample
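As a rough illustration of why frequent pattern mining blows up, the sketch below brute-forces all candidate itemsets for a handful of made-up items; the item names and counts are purely illustrative, and real miners prune this search space rather than enumerating it exhaustively.

```python
from itertools import combinations

# Hypothetical market-basket items (made-up example data)
items = ["bread", "milk", "eggs", "butter", "cheese"]

# Brute force: every non-empty subset of the items is a candidate itemset.
# There are 2^n - 1 of them, regardless of how many transactions you have.
candidates = [
    set(combo)
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
]

print(len(candidates))   # 2^5 - 1 = 31 candidates for only 5 items
print(2 ** 20 - 1)       # with 20 distinct items: already 1,048,575 candidates
```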
So what is big data:
- Dependent on the use case
- Data becomes big data when it becomes too large or too complex to be analyzed with traditional
data analysis software
- Analysis becomes too slow or too unreliable
- Systems become unresponsive (error messages, run out of hard disk space)
- Day-to-day business is impacted
Three aspects of big data:
- Volume:
- The actual quantity of data that is gathered (gigabytes, etc.) ⇒ how much
data do you have?
- Number of events logged, number of transactions (rows in the data), number of
attributes (columns) describing each event/ transaction
- Can be an issue if there’s too much of it
- Variety:
- The different types of data that are gathered
- Some attributes may be numeric, others textual
- Structured vs unstructured data
- Irregular timing
- Sensor data may arrive at regular time intervals, while the accompanying log
data arrive irregularly
- The variety of data makes the analysis more complex and challenging
- Velocity:
- The speed at which new data comes in and the speed at which it must be handled
- The time intervals at which the data arrives
- If data comes in faster than you can handle it, you have a problem
- Two aspects:
- How fast is new data coming in?
- How fast do you need to handle the new data? (how fast do you need to
produce output?)
- May result in irrecoverable bottlenecks
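A back-of-the-envelope sketch of such a bottleneck, with made-up arrival and processing rates: as long as data arrives faster than it can be processed, the backlog only grows.

```python
# Assumed, illustrative rates: 1000 records arrive per second,
# but the system can only process 800 per second.
arrival_rate = 1000      # records per second coming in
processing_rate = 800    # records per second we can handle

backlog = 0
for second in range(1, 6):
    backlog += arrival_rate - processing_rate  # unprocessed records pile up
    print(f"after {second}s: backlog = {backlog} records")

# After 5 seconds the backlog is 1000 records and the gap never shrinks:
# the bottleneck is irrecoverable unless processing speeds up or data is dropped.
```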
What can we do about it?
- Invest in hardware
- Store more data ⇒ doesn’t necessarily help with sufficiently speeding up the
computations
- Process the data faster
- Typically (sub)linearly faster - doesn’t help much if an algorithm has exponential
complexity
- Exponential complexity = if you have to process 100 data items, it takes
2^100 time units
- With twice the hardware (2 PCs instead of 1) ⇒ 2^99 time units ⇒ still far too much
- Linearly reducing the runtime doesn’t help if the runtime is exponential (see the
sketch after this list)
- Design intelligent algorithms to speed up the analysis
- Specifically make use of available hardware resources
- Provide good approximate results at a fraction of the cost/time
- Spend more time up front building a model that can then be used on-the-fly (e.g.,
precomputed recommender systems)
- We focus on the latter
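A small numeric sketch of the hardware-vs-algorithm point above, with purely illustrative numbers: doubling the hardware only halves an exponential runtime, while a better algorithm changes the picture entirely.

```python
# Illustrative comparison: exponential algorithm vs. linear hardware scaling.
n = 100                      # number of data items
work = 2 ** n                # time units needed by an exponential algorithm

one_machine = work           # 2^100 time units on a single machine
two_machines = work // 2     # perfect linear speedup: still 2^99 time units

# A (hypothetical) smarter algorithm that is quadratic instead of exponential
smart_work = n ** 2          # 10,000 time units on a single machine

print(f"1 machine : {one_machine:e}")
print(f"2 machines: {two_machines:e}")
print(f"better algorithm on 1 machine: {smart_work}")
```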
Parallel computing
Goal: leveraging the full potential of your multicore, multiprocessor, or multicomputer system
- If you have to process large amounts of data it would be a shame not to use all n cores of a CPU
- If a single system does not suffice, how can you set up multiple computers so that they work
together to solve a problem? For instance, you can rent a cluster of 100 instances using the cloud
to do some computations that take 10 hours, but then what?
Goal of parallel processing is to reduce computation time (not to simplify the problem)
- Split the problem into smaller parts and assign these smaller parts to different processors/
different machines
- Algorithms are typically designed to solve a problem in serial fashion
- To fully leverage the power of your multicore CPU you need to adapt your algorithm: split
your problem into smaller parts that can be executed in parallel
- We can’t always expect to parallelize every part of an algorithm; however, in some cases it is
almost trivial to split the entire problem into smaller parts that can run in parallel, i.e. the
problem is embarrassingly parallel
- If an algorithm is embarrassingly parallel, you can split it this way and achieve optimal
runtime gains
- In that case you can expect to have a linear speedup, i.e. executing two tasks in parallel on two
cores should halve the running time
- E.g., a task takes 4 hours; split it across 4 machines ⇒ done in 1 hour ⇒ linear
speedup
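A minimal sketch of an embarrassingly parallel job using Python's multiprocessing module; `expensive_task` and its inputs are made-up placeholders for independent per-item work, and with 4 workers one would expect a roughly 4x speedup.

```python
from multiprocessing import Pool

def expensive_task(x):
    # Placeholder for some CPU-heavy, independent computation on one item.
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [200_000, 300_000, 400_000, 500_000]

    # Each input is handled completely independently, so the work is
    # embarrassingly parallel: 4 worker processes, roughly 4x faster
    # than processing the inputs one by one.
    with Pool(processes=4) as pool:
        results = pool.map(expensive_task, inputs)

    print(results)
```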
Example: adding numbers in parallel
- Want to add 8 numbers ⇒ need 7 operations
- Parallel:
- 4 additions in the first step, 2 in the second, 1 in the third
- The 7 sequential additions become 3 parallel steps ⇒ takes 3 units of time
instead of 7 ⇒ a speedup of 7/3, so not a linear speedup
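The pairwise scheme above can be simulated directly; the sketch below runs serially but counts the parallel rounds, showing that 8 numbers need 3 rounds instead of 7 sequential additions (it assumes the number of values is a power of two).

```python
numbers = [3, 1, 4, 1, 5, 9, 2, 6]   # 8 numbers to add

values = numbers[:]
steps = 0
while len(values) > 1:
    # One "parallel step": all pairwise additions could run at the same time.
    values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    steps += 1
    print(f"step {steps}: {values}")

# step 1: 4 additions, step 2: 2 additions, step 3: 1 addition
# 3 parallel steps instead of 7 sequential additions -> speedup 7/3, not 8x.
print("total:", values[0], "in", steps, "parallel steps")
```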
Parallel computation:
- Task parallelism: multiple tasks are applied on the same data in parallel
- E.g., you want to run multiple analyses that are independent of each other ⇒
parallelise this: every machine gets the same data set, but each machine performs
a different task on it (a sketch follows after this list)
- Imagine you have a book (the dataset).
- Person A counts how many words are in the book
- Person B finds the most common word.
- Data parallelism: a calculation is performed in parallel on many different data chunks
- E.g., you want to perform a single task on your big data set: divide the data into
chunks and give each machine one chunk, so that the same task runs on every machine
but on a different part of the data
- Each machine (or core) gets just a portion of the dataset, and does the same kind of
work.
- Imagine you have a big stack of 1,000 books (the dataset).
- Person A reads books 1–250 and counts the words.
- Person B reads books 251–500 and counts the words.
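A minimal sketch of task parallelism, assuming the "book" is just a toy string: two different tasks run concurrently on the same data using Python's concurrent.futures.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(text):
    # Task A: how many words are in the book?
    return len(text.split())

def most_common_word(text):
    # Task B: which word occurs most often?
    return Counter(text.lower().split()).most_common(1)[0][0]

if __name__ == "__main__":
    book = "the quick brown fox jumps over the lazy dog " * 1000  # toy "book"

    # Same data, different tasks, each running in its own process.
    with ProcessPoolExecutor(max_workers=2) as executor:
        total = executor.submit(count_words, book)
        top = executor.submit(most_common_word, book)
        print(total.result(), top.result())
```

And a matching sketch of data parallelism: the same word-counting task, but each worker gets its own chunk of a toy dataset and the partial counts are combined at the end.

```python
from multiprocessing import Pool

def count_words(chunk):
    # The same task runs on every chunk of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    # Toy "stack of books": 1,000 identical lines of text.
    dataset = ["the quick brown fox jumps over the lazy dog"] * 1000

    # Split the data into 4 chunks, one per worker.
    chunk_size = len(dataset) // 4
    chunks = [dataset[i:i + chunk_size] for i in range(0, len(dataset), chunk_size)]

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_words, chunks)

    # Combine the partial results from each chunk.
    print(sum(partial_counts))   # 9 words per line * 1000 lines = 9000
```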