Summary: Digitization & Big Data Analytics

Uploaded on: 25-05-2021
Pages: 51
Written in: 2020/2021
Type: Summary

This is a summary of Digitization & Big Data Analytics. It contains all relevant content for the exam.



Advancing society summary

Lecture 1
No important data

Lecture 2
History
In 1902 the Antikythera mechanism was discovered. This ancient Greek analog computer
(comparable to an astrolabe) was used to predict astronomical positions and eclipses decades
in advance, and to track the cycle of the Olympic Games.

The Turing machine was devised by Alan Turing, who is considered the father of theoretical
computer science and artificial intelligence. He influenced the development of theoretical
computer science and formalized the concepts of algorithm and computation with the Turing
machine, which is a model of a general-purpose computer: a computer that is able to
perform most common computing tasks.

The ENIAC was the first electronic general-purpose computer. It was Turing-complete and
able to solve a large class of numerical problems through programming.

Server - Client




-> The client requests content or a service from the central server, which redirects the request
to the server where that content or service is allocated; the result is then returned to the client.
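The request flow above can be sketched as a toy routing function. All server names and content here are hypothetical, and a real deployment would use network sockets rather than an in-memory dictionary:

```python
# Toy sketch of the client-server flow: the central server looks up which
# content server holds the requested item and relays the content back.
CONTENT_SERVERS = {
    "server-a": {"video.mp4": b"<video bytes>"},
    "server-b": {"page.html": b"<html>hello</html>"},
}

def central_server(request_path):
    # Find the server where the requested content is allocated...
    for name, store in CONTENT_SERVERS.items():
        if request_path in store:
            # ...and bring the content back to the client.
            return store[request_path]
    return None  # content not allocated on any server

print(central_server("page.html"))  # b'<html>hello</html>'
```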

GFS and MapReduce
GFS – Google File System
• Proprietary distributed file system developed by Google to provide efficient, reliable access
to data using large clusters of commodity hardware (easily obtainable hardware).
MapReduce
• Programming model and associated implementation for processing and generating large
datasets with a parallel, distributed algorithm on a cluster. First comes the map task, where
the data is read and processed to produce key-value pairs as intermediate output. The output
of a mapper is the input of the reducer. The reducer receives key-value pairs from multiple
map jobs and aggregates those intermediate data tuples (key-value pairs) into a
smaller set of tuples, which is the final output.
Example: Imagine your dataset as a Lego model, broken down into its pieces and
distributed to different locations, where there are also pieces of other Lego models that you
don’t care about. MapReduce provides a framework for distributed algorithms that enables
you to write code for clusters which puts together the pieces you need for your analysis, by
finding the pieces in the remote locations (Map) and bringing them back together as a Lego
model (Reduce).
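The map → shuffle → reduce pipeline described above can be sketched in plain Python on a single machine, using the classic word-count example. Real MapReduce distributes these phases across a cluster; this is only the logical structure:

```python
# Minimal single-machine MapReduce sketch (word count).
from collections import defaultdict
from itertools import chain

def map_task(document):
    # Map: read the input and emit intermediate key-value pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce: aggregate all values for one key into a final tuple.
    return key, sum(values)

documents = ["big data", "big clusters big jobs"]
intermediate = chain.from_iterable(map_task(d) for d in documents)
result = dict(reduce_task(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 1, 'clusters': 1, 'jobs': 1}
```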

Apache Hadoop
Intro
-Open source project – Solution for Big Data
• Deals with complexities of high Volume, Velocity and Variety of data
- Not a SINGLE Open Source Project
• Ecosystem of Open Source Projects
• Work together to provide set of services
• It uses MapReduce
- Transforms standard commodity hardware into a service:
• Stores Big Data reliably (PB of data)
• Enables distributed computations
• Enormous processing power, able to solve problems involving massive amounts
of data and computation.
-Large Hadoop cluster can have:
• 25 PB of data
• 4500 machines
-Story behind the name
• Doug Cutting, Chief Architect of Cloudera and one of the creators of Hadoop, named it
after the stuffed yellow elephant his 2-year-old called “Hadoop”.

Key attributes of Hadoop
-Redundant and Reliable
• No risk of data loss
- Powerful
• All machines available
- Batch Processing
• Some pieces in real-time
• Submit job get results when done
- Distributed applications
• Write and test on one machine
• Scale to the whole cluster
- Runs on commodity hardware
• No need for expensive servers

Hadoop architecture
-MapReduce
• Processing part of Hadoop
• Managing the processing tasks
• Submit the tasks to MapReduce
-HDFS
• Hadoop Distributed File System
• Stores the data
• Files and Directories
• Scales to many PB
- TaskTracker
• The MapReduce server
• Launching MapReduce tasks on the machine
- DataNode
• The HDFS Server
• Stores blocks of data
• Keeps track of data
• Provides high bandwidth
How does it work?
-Structure
• Multiple machines with Hadoop create a cluster
• Replicate Hadoop installation in multiple machines
• Scale according to needs in a linear way
• Add nodes based on specific needs
• Storage
• Processing
• Bandwidth
-Task coordinator
• Hadoop needs a task coordinator
• JobTracker tracks running jobs
• Divides jobs into tasks
• Assigns tasks to each TaskTracker
• TaskTracker reports to JobTracker
• Running
• Completed
• JobTracker is responsible for TaskTracker status
• Noticing whether it is online or not
• Assigns its tasks to another node
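The JobTracker/TaskTracker coordination above can be sketched as follows. Class and method names here are illustrative, not the real Hadoop API:

```python
# Sketch: a JobTracker divides a job into tasks, assigns them round-robin
# to TaskTrackers, and collects the statuses they report back.
class TaskTracker:
    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        self.completed.append(task)  # report back to the JobTracker
        return "completed"

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, job, n_tasks):
        # Divide the job into tasks and assign each to a TaskTracker.
        statuses = {}
        for i in range(n_tasks):
            tracker = self.trackers[i % len(self.trackers)]
            task = f"{job}-task{i}"
            statuses[task] = tracker.run(task)
        return statuses

jt = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
print(jt.submit("wordcount", 4))  # all four tasks report "completed"
```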
-Data coordinator
• Hadoop needs a data coordinator
• NameNode keeps information of data (al)location
• Talks directly to DataNodes for read and write
• For write permission and network architecture

• Data never flows through the NameNode
• Only information ABOUT the data
• NameNode is responsible for DataNodes status
• Noticing whether it is online or not
• Replicates the data on another node
-Automatic Failover
• When there is software or hardware failure
• Nodes on the cluster reassign the work of the failed node
• NameNode is responsible for DataNode status
• JobTracker is responsible for TaskTracker status
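The NameNode bookkeeping and failover behaviour above can be sketched like this. The names are illustrative and the logic is simplified (real HDFS uses heartbeats and a default replication factor of 3); the key point is that the NameNode holds only information about the data, not the data itself:

```python
# Sketch: the NameNode tracks which DataNodes hold each block, notices a
# DataNode failure, and re-replicates that node's blocks onto a live node.
class NameNode:
    def __init__(self, datanodes):
        self.datanodes = set(datanodes)
        self.block_locations = {}  # block id -> set of DataNode names

    def write(self, block, replicas=2):
        # Place `replicas` copies of the block on distinct DataNodes.
        nodes = sorted(self.datanodes)[:replicas]
        self.block_locations[block] = set(nodes)

    def datanode_failed(self, node):
        # Notice the node is offline; replicate its blocks on another node.
        self.datanodes.discard(node)
        for block, holders in self.block_locations.items():
            if node in holders:
                holders.discard(node)
                spare = next(iter(self.datanodes - holders))
                holders.add(spare)

nn = NameNode(["dn1", "dn2", "dn3"])
nn.write("block-1")                          # stored on dn1 and dn2
nn.datanode_failed("dn1")                    # copy restored on dn3
print(sorted(nn.block_locations["block-1"]))  # ['dn2', 'dn3']
```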
Characteristics
• Reliable and Robust
• Data replication on multiple DataNodes
• Tasks that fail are reassigned / redone
• Scalable
• Same code runs on 1 or 1000 machines
• Scales in a linear way
• Simple APIs available
• For Data
• For apps
• Powerful
• Process in parallel PB of data

In short:
• Hadoop is a layer between software and hardware that enables building computing clusters
on commodity hardware, based on an architecture that provides redundancy. The architecture
includes a NameNode, which is a data coordinator and “talks” to the DataNode of each
machine, which is the HDFS server. Also, it includes a JobTracker that “talks” to the
TaskTracker of each machine, which is the MapReduce server.
