Summary Digitization & Big Data Analytics

Pages: 51
Uploaded on: May 25, 2021
Written in: 2020/2021

This is a summary of Digitization & Big Data Analytics. It contains all relevant content for the exam.



Advancing society summary

Lecture 1
No important data

Lecture 2
History
In 1902 the Antikythera mechanism (an astrolabe-like device) was discovered. This is an
ancient Greek analog computer used to predict astronomical positions and eclipses decades in
advance; it was also used to track the cycle of the Olympic Games.

The Turing machine was devised by Alan Turing, who is considered the father of theoretical
computer science and artificial intelligence. He influenced the development of theoretical
computer science and formalized the concepts of algorithm and computation with the Turing
machine, a model of a general-purpose computer: a computer that is able to perform most
common computing tasks.

The ENIAC was the first electronic general-purpose computer. It was Turing-complete and
able to solve a large class of numerical problems through programming.

Server - Client

-> The client requests content or a service from the central server, which redirects the request
to the servers where the content or service is located. The result is then returned to the client.
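
The request flow above can be sketched as follows. The node names and the lookup table are purely illustrative (not a real protocol or API): a central server knows where content lives, fetches it from the right backend node, and returns it to the client.

```python
# Backend servers where the actual content is allocated (hypothetical data).
BACKEND_NODES = {
    "node-a": {"video.mp4": b"...video bytes..."},
    "node-b": {"report.pdf": b"...pdf bytes..."},
}

# The central server only knows *where* each piece of content lives.
CONTENT_LOCATION = {"video.mp4": "node-a", "report.pdf": "node-b"}

def server_handle_request(filename):
    """Central server: look up which node stores the file,
    retrieve it there, and pass it back to the client."""
    node = CONTENT_LOCATION.get(filename)
    if node is None:
        return None  # content unknown to the server
    return BACKEND_NODES[node].get(filename)

# Client side: a single request/response round trip.
content = server_handle_request("report.pdf")
print(content)  # b'...pdf bytes...'
```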

GFS and MapReduce
GFS – Google File System
• Proprietary distributed file system developed by Google to provide efficient, reliable access
to data using large clusters of commodity hardware (easily obtainable hardware).
MapReduce
• Programming model and associated implementation for processing and generating large
datasets with a parallel, distributed algorithm on a cluster. First comes the map task, where
the data is read and processed to produce key-value pairs as intermediate output. The output
of a mapper is the input of the reducer. The reducer receives the key-value pairs from multiple
map jobs. Then the reducer aggregates those intermediate data tuples (key-value pairs) into a
smaller set of tuples, which is the final output.
Example: Imagine your dataset as a Lego model, broken down into its pieces and
distributed to different locations, where there are also pieces of other Lego models that you
don't care about. MapReduce provides a framework for distributed algorithms that enables
you to write code for clusters which can put together the pieces you need for your analysis, by
finding the pieces in the remote locations (Map) and bringing them back together as a Lego
model (Reduce).
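
The classic illustration of this model is word counting. The sketch below runs the map, shuffle, and reduce phases on a single machine; on a real cluster, many mappers and reducers would run in parallel on different nodes:

```python
from collections import defaultdict

def mapper(document):
    # Map: read the data and emit intermediate key-value pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all intermediate values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce: aggregate the intermediate tuples into the final output.
    return key, sum(values)

documents = ["big data big clusters", "big data analytics"]
intermediate = [pair for doc in documents for pair in mapper(doc)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 2, 'clusters': 1, 'analytics': 1}
```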

Apache Hadoop
Intro
-Open source project – Solution for Big Data
• Deals with complexities of high Volume, Velocity and Variety of data
- Not a SINGLE Open Source Project
• Ecosystem of Open Source Projects
• Work together to provide set of services
• It uses MapReduce
- Transforms standard commodity hardware into a service:
• Stores Big Data reliably (PB of data)
• Enables distributed computations
• Enormous processing power, able to solve problems involving massive amounts
of data and computation.
-Large Hadoop cluster can have:
• 25 PB of data
• 4500 machines
-Story behind the name
• Doug Cutting, Chief Architect of Cloudera and one of the creators of Hadoop, named it
after the stuffed yellow elephant his 2-year-old called "Hadoop".

Key attributes of Hadoop
-Redundant and Reliable
• No risk of data loss
- Powerful
• All machines available
- Batch Processing
• Some pieces in real-time
• Submit job get results when done
- Distributed applications
• Write and test on one machine
• Scale to the whole cluster
- Runs on commodity hardware
• No need for expensive servers

Hadoop architecture
-MapReduce
• Processing part of Hadoop
• Managing the processing tasks
• Submit the tasks to MapReduce
-HDFS
• Hadoop Distributed File System
• Stores the data
• Files and Directories
• Scales to many PB
- TaskTracker
• The MapReduce server
• Launching MapReduce tasks on the machine
- DataNode
• The HDFS Server
• Stores blocks of data
• Keeps track of the data
• Provides high bandwidth
How does it work?
-Structure
• Multiple machines with Hadoop create a cluster
• Replicate Hadoop installation in multiple machines
• Scale according to needs in a linear way
• Add nodes based on specific needs
• Storage
• Processing
• Bandwidth
-Task coordinator
• Hadoop needs a task coordinator
• JobTracker tracks running jobs
• Divides jobs into tasks
• Assigns tasks to each TaskTracker
• TaskTracker reports to JobTracker
• Running
• Completed
• JobTracker is responsible for TaskTracker status
• Noticing whether it is online or not
• Assigns its tasks to another node
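
A toy sketch of this coordination, with made-up tracker and task names (not the real Hadoop API): the JobTracker divides a job into tasks, assigns them to TaskTrackers, and reassigns the tasks of a tracker that goes offline.

```python
def assign_tasks(tasks, trackers):
    """JobTracker: divide a job's tasks over the available TaskTrackers."""
    assignment = {t: [] for t in trackers}
    for i, task in enumerate(tasks):
        assignment[trackers[i % len(trackers)]].append(task)
    return assignment

def handle_failure(assignment, failed_tracker):
    """JobTracker notices a TaskTracker is offline and reassigns its tasks."""
    orphaned = assignment.pop(failed_tracker)
    survivors = list(assignment)
    for i, task in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(task)
    return assignment

plan = assign_tasks(["t1", "t2", "t3", "t4"], ["tracker-a", "tracker-b"])
# tracker-a: ['t1', 't3'], tracker-b: ['t2', 't4']
plan = handle_failure(plan, "tracker-b")
print(plan)  # {'tracker-a': ['t1', 't3', 't2', 't4']}
```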
-Data coordinator
• Hadoop needs a data coordinator
• NameNode keeps information of data (al)location
• Talks directly to DataNodes for read and write
• For write permission and network architecture

• Data never flows through the NameNode
• Only information ABOUT the data
• NameNode is responsible for DataNodes status
• Noticing whether it is online or not
• Replicates the data on another node
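
A minimal sketch of the NameNode's bookkeeping, with illustrative names (not the actual HDFS API): it stores only metadata about block locations, tells clients where to read, and re-replicates blocks when a DataNode fails.

```python
REPLICATION_FACTOR = 2  # assumed target number of copies per block

class NameNode:
    def __init__(self):
        self.block_locations = {}  # block id -> list of DataNode names

    def register_block(self, block_id, datanodes):
        # Only information ABOUT the data; the blocks themselves
        # live on the DataNodes and never pass through here.
        self.block_locations[block_id] = list(datanodes)

    def locate(self, block_id):
        """A client asks where a block lives, then reads it
        directly from one of the returned DataNodes."""
        return self.block_locations[block_id]

    def datanode_failed(self, dead_node, spare_node):
        """Re-replicate blocks that dropped below the replication factor."""
        for nodes in self.block_locations.values():
            if dead_node in nodes:
                nodes.remove(dead_node)
                if len(nodes) < REPLICATION_FACTOR:
                    nodes.append(spare_node)

nn = NameNode()
nn.register_block("blk_001", ["datanode-1", "datanode-2"])
nn.datanode_failed("datanode-2", spare_node="datanode-3")
print(nn.locate("blk_001"))  # ['datanode-1', 'datanode-3']
```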
-Automatic Failover
• When there is software or hardware failure
• Nodes on the cluster reassign the work of the failed node
• NameNode is responsible for DataNode status
• JobTracker is responsible for TaskTracker status
Characteristics
• Reliable and Robust
• Data replication on multiple DataNodes
• Tasks that fail are reassigned / redone
• Scalable
• Same code runs on 1 or 1000 machines
• Scales in a linear way
• Simple APIs available
• For Data
• For apps
• Powerful
• Process in parallel PB of data

In short:
• Hadoop is a layer between software and hardware that enables building computing clusters
on commodity hardware, based on an architecture that provides redundancy. The architecture
includes a NameNode, which is a data coordinator and “talks” to the DataNode of each
machine, which is the HDFS server. Also, it includes a JobTracker that “talks” to the
TaskTracker of each machine, which is the MapReduce server.
