Summary Digitization & Big Data Analytics

Pages: 51
Uploaded on: May 25, 2021
Written in: 2020/2021

This is a summary of Digitization & Big Data Analytics. It contains all relevant content for the exam.



Advancing society summary

Lecture 1
No important data

Lecture 2
History
In 1902 the Antikythera mechanism (an astrolabe-like device) was discovered. This is an
ancient Greek analog computer used to predict astronomical positions and eclipses decades in
advance; it was also used to track the cycle of the Olympic Games.

The Turing machine was devised by Alan Turing, who is considered the father of theoretical
computer science and artificial intelligence. He influenced the development of theoretical
computer science and formalized the concepts of algorithm and computation with the Turing
machine, a model of a general-purpose computer: a computer that is able to perform most
common computing tasks.

The ENIAC was the first electronic general-purpose computer. It was Turing-complete and
able to solve a large class of numerical problems through programming.

Server - Client

-> The client requests content or a service from the central server, which redirects the request
to the servers where the content or service is located. The result is then returned to the client.
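
The request flow above can be sketched as follows. The node names and the lookup table are purely illustrative (not a real protocol or API): a central server knows where content lives, fetches it from the right backend node, and returns it to the client.

```python
# Backend servers where the actual content is allocated (hypothetical data).
BACKEND_NODES = {
    "node-a": {"video.mp4": b"...video bytes..."},
    "node-b": {"report.pdf": b"...pdf bytes..."},
}

# The central server only knows *where* each piece of content lives.
CONTENT_LOCATION = {"video.mp4": "node-a", "report.pdf": "node-b"}

def server_handle_request(filename):
    """Central server: look up which node stores the file,
    retrieve it there, and pass it back to the client."""
    node = CONTENT_LOCATION.get(filename)
    if node is None:
        return None  # content unknown to the server
    return BACKEND_NODES[node].get(filename)

# Client side: a single request/response round trip.
content = server_handle_request("report.pdf")
print(content)  # b'...pdf bytes...'
```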

GFS and MapReduce
GFS – Google File System
• Proprietary distributed file system developed by Google to provide efficient, reliable access
to data using large clusters of commodity hardware (easily obtainable hardware).
MapReduce
• Programming model and associated implementation for processing and generating large
datasets with a parallel, distributed algorithm on a cluster. First comes the map task, where
the data is read and processed to produce key-value pairs as intermediate output. The output
of a mapper is the input of the reducer. The reducer receives the key-value pairs from multiple
map jobs. Then the reducer aggregates those intermediate data tuples (key-value pairs) into a
smaller set of tuples, which is the final output.
Example: Imagine your dataset as a Lego model, broken down into its pieces and
distributed to different locations, where there are also pieces of other Lego models that you
don't care about. MapReduce provides a framework for distributed algorithms that enables
you to write code for clusters which can put together the pieces you need for your analysis, by
finding the pieces in the remote locations (Map) and bringing them back together as a Lego
model (Reduce).
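
The classic illustration of this model is word counting. The sketch below runs the map, shuffle, and reduce phases on a single machine; on a real cluster, many mappers and reducers would run in parallel on different nodes:

```python
from collections import defaultdict

def mapper(document):
    # Map: read the data and emit intermediate key-value pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all intermediate values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce: aggregate the intermediate tuples into the final output.
    return key, sum(values)

documents = ["big data big clusters", "big data analytics"]
intermediate = [pair for doc in documents for pair in mapper(doc)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 2, 'clusters': 1, 'analytics': 1}
```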

Apache Hadoop
Intro
-Open source project – Solution for Big Data
• Deals with complexities of high Volume, Velocity and Variety of data
- Not a SINGLE Open Source Project
• Ecosystem of Open Source Projects
• Work together to provide set of services
• It uses MapReduce
- Transforms standard commodity hardware into a service:
• Stores Big Data reliably (PB of data)
• Enables distributed computations
• Enormous processing power, able to solve problems involving massive amounts
of data and computation.
-Large Hadoop cluster can have:
• 25 PB of data
• 4500 machines
-Story behind the name
• Doug Cutting, Chief Architect of Cloudera and one of the creators of Hadoop, named it
after the stuffed yellow elephant his 2-year-old called "Hadoop".

Key attributes of Hadoop
-Redundant and Reliable
• No risk of data loss
- Powerful
• All machines available
- Batch Processing
• Some pieces in real-time
• Submit job get results when done
- Distributed applications
• Write and test on one machine
• Scale to the whole cluster
- Runs on commodity hardware
• No need for expensive servers

Hadoop architecture
-MapReduce
• Processing part of Hadoop
• Managing the processing tasks
• Submit the tasks to MapReduce
-HDFS
• Hadoop Distributed File System
• Stores the data
• Files and Directories
• Scales to many PB
- TaskTracker
• The MapReduce server
• Launching MapReduce tasks on the machine
- DataNode
• The HDFS Server
• Stores blocks of data
• Keeps track of the data
• Provides high bandwidth
How does it work?
-Structure
• Multiple machines with Hadoop create a cluster
• Replicate Hadoop installation in multiple machines
• Scale according to needs in a linear way
• Add nodes based on specific needs
• Storage
• Processing
• Bandwidth
-Task coordinator
• Hadoop needs a task coordinator
• JobTracker tracks running jobs
• Divides jobs into tasks
• Assigns tasks to each TaskTracker
• TaskTracker reports to JobTracker
• Running
• Completed
• JobTracker is responsible for TaskTracker status
• Noticing whether it is online or not
• Assigns its tasks to another node
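
A toy sketch of this coordination, with made-up tracker and task names (not the real Hadoop API): the JobTracker divides a job into tasks, assigns them to TaskTrackers, and reassigns the tasks of a tracker that goes offline.

```python
def assign_tasks(tasks, trackers):
    """JobTracker: divide a job's tasks over the available TaskTrackers."""
    assignment = {t: [] for t in trackers}
    for i, task in enumerate(tasks):
        assignment[trackers[i % len(trackers)]].append(task)
    return assignment

def handle_failure(assignment, failed_tracker):
    """JobTracker notices a TaskTracker is offline and reassigns its tasks."""
    orphaned = assignment.pop(failed_tracker)
    survivors = list(assignment)
    for i, task in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(task)
    return assignment

plan = assign_tasks(["t1", "t2", "t3", "t4"], ["tracker-a", "tracker-b"])
# tracker-a: ['t1', 't3'], tracker-b: ['t2', 't4']
plan = handle_failure(plan, "tracker-b")
print(plan)  # {'tracker-a': ['t1', 't3', 't2', 't4']}
```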
-Data coordinator
• Hadoop needs a data coordinator
• NameNode keeps information of data (al)location
• Talks directly to DataNodes for read and write
• For write permission and network architecture

• Data never flows through the NameNode
• Only information ABOUT the data
• NameNode is responsible for DataNodes status
• Noticing whether it is online or not
• Replicates the data on another node
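
A minimal sketch of the NameNode's bookkeeping, with illustrative names (not the actual HDFS API): it stores only metadata about block locations, tells clients where to read, and re-replicates blocks when a DataNode fails.

```python
REPLICATION_FACTOR = 2  # assumed target number of copies per block

class NameNode:
    def __init__(self):
        self.block_locations = {}  # block id -> list of DataNode names

    def register_block(self, block_id, datanodes):
        # Only information ABOUT the data; the blocks themselves
        # live on the DataNodes and never pass through here.
        self.block_locations[block_id] = list(datanodes)

    def locate(self, block_id):
        """A client asks where a block lives, then reads it
        directly from one of the returned DataNodes."""
        return self.block_locations[block_id]

    def datanode_failed(self, dead_node, spare_node):
        """Re-replicate blocks that dropped below the replication factor."""
        for nodes in self.block_locations.values():
            if dead_node in nodes:
                nodes.remove(dead_node)
                if len(nodes) < REPLICATION_FACTOR:
                    nodes.append(spare_node)

nn = NameNode()
nn.register_block("blk_001", ["datanode-1", "datanode-2"])
nn.datanode_failed("datanode-2", spare_node="datanode-3")
print(nn.locate("blk_001"))  # ['datanode-1', 'datanode-3']
```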
-Automatic Failover
• When there is software or hardware failure
• Nodes on the cluster reassign the work of the failed node
• NameNode is responsible for DataNode status
• JobTracker is responsible for TaskTracker status
Characteristics
• Reliable and Robust
• Data replication on multiple DataNodes
• Tasks that fail are reassigned / redone
• Scalable
• Same code runs on 1 or 1000 machines
• Scales in a linear way
• Simple APIs available
• For Data
• For apps
• Powerful
• Process in parallel PB of data

In short:
• Hadoop is a layer between software and hardware that enables building computing clusters
on commodity hardware, based on an architecture that provides redundancy. The architecture
includes a NameNode, which is a data coordinator and “talks” to the DataNode of each
machine, which is the HDFS server. Also, it includes a JobTracker that “talks” to the
TaskTracker of each machine, which is the MapReduce server.
