parallel computing - ANS-simultaneous execution of the same task on multiple processors to obtain
results faster.
parallel computing in 1994 - ANS-weird + expensive parallel machines were built; they tried to provide a
shared memory addr space across multiple machines by having a number of diff threads access the same
memory addr space; hard to achieve, slow, hard to program
Donald Becker, Thomas Sterling - ANS-networked many low-cost machines to allow parallel computing +
give the combined performance of a more expensive supercomputer; Beowulf 1
iPhone 5s vs Beowulf 1 - ANS-iPhone 5S is 1000x more powerful than Beowulf 1
pros of Beowulf - ANS-low cost, okay performance
cons of Beowulf - ANS-hard to write single program that uses multiple machines
2 styles of programming emerged for Beowulf - ANS-MPI (message passing interface) + PVM (parallel
virtual machine)
MPI (message passing interface) - ANS-copy + run the program on each node in the network; each copy
explicitly sends info to other named processes
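A minimal sketch of this style, assuming the mpi4py Python bindings (mpi4py, the rank numbers, and the message contents are illustrative, not part of the notes): every node runs the same program, learns its rank, and explicitly sends data to another named rank.

# run as, e.g.: mpirun -n 2 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD          # all running copies of this program
rank = comm.Get_rank()         # this copy's ID: 0, 1, ...

if rank == 0:
    # rank 0 explicitly sends info to the named process, rank 1
    comm.send({"msg": "hello from rank 0"}, dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received:", data)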
cons of MPI - ANS-low level, like assembly language; difficult to do
PVM (parallel virtual machine) - ANS-attempt to provide illusion of 1 hunk of memory across all nodes;
easier programming style than MPI
cons of PVM - ANS-speed, software layer giving illusion of contiguous memory was very slow
Beowulf/MPI/PVM improved by: - ANS-better networking; queuing systems developed
queuing systems - ANS-submit jobs w. requests for resources to frontend, which would control which
nodes run which jobs
whiteboxes, MPI, queuing sys exist now - ANS-make up the supercomputers in the world, e.g. the Rivanna
cluster at UVA
Google and others needed to process large amounts of text data to build a search engine - ANS-1: single
machine, single disk (easy to program, insanely long time to process)
2: buy a very large parallel machine (faster than 1, but expensive, programs aren't portable, machine breaks
down)
3: MPI + NFS (cheaper than a parallel machine, but if 1 box dies there's no recovery)
4: divide the data, process it in parallel on multiple machines; combine the data at the end
NFS (network file system) - ANS-provides illusion of single file system across multiple whiteboxes
GFS (Google File System) - ANS-another FS layered on top of the Linux FS
GFS design assumptions - ANS-sys is built from many cheap components that fail -- have to deal w.
component failure
sys stores a modest # of large files (a few million ~100 MB files); don't need to optimize for small files
must efficiently implement concurrent, atomic appends
high sustained bandwidth is more important than low latency
Linux FS is not designed to - ANS-achieve high sustained bandwidth at the cost of low latency; Linux
prefers low latency
GFS workload is primarily - ANS-large streaming reads; small random reads; many large sequential
appends
Google File System stores files via - ANS-1 master, many chunkservers
chunkserver stores - ANS-chunks the file is divided into; multiple servers serve up chunks of file to get
whole file (64 MB chunks)
master stores - ANS-metadata of the file
pros of larger chunk size - ANS-get file back faster because you're asking fewer chunkservers for it
cons of larger chunk size - ANS-more data you have to write over
each chunk is - ANS-IDed by globally unique immutable chunk handle
replicated 3x by default for fault tolerance
Google File System reads - ANS-master directs the client to 1 of the 3 copies (whichever is physically closest
to the client, to give lower latency)
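A toy, in-memory sketch of that read path (all names below are illustrative stand-ins, not real GFS API): the client turns a byte offset into a chunk index using the 64 MB chunk size, looks up the chunk's handle + 3 replicas in the master's metadata, and would then read from the closest chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, per the notes

def chunk_index(offset):
    # which chunk of the file holds this byte offset
    return offset // CHUNK_SIZE

# master's metadata: (filename, chunk index) -> chunk handle + 3 replica locations
master_metadata = {
    ("crawl.txt", 0): {"handle": "0xA1", "replicas": ["cs1", "cs7", "cs9"]},
}

def read_plan(filename, offset):
    # ask the "master" which chunkservers hold the chunk; the real master directs
    # the client to the physically closest replica -- here we just take the first
    meta = master_metadata[(filename, chunk_index(offset))]
    return meta["handle"], meta["replicas"][0]

print(read_plan("crawl.txt", 10 * 1024 * 1024))   # -> ('0xA1', 'cs1')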
Google File System writes - ANS-have to store 3 copies of the data before you can tell the client the file is
stored
problem of big data - ANS-we have larger data; w. an ordinary laptop/server, HDD speed isn't fast enough;
the limiting factor on the compute platform is pulling info off the hard drive; the CPU has no problem
keeping up
How long would count.py take to run on a file size of 1 TB? (hard drive 150 MB/s) - ANS-~2 hours
How long would count.py take to run on a file size of 1 GB? (hard drive 150 MB/s) - ANS-1000 MB / 150
MB/s ≈ 7 sec
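The arithmetic behind those two answers (pure back-of-the-envelope math, using the 150 MB/s hard drive speed from the notes):

HDD_SPEED_MB_S = 150                 # hard drive read speed from the notes

def read_time_seconds(file_size_mb):
    # time just to pull the file off disk
    return file_size_mb / HDD_SPEED_MB_S

print(read_time_seconds(1_000))              # 1 GB -> ~6.7 s, i.e. ~7 sec
print(read_time_seconds(1_000_000) / 3600)   # 1 TB -> ~6667 s, i.e. ~1.9 h (~2 hours)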
how can we do big data analysis faster? - ANS-option 1: split TB of data to run in parallel on 1000
machines + combine results by routing all data to a single machine
option 2: map reduce
cons of option 1 - ANS-depends on network speed; overhead from combining results -- speed depends
on the combiner algorithm, worst case could still be the same amt of time
map reduce - ANS-a programming model that allows massive scaling of data processing, often using many
(hundreds or thousands of) servers in Hadoop clusters; has 2 phases: map + reduce
map phase - ANS-data is converted into tuples; the purpose is to get data off disk into main mem as simply
+ quickly as possible
word count mapper - ANS-for each word in file: output (word, 1)
reduce phase - ANS-phase 1 machines route output to phase 2 machines; takes sorted list of tuples
output by map phase + combines/analyzes data
word count reducer - ANS-for each word in sorted list: if word is the same as previous, count++; else
output (word, count)
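A minimal local Python sketch of that word count mapper + reducer (the tiny in-memory demo at the end is illustrative; a real run spreads the work across many machines):

# mapper: for each word in the input, output (word, 1)
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# reducer: walk a *sorted* list of (word, 1) tuples; sum runs of the same word
def reducer(sorted_pairs):
    prev, count = None, 0
    for word, n in sorted_pairs:
        if word == prev:
            count += n
        else:
            if prev is not None:
                yield (prev, count)
            prev, count = word, n
    if prev is not None:
        yield (prev, count)

pairs = sorted(mapper(["the cat sat", "the cat ran"]))
print(list(reducer(pairs)))   # [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]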
map/reduce routing algorithm - ANS-some data is more common than others, so one machine gets hit
much more than others; routing algorithm distributes map data evenly among phase 2 machines so they
all complete around the same time
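One common way to implement that routing is hash partitioning, sketched below (the hash-based scheme and the reducer count are assumptions about a standard approach, not necessarily the exact algorithm from the notes): every mapper hashes a word to pick a phase-2 machine, so all copies of the same word land on the same reducer while different words spread across machines.

import hashlib

NUM_REDUCERS = 4   # illustrative number of phase-2 machines

def reducer_for(word):
    # stable hash (not Python's randomized built-in hash), so every mapper agrees
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCERS

for w in ["the", "cat", "sat", "the"]:
    print(w, "-> reducer", reducer_for(w))

For heavily skewed data (one extremely common word), real systems add extra balancing on top of this basic routing.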
hadoop - ANS-software sys created at Yahoo (Doug Cutting) attempting to recreate Google's MapReduce
method
HDFS (Hadoop Distributed File System) - ANS-version of GFS implemented by Hadoop
map reduce strategy originated - ANS-at Google, to help them calculate PageRank for their search engine
Spark was created to - ANS-overcome fundamental performance issues of hadoop
pros of map/reduce - ANS-speed (shorter/faster); phase 2 is 4x faster than alternative