The 4 V's of Data - ANS-Volume (amount of data)
Variety
Velocity (real time data)
Veracity (noise, missing data, errors)
Predictive Modeling Pipeline - ANS-1. Prediction Target
2. Cohort Construction
3. Feature Construction
4. Feature Selection
5. Predictive Model
6. Performance Evaluation
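The six steps above can be sketched end-to-end on toy data. Everything below (the patient records, the "keep features seen in at least two patients" selection rule, and the threshold model) is illustrative, not from the course:

```python
# Minimal sketch of the predictive modeling pipeline on toy data.
# Records, selection rule, and model are illustrative assumptions.

# 1. Prediction target: will the patient develop heart failure (label 1)?
records = [
    {"events": ["hypertension", "diabetes", "hypertension"], "label": 1},
    {"events": ["checkup"], "label": 0},
    {"events": ["hypertension", "smoking"], "label": 1},
    {"events": ["checkup", "flu"], "label": 0},
]

# 2. Cohort construction: keep patients with at least one recorded event.
cohort = [r for r in records if r["events"]]

# 3. Feature construction: event counts per patient.
vocab = sorted({e for r in cohort for e in r["events"]})
X = [[r["events"].count(e) for e in vocab] for r in cohort]
y = [r["label"] for r in cohort]

# 4. Feature selection: keep events seen in at least two patients.
keep = [j for j in range(len(vocab)) if sum(row[j] > 0 for row in X) >= 2]
Xs = [[row[j] for j in keep] for row in X]

# 5. Predictive model: trivial rule -- predict 1 if selected counts exceed 1.
preds = [1 if sum(row) > 1 else 0 for row in Xs]

# 6. Performance evaluation: accuracy on the training cohort.
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(accuracy)
```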
New cases of heart failure that occur each year in the US - ANS-550,000
Prospective vs Retrospective Studies - ANS-Prospective: Identify cohort -> collect data
Retrospective: Collect data -> identify cohort
Case patients - ANS-have the condition you're trying to predict
Mapreduce - ANS-It is:
- a programming model where the developer can specify parallel computation algorithms
- an execution environment (Hadoop is the Java implementation of MapReduce and HDFS)
- a software package
It provides:
- Distributed storage
- Distributed computation
- Fault tolerance
Mapreduce system - ANS-has 2 components - mappers, and reducers
all the data will be partitioned and processed by multiple mappers (each mapper can also
pre-aggregate its data)
shuffle stage - mapper results are sent to the reducers
the reducers process the intermediate (mapper) results (ex. one reducer for heart disease,
another for cancer, etc.)
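The mapper -> shuffle -> reducer flow above can be simulated in a few lines of plain Python (a stand-in for a real Hadoop job, counting diagnoses per disease; the data and key names are illustrative):

```python
# Toy simulation of mapper -> shuffle -> reducer, counting diagnoses.
from collections import defaultdict

def mapper(partition):
    # Pre-aggregate locally: count each disease in this partition.
    local = defaultdict(int)
    for disease in partition:
        local[disease] += 1
    return local.items()  # emit (key, value) pairs

def shuffle(mapper_outputs):
    # Group intermediate pairs by key so each reducer gets one key's values.
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reducer(key, values):
    # e.g. one reducer handles "heart disease", another "cancer".
    return key, sum(values)

partitions = [
    ["heart disease", "cancer", "heart disease"],
    ["cancer", "heart disease"],
]
grouped = shuffle(mapper(p) for p in partitions)
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # {'heart disease': 3, 'cancer': 2}
```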
Mapreduce fault recovery - ANS-if mapper 2 fails during execution of the MapReduce program,
the MapReduce system will restart mapper 2 (possibly on another machine) and rerun the same
workload so the job still completes
(the same recovery process applies to reducers)
Mapreduce KNN - ANS-Map()
Input:
- all points
- query point p
Output:
- k nearest neighbors
Emit the k closest points to p
Reduce() - goes through all the local nearest neighbors to identify the global nearest neighbors
to p
Input:
- key: null
- values: local neighbors
- query point p
Output:
- k nearest neighbors
Emit the k closest points to p among all local neighbors
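A minimal sketch of this two-phase KNN, using 1-D points and Euclidean distance for brevity (the partitioning and query values are illustrative):

```python
# MapReduce-style KNN: each mapper emits its local k nearest points to the
# query p; a single reducer (key: null) merges them into the global k nearest.
import heapq

def knn_map(points, p, k):
    # Emit the k closest points to p within this mapper's partition.
    return heapq.nsmallest(k, points, key=lambda x: abs(x - p))

def knn_reduce(local_neighbors, p, k):
    # Merge all local neighbor lists into the global k nearest to p.
    merged = [x for local in local_neighbors for x in local]
    return heapq.nsmallest(k, merged, key=lambda x: abs(x - p))

partitions = [[1.0, 5.0, 9.0], [2.0, 7.0], [4.0, 10.0]]
p, k = 3.0, 2
local_results = [knn_map(part, p, k) for part in partitions]
print(knn_reduce(local_results, p, k))  # [2.0, 4.0]
```

Note the reducer only scans the mappers' local top-k lists (at most k per partition), never the full dataset.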
Mapreduce linear regression - ANS-see notes
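The referenced notes are not reproduced here, but a common MapReduce formulation of (simple 1-D) linear regression works like this: each mapper computes partial sufficient statistics over its partition, and a single reducer sums them and solves the normal equations in closed form. The data below is illustrative:

```python
# MapReduce-style simple linear regression via sufficient statistics.

def lr_map(partition):
    # Emit partial sums over this partition: (n, sum_x, sum_y, sum_xy, sum_xx)
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxy = sum(x * y for x, y in partition)
    sxx = sum(x * x for x, _ in partition)
    return (n, sx, sy, sxy, sxx)

def lr_reduce(stats):
    # Sum the partial statistics, then solve the normal equations.
    n, sx, sy, sxy, sxx = (sum(t) for t in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Data generated by y = 2x + 1, split across two mappers.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
slope, intercept = lr_reduce([lr_map(p) for p in partitions])
print(slope, intercept)  # 2.0 1.0
```

This fits MapReduce well because one pass suffices: the sums are associative, so mappers can pre-aggregate and the reducer just adds them up.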
Limitations of MapReduce - ANS-MapReduce is not optimized for iterative or multi-stage
computation
- Logistic regression is hard to implement efficiently
- Iterative batch gradient descent is inefficient in MapReduce because each iteration is a
separate job that re-reads the data
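To see why iteration is awkward, note that each gradient descent step is a full map/reduce pass, so t iterations mean t jobs that each re-read the data from disk in real MapReduce. A toy 1-D least-squares fit (learning rate and data are illustrative):

```python
# Each loop iteration below corresponds to one full MapReduce job.
# Toy least-squares fit of y = w * x.

def grad_map(partition, w):
    # Partial gradient of squared error over this mapper's partition.
    return sum(2 * (w * x - y) * x for x, y in partition)

def grad_reduce(partial_grads):
    # Sum the partial gradients into the global gradient.
    return sum(partial_grads)

partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for step in range(100):            # one MapReduce job per iteration
    g = grad_reduce(grad_map(p, w) for p in partitions)
    w -= 0.01 * g                  # driver updates w, then launches next job
print(round(w, 3))  # converges to the true slope, 2.0
```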
MapReduce optimal setup - ANS-Single Pass (ex. computing histograms)
Uniformly-distributed keys (if the key distribution is skewed, then one reducer has to do almost all the work)
No synchronization needed (the only synchronization MapReduce has is between map and
reduce phase)