
Summary 2020 Data Science & Society Final Exam Preparation

Pages: 19
Uploaded on: 29-10-2020
Written in: 2020/2021

Data Science & Society summary. For this summary I used the lecture materials from Period 1 2020 and the books by Igual & Segui (2017) and Hutter et al. (2019).


Document information

Summarized whole book? No
Which chapters are summarized? 1-11
Type: Summary

Data Science & Society Summary Final Exam

Data Science Purposes:
Probing Reality, Pattern Discovery, Predicting Future Events, Understanding People and the World

CRISP-DM Model steps: 1-2(-1)-2-3-4(-3)-4-5(-1)-6, where 1 = Business Understanding, 2 = Data Understanding, 3 = Data Preparation, 4 = Modeling, 5 = Evaluation, 6 = Deployment; a number in parentheses marks a possible step back to that phase.
Mapping: Phases → Generic Tasks (CRISP process model) → Specialized Tasks → Process Instances (CRISP process)

Failure Indicators of Dashboard Design:
o Too Flat (no support to explore visually highlighted problems), Too Manual (no automatic collection + delivery of information), Too Isolated (no view of the whole system conf., or tunnel vision)
o Three layers of indicators for dashboards: Monitoring, Analysing, Drill Down
Hadoop:
Reasons for using Hadoop: moving computation to the data (schema-on-read style), scalability, reliability
MapReduce layer: JobTracker + TaskTracker (master node), TaskTracker (slave node)
HDFS layer: NameNode + DataNode (master node), DataNode (slave node)
HDFS takeaways: master-slave architecture, a cluster has a single NameNode, user data never flows through the NameNode
o Fault-tolerant (without RAID, with commodity hardware)
o Between NameNode + DataNode: heartbeats, replication, balancing
MapReduce:
o Brings compute to the data, in contrast to traditional parallelism
o Stores replicated + distributed data in HDFS (in chunks, stored on several compute nodes)
o Ideal for operations on large flat datasets
o Mapper: transforms the input into key-value pairs; multiple key-value pairs may occur
o Reducer: transforms all key-value pairs with a common key into a single key with a single value
YARN: enhances the power of a Hadoop cluster:
o Scalability with multi-tenancy, cluster utilization + MapReduce compatibility + support for other workloads (graph processing, iterative modelling)
o Splits up resource management + job scheduling into 2 separate units → one global ResourceManager + one ApplicationMaster per application
Motivation MapReduce:
o Ever-growing data, processing requires more processing power, access + transport of lots of data
o Data needs updating → use an RDBMS; need to skim through the data → take the computation to the data
MapReduce in Hadoop:
o User defines a map function + Hadoop replicates the map to the data (key-value output)
o Hadoop shuffles + groups the key-value data; user defines a reduce function + Hadoop distributes the groups to the reducers
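
The map, shuffle/group and reduce steps above can be sketched in plain Python. This is only a single-process illustration of the idea, not Hadoop's API; the function names (map_words, shuffle, reduce_counts) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit one (word, 1) pair per word; the same key may occur many times."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/group step (Hadoop does this on a real cluster): collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reducer: collapse all values of one key into a single (key, value) pair."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could live in a different HDFS chunk.
lines = ["big data big clusters", "data moves to compute", "compute moves to data"]

mapped = [pair for line in lines for pair in map_words(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_counts(key, values) for key, values in grouped.items())
```

Because the reducer here is a plain sum, the map output could also be pre-aggregated per line before the shuffle, which is the "aggregate map output when possible" consideration.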
MR Design Considerations:
o Composite keys, extra info in values, cascade MapReduce jobs, aggregate map output when possible
o Limitations of MR: data must fit the key-value model, MR data is not persistent, requires programming, not interactive
NoSQL Features:
o Horizontal scalability, replication over many servers, simple call-level interface, weaker concurrency model, efficient use of indexes, ability to dynamically add new data records
Arguments for SQL:
o Can do everything a NoSQL system can
o Majority market share with SQL
o Built to handle other application loads
o Common interface
Arguments for NoSQL:
o No benchmarks that show scaling is achieved
o Easier to understand
o Flexible schema
o Too easy in SQL to run multi-node operations
o Need for partitioning capabilities


Population: a collection of objects or items ("units") about which information is sought
Sample: a part of the population that is observed
Data Preparation steps: obtaining the data - parsing the data - cleaning the data - building data structures
Population Mean: An abstract concept that does not elaborate further
Average: Not strictly defined

Mean of a Sample: sum of the values divided by the count
Variance: spread of the data (mean squared deviation from the mean)
Standard deviation: square root of the variance

For a small number of samples the standard deviation is biased; solution: divide by n - 1 instead of n
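
A small sketch of these definitions, using made-up values. It computes the mean, the variance with divisor n, and the corrected variance with divisor n - 1; the variable names are illustrative only.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
n = len(values)

mean = sum(values) / n
squared_devs = sum((x - mean) ** 2 for x in values)

pop_var = squared_devs / n           # divisor n: biased for small samples
sample_var = squared_devs / (n - 1)  # divisor n - 1: the usual correction

pop_std = math.sqrt(pop_var)
sample_std = math.sqrt(sample_var)
```

The standard library exposes the same two variants as statistics.pstdev (divisor n) and statistics.stdev (divisor n - 1).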
Sample Median: robust against outliers; the values are ordered by magnitude and the middle of the ordered list is taken
Anscombe's Quartet: descriptive statistics can be the same while the plots are very different
Histogram: shows the frequency of values
PMF: normalization of the histogram by dividing by the number of samples
CDF: describes the probability that a real-valued random variable X is less than or equal to x
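
The relation between histogram, PMF and CDF can be shown on a tiny made-up sample. This is a minimal sketch; the data and the helper name cdf are assumptions for illustration.

```python
from collections import Counter

samples = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # hypothetical observations
n = len(samples)

# Histogram: how often each value occurs.
histogram = Counter(samples)

# PMF: histogram counts divided by the number of samples.
pmf = {value: count / n for value, count in sorted(histogram.items())}

def cdf(x):
    """Empirical CDF: fraction of samples that are less than or equal to x."""
    return sum(1 for s in samples if s <= x) / n
```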
Skewness: negative = skews left (longer left tail), positive = skews right (longer right tail); alternative: Pearson's median skewness coefficient
Exponential distribution: λ defines the shape of the distribution; mean 1/λ, variance 1/λ^2, median ln(2)/λ
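
The exponential formulas can be checked numerically. The sketch below assumes a rate λ = 0.5 (an arbitrary choice) and compares the closed-form mean and median with a seeded simulation; it is a sanity check, not a proof.

```python
import math
import random
import statistics

lam = 0.5  # hypothetical rate parameter

mean = 1 / lam
variance = 1 / lam ** 2
median = math.log(2) / lam

def exp_cdf(x, rate):
    """CDF of the exponential distribution: P(X <= x) = 1 - exp(-rate * x)."""
    return 1.0 - math.exp(-rate * x)

# Simulation with a fixed seed so the run is reproducible.
rng = random.Random(0)
draws = [rng.expovariate(lam) for _ in range(100_000)]
sim_mean = statistics.fmean(draws)
sim_median = statistics.median(draws)
```

By definition half of the probability mass lies below the median, so exp_cdf(median, lam) evaluates to 0.5.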

Standard Score: normalization of the data: z = (x - μ)/σ
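
As a short illustration with made-up values, the standard score subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1. The population standard deviation is used here as an assumption.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data

mu = statistics.fmean(values)
sigma = statistics.pstdev(values)  # population standard deviation (divisor n)

# Standard score of every value.
z_scores = [(x - mu) / sigma for x in values]
```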

Covariance: measures whether two variables share the same tendency; the covariance value itself is hard to interpret

Pearson's Correlation: normalization of the covariance with respect to the standard deviations: ρ = Cov(X, Y)/(σ_X σ_Y)
Spearman's Rank Correlation: addresses the robustness problem when the data contain outliers; it uses the differences between the ranks of the values in the two sets
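
The effect of an outlier on both correlation measures can be sketched as follows. The data are made up, the helper names are illustrative, and the rank function assumes there are no ties. Spearman is computed here as the Pearson correlation of the ranks, which for data without ties agrees with the formula based on rank differences.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def ranks(xs):
    """Rank 1 for the smallest value; assumes no tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Pearson correlation applied to the ranks instead of the raw values."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 7, 9, 100]  # monotonically increasing, last value is an outlier

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because y increases whenever x does, the ranks agree perfectly and the rank correlation is 1, while the outlier pulls the Pearson coefficient well below 1.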
Frequentist Approach: assumes there is a population that can be represented by several parameters; the parameters are fixed but not directly visible. The way to estimate them is to take a sample.
Bayesian Approach: assumes the data are fixed rather than the result of a repeatable sampling process, but the parameters describing the data can be described probabilistically. Bayesian approaches focus on producing parameter distributions that represent all the knowledge that can be extracted.
Problem when varying from one sample to another: the estimate will not be equal to the parameter of interest.

Solution: compute the standard error, i.e. the standard deviation of the mean: σx̄ = σ/√n
Computationally intensive alternative: Bootstrapping: draw n observations with replacement, calculate the mean of this resample, and repeat.
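
Both routes to the standard error of the mean can be compared in a short sketch. The sample is simulated with a fixed seed, and the sample size (200) and number of resamples (2000) are arbitrary assumptions.

```python
import math
import random
import statistics

rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]  # hypothetical observed sample
n = len(sample)

# Analytical route: standard deviation divided by the square root of n.
se_formula = statistics.stdev(sample) / math.sqrt(n)

# Bootstrap route: resample n observations with replacement and record the mean.
boot_means = []
for _ in range(2000):
    resample = rng.choices(sample, k=n)
    boot_means.append(statistics.fmean(resample))

# The spread of the resampled means estimates the standard error.
se_bootstrap = statistics.stdev(boot_means)
```

The two estimates should be close; the bootstrap needs far more computation but does not rely on a formula for the statistic, so the same loop works for a median or any other estimate.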
Confidence Interval: plausible range of values, with plausibility defined from the sampling distribution, e.g. CI = [Θ - 1.96 × SE, Θ + 1.96 × SE], in general Θ ± z × SE

95% CI: 5% of such intervals (computed from repeated samples) do not contain the true mean
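
The interpretation of the 95% interval can be illustrated by simulation: draw many samples from a population with a known mean, build Θ ± 1.96 × SE for each, and count how often the true mean is inside. The population parameters, sample size and number of trials below are assumptions for the sketch.

```python
import math
import random
import statistics

def mean_ci(sample, z=1.96):
    """Interval mean ± z × SE, with SE estimated from the sample itself."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se

rng = random.Random(0)
true_mean = 5.0  # known only because the data are simulated
trials = 1000

hits = 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, 2.0) for _ in range(50)]
    low, high = mean_ci(sample)
    if low <= true_mean <= high:
        hits += 1

coverage = hits / trials  # fraction of intervals that contain the true mean
```

The coverage should land near 0.95, meaning that roughly 5% of the intervals miss the true mean.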

P-Value: probability of observing data at least as favorable to the alternative hypothesis as the current data, if the null hypothesis is true. Meaning: given a sample and an apparent effect, what is the probability of seeing such an effect by chance?
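
One way to make "seeing such an effect by chance" concrete is an exact permutation test, sketched below. The two groups are made-up numbers, the effect is the difference in means, and the test is one-sided; all of these are assumptions for illustration and not necessarily the method used in the lecture.

```python
from itertools import combinations
from statistics import fmean

group_a = [12.1, 11.8, 13.0, 12.6]  # hypothetical measurements
group_b = [10.9, 11.2, 11.5, 10.7]

observed = fmean(group_a) - fmean(group_b)  # the apparent effect

pooled = group_a + group_b
indices = range(len(pooled))

# Null hypothesis: group labels do not matter, so every split is equally likely.
count = 0
total = 0
for chosen in combinations(indices, len(group_a)):
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in indices if i not in chosen]
    total += 1
    # Small tolerance guards against floating-point rounding.
    if fmean(a) - fmean(b) >= observed - 1e-9:
        count += 1

p_value = count / total  # share of splits with an effect at least as large
```

Here every value in group_a exceeds every value in group_b, so only the original split out of the 70 possible ones reaches the observed difference.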

Supervised Learning: algorithms that learn from labelled examples in order to generalize to the set of all possible inputs.
Seller: crperling, Universiteit Utrecht