Summary

Statistics & Methodology 2017/2018 - Summary

Pages
24
Uploaded on
10-01-2018
Written in
2017/2018

Summary Statistics & Methodology Data Science Logistic Linear Regression Correlation Distribution Centering Estimates Error R Rstudio

Statistics & Methodology
summary
In general
- Purpose of statistics: systematize the way we account for uncertainty when making data-based decisions.
- High variance (high standard deviation) » do not draw conclusions from the mean difference (Mdif) alone
- Data Scientist: raw information » data analytic techniques » actionable knowledge
- Do not overstate findings when presenting results » could lead to a waste of time/money

Probability Distributions
- PDs quantify how likely each possible value of some probabilistic entity is
- PDs are re-scaled frequency distributions
- Big population » histogram turns into a continuous
‘smooth’ curve (total area below: 1.0)
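
The re-scaling step can be illustrated with a minimal Python sketch. The simulated values, the seed, and the number of bins are assumptions chosen only for illustration; the point is that dividing each bin count by (number of values × bin width) turns a frequency distribution into one whose total area is 1.0.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # hypothetical "population"

n_bins = 20
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins

# Frequency distribution: count how many values fall into each bin.
counts = [0] * n_bins
for v in values:
    idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes into the last bin
    counts[idx] += 1

# Re-scale counts to densities so that bar area (height * width) sums to 1.
densities = [c / (len(values) * width) for c in counts]
total_area = sum(d * width for d in densities)
print(round(total_area, 6))
```

With more values and narrower bins the bars approach the smooth curve described above.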

Statistical Testing
- Distil information and control for uncertainty; weigh estimated effect by its precision
- Common type of statistical test, Wald Test: T = (Estimate - Null-hypothesized value) / Variability, which reduces to T = Estimate / Variability when the null value is 0
- Need to compare the test statistic to some objective reference to conduct the test
- This objective reference – the sampling distribution – tells us how exceptional our test statistic is.
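
As a rough illustration of weighing an estimate by its precision, the Python sketch below computes a Wald-type statistic for a mean difference. The two groups of ratings are made-up numbers, and using the unpooled standard error as the "variability" is an assumption; the notes do not specify which standard error is meant.

```python
import math
import statistics

# Hypothetical ratings from two groups (made-up numbers for illustration).
group_a = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]
group_b = [6.2, 6.9, 6.4, 6.6, 6.1, 6.7]

# Estimate: the observed mean difference.
estimate = statistics.mean(group_a) - statistics.mean(group_b)

# Variability: standard error of the mean difference (unpooled variances).
std_error = math.sqrt(
    statistics.variance(group_a) / len(group_a)
    + statistics.variance(group_b) / len(group_b)
)

# Wald-type statistic with a null value of 0: estimate weighed by its precision.
t_stat = estimate / std_error
print(f"estimate={estimate:.3f}, SE={std_error:.3f}, T={t_stat:.2f}")
```

The same mean difference with noisier groups would give a larger standard error and therefore a smaller T.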

Sampling Distribution
- SD (sampling distribution, not standard deviation) is simply the probability distribution of a statistic (e.g. a parameter estimate)
o Population is defined by an infinite sequence of repeated tests
o SD quantifies the possible values of test statistic over infinite repeated sampling
o Each point on curve represents probability of observing corresponding test statistic
- Sampling distribution ≠ random variable distribution
o SD: quantifies possible values of a statistic (mean, t-statistic, correlation coefficient)
o RVD: quantifies possible values of a variable (age, gender, income, food type)
o SD of T-statistic: draw samples repeatedly from RVD, re-compute T each time
- How exceptional is our estimated t-statistic?
o Compare the estimated value with the SD of the t-statistic assuming no effect (null hypothesis)
o When estimated statistic would be very unusual in a population where the null
hypothesis is true, we reject the null and claim a ‘statistically significant’ effect.
- Computing the probability of events
o Area of corresponding slice from the distribution
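
The idea of building the sampling distribution of a t-statistic by repeated sampling can be sketched in Python as follows. Everything specific here is an assumption for illustration: the normal random-variable distribution, the group size of 20, the 5,000 repetitions standing in for "infinite" sampling, and the observed value of 2.5.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def t_statistic(a, b):
    """Mean difference divided by its (unpooled) standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def draw_sample(n):
    """Draw n values from the hypothetical random-variable distribution."""
    return [rng.gauss(5.0, 1.0) for _ in range(n)]

# Under the null hypothesis both groups come from the same distribution.
# Re-compute T for each pair of fresh samples to approximate its sampling distribution.
null_t = [t_statistic(draw_sample(20), draw_sample(20)) for _ in range(5_000)]

observed_t = 2.5  # hypothetical estimated t-statistic
# Probability of an event = area of the corresponding slice of the distribution.
share_more_extreme = sum(abs(t) >= abs(observed_t) for t in null_t) / len(null_t)
print(f"mean of null t: {statistics.mean(null_t):.3f}")
print(f"share with |t| >= {observed_t}: {share_more_extreme:.3f}")
```

A small share means the observed statistic would be unusual in a population where the null hypothesis is true.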

P-values
- Calculate the area in the null distribution that exceeds the estimated test statistic (5% » 0.05)
o Compute probability of observing the given test statistic (or one more extreme) if the null hypothesis is true.
o Compute probability of having sampled the data we observed (or more unusual data) from a population wherein there is no true mean difference in ratings.
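
A minimal Python sketch of the tail-area calculation is given below. It assumes a standard normal null distribution (a large-sample approximation to the t-distribution) and a two-sided test; both choices, the function name `two_sided_p_value`, and the example statistics are assumptions, not taken from the notes.

```python
from statistics import NormalDist

def two_sided_p_value(test_statistic):
    """Area in both tails of a standard normal null distribution beyond |test_statistic|."""
    null_distribution = NormalDist(mu=0.0, sigma=1.0)
    upper_tail = 1.0 - null_distribution.cdf(abs(test_statistic))
    return 2.0 * upper_tail

alpha = 0.05  # 5% significance level
for t in (0.5, 1.96, 3.0):  # hypothetical test statistics
    p = two_sided_p_value(t)
    decision = "reject null" if p < alpha else "do not reject null"
    print(f"T={t:.2f}  p={p:.4f}  {decision}")
```

For small samples the t-distribution has heavier tails than the normal, so this approximation gives p-values that are somewhat too small.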

In R




Conclusions
- A considerate evaluation of uncertainty is crucial to any responsible data analysis.
- Even in situations where you may be analysing the entire ‘population’, you’ll need statistical
inference to make reliable projections of future outcomes.
- For simple questions we can use statistical testing to control for uncertainty!



Statistical Modelling
- Statistical testing quickly reaches a limit
- In experiments, real-world ‘messiness’ is controlled through random assignment » knowledge generalisation
- Data scientists normally work with messy observational data instead of conducting experiments
- Model: mathematical representation of data distribution
- ^Y = ^B0 + ^B1*X

Data Model
- Different from an algorithmic model
- Modular model, built from probability distributions
- Encode our hypothesised understanding of the system we’re
exploring
- Constructed in a ‘top-down’ theory-driven way

Regression Problem
- Counterpart of classification problems (which involve a qualitative response)
- Has input (X) and output (Y), involves quantitative response
- Simple mean comparison » can be expressed as a regression
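
The link between a mean comparison and a regression can be checked with a small Python sketch. It assumes made-up outcome values and a 0/1 coding of group membership as the predictor X; under that coding the least-squares slope equals the mean difference and the intercept equals the mean of the group coded 0.

```python
import statistics

# Hypothetical outcome Y for two groups, coded with a 0/1 predictor X.
y_group0 = [6.2, 6.9, 6.4, 6.6, 6.1, 6.7]
y_group1 = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]
x = [0] * len(y_group0) + [1] * len(y_group1)
y = y_group0 + y_group1

# Least-squares estimates for ^Y = ^B0 + ^B1*X.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

mean_difference = statistics.mean(y_group1) - statistics.mean(y_group0)
print(f"B0={b0:.3f} (mean of group 0), B1={b1:.3f}, mean difference={mean_difference:.3f}")
```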

Probability Distribution
- Unconditional (or marginal) distribution:
o Expected value of Y is the same for each observation
- Conditional distribution:
o Expected value of Y for each observation is defined by
observations’ characteristics

Simple Linear Regression
- The best fit line: ^Y = ^B0 + ^B1*X, so that Y = ^B0 + ^B1*X + e
o ^B0 » intercept » expected value of Y when X = 0
o ^B1 » slope » expected change in Y for a one-unit increase in X
o e » estimation error (residual) » (Y - ^Y)
- Regression coefficients
o Find best fit line
o Most popular: minimise the Residual Sum of Squares (RSS)
RSS = Σ (Y - ^Y)²
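
As a minimal Python sketch of finding the best fit line, the code below uses the closed-form least-squares estimates, which minimise the sum of squared residuals. The data points are made-up numbers and the helper name `rss` is hypothetical.

```python
# Hypothetical data (made-up numbers for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def rss(b0, b1):
    """Residual sum of squares: sum of (Y - ^Y)^2 for the line ^Y = b0 + b1*X."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Closed-form least-squares estimates of slope and intercept.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0_hat = y_bar - b1_hat * x_bar

print(f"B0={b0_hat:.3f}, B1={b1_hat:.3f}, RSS={rss(b0_hat, b1_hat):.4f}")
# Nudging either coefficient away from the estimates increases the RSS.
print(rss(b0_hat + 0.1, b1_hat) > rss(b0_hat, b1_hat))
print(rss(b0_hat, b1_hat + 0.1) > rss(b0_hat, b1_hat))
```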

JHessels Tilburg University