
Statistics & Methodology 2017/2018 - Summary

Pages: 24
Uploaded on: 10-01-2018
Written in: 2017/2018

Summary Statistics & Methodology Data Science Logistic Linear Regression Correlation Distribution Centering Estimates Error R Rstudio












Statistics & Methodology
summary
In general
- Purpose of statistics: systematize the way we account for uncertainty when making data-based decisions.
- High variance (high standard deviation) » do not draw conclusions based on the mean difference (Mdif) alone
- Data Scientist: raw information » data analytic techniques » actionable knowledge
- Do not overstate findings when presenting results » overstated claims could lead to a waste of time/money

Probability Distributions
- PDs quantify how likely each possible value of some
probabilistic entity is
- PDs are re-scaled frequency distributions
- Big population » histogram turns into a continuous
‘smooth’ curve (total area below: 1.0)
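
The idea of a probability distribution as a re-scaled frequency distribution can be illustrated with a minimal Python sketch. The die-roll data below are made up for illustration; dividing each count by the sample size gives values that sum to 1.0.

```python
from collections import Counter

# Hypothetical sample: outcomes of 20 die rolls (made-up data for illustration).
rolls = [1, 3, 3, 6, 2, 5, 3, 4, 6, 1, 2, 3, 5, 4, 4, 6, 3, 2, 1, 5]

# Frequency distribution: how often each value occurs.
frequencies = Counter(rolls)

# Re-scale by the sample size so that the values sum to 1.0.
n = len(rolls)
probabilities = {value: count / n for value, count in sorted(frequencies.items())}

for value, p in probabilities.items():
    print(f"P(X = {value}) = {p:.2f}")

total = sum(probabilities.values())
print(f"total = {total:.2f}")
```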

Statistical Testing
- Distil information and control for uncertainty; weigh estimated effect by its precision
- Common type of statistical test, Wald Test: T = Estimate / Variability
- Need to compare the test statistic to some objective reference to conduct the test
- This objective reference, the sampling distribution, tells us how exceptional our test statistic is.
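
As a rough illustration of T = Estimate / Variability, the Python sketch below computes a Wald-type statistic for a mean difference. The two groups of ratings are made up, and using the unpooled standard error as the "variability" is an assumption of this sketch, not something the notes specify.

```python
import math
import statistics

# Hypothetical ratings from two groups (made-up numbers for illustration).
group_a = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]
group_b = [6.2, 6.9, 6.5, 6.4, 6.7, 6.3]

# Estimate: the difference between the two sample means.
estimate = statistics.mean(group_a) - statistics.mean(group_b)

# Variability: standard error of the mean difference (unpooled variances, assumed here).
variability = math.sqrt(
    statistics.variance(group_a) / len(group_a)
    + statistics.variance(group_b) / len(group_b)
)

# Weigh the estimated effect by its precision.
t_stat = estimate / variability
print(f"estimate = {estimate:.3f}, SE = {variability:.3f}, T = {t_stat:.2f}")
```

The same mean difference would give a smaller T if the groups were noisier, which is why a raw difference on its own says little.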

Sampling Distribution
- SD is simply the probability distribution of a statistic (e.g. a parameter estimate)
o Population is defined by an infinite sequence of repeated tests
o SD quantifies the possible values of the test statistic over infinite repeated sampling
o Each point on the curve represents the probability (density) of observing the corresponding test statistic
- Sampling distribution ≠ random variable distribution
o SD: quantifies possible values of a statistic (mean, t-statistic, correlation coefficient)
o RVD: quantifies possible values of a variable (age, gender, income, food type)
o SD of T-statistic: draw samples repeatedly from RVD, re-compute T each time
- How exceptional is our estimated t-statistic?
o Compare the estimated value with the SD of the t-statistic assuming no effect (null hypothesis)
o When the estimated statistic would be very unusual in a population where the null
hypothesis is true, we reject the null and claim a ‘statistically significant’ effect.
- Computing the probability of events
o Area of corresponding slice from the distribution
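
The "draw samples repeatedly, re-compute T each time" recipe can be approximated by simulation. The Python sketch below assumes both groups come from the same normal distribution (so the null hypothesis holds); the distribution, sample size, and number of repetitions are arbitrary choices for illustration.

```python
import math
import random
import statistics

def t_statistic(sample_a, sample_b):
    """Wald-type statistic: mean difference divided by its standard error."""
    estimate = statistics.mean(sample_a) - statistics.mean(sample_b)
    se = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return estimate / se

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def draw_sample(n):
    """Assumed RVD: a normal distribution shared by both groups (no true effect)."""
    return [rng.gauss(5.0, 1.0) for _ in range(n)]

# Approximate the SD of T: draw samples repeatedly and re-compute T each time.
null_ts = [t_statistic(draw_sample(30), draw_sample(30)) for _ in range(2000)]

# Probability of an event = area of the corresponding slice of the distribution.
share_extreme = sum(abs(t) > 2.0 for t in null_ts) / len(null_ts)
print(f"mean of T under the null: {statistics.mean(null_ts):.3f}")
print(f"share of |T| > 2.0: {share_extreme:.3f}")
```

A finite simulation only approximates the infinite repeated sampling described above; more repetitions give a smoother picture of the curve.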

P-values
- Calculate the area in the null distribution that exceeds the
estimated test statistic, and compare it with the significance level (5% » 0.05)
o Compute the probability of observing the given test statistic (or one more extreme) if the null
hypothesis is true.
o Compute the probability of having sampled the data we observed (or more unusual data)
from a population wherein there is no true mean difference in ratings.
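
A p-value as a tail area can be sketched in a few lines of Python. This sketch assumes the null sampling distribution is approximately standard normal, which is reasonable for large samples only, and the observed statistic of 2.3 is a hypothetical value.

```python
from statistics import NormalDist

# Assumption: the null sampling distribution is approximately standard normal
# (small samples would call for a t distribution instead).
null_distribution = NormalDist(mu=0.0, sigma=1.0)

t_observed = 2.3  # hypothetical estimated test statistic

# One-sided p-value: area of the null distribution above the observed statistic.
p_one_sided = 1.0 - null_distribution.cdf(t_observed)

# Two-sided p-value: area beyond |t_observed| in both tails.
p_two_sided = 2.0 * (1.0 - null_distribution.cdf(abs(t_observed)))

alpha = 0.05  # significance level (5%)
reject_null = p_two_sided < alpha
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
print("reject null" if reject_null else "do not reject null")
```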

In R




Conclusions
- A careful evaluation of uncertainty is crucial to any responsible data analysis.
- Even in situations where you may be analysing the entire ‘population’, you’ll need statistical
inference to make reliable projections of future outcomes.
- For simple questions we can use statistical testing to control for uncertainty!



Statistical Modelling
- Statistical testing quickly reaches a limit
- In experiments, real-world ‘messiness’ is controlled through random assignment » knowledge generalisation
- Data scientists normally work with messy observational data instead of conducting experiments
- Model: mathematical representation of data distribution
- ^Y = ^B0 + ^B1*X

Data Model
- Different from an algorithmic model
- Modular model, built from probability distributions
- Encode our hypothesised understanding of the system we’re
exploring
- Constructed in a ‘top-down’ theory-driven way

Regression Problem
- Counterpart of classification problems, which involve a categorical response
- Has input (X) and output (Y), involves quantitative response
- Simple mean comparison » can be expressed as a regression
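
One way to see the link between a mean comparison and regression is to regress the outcome on a 0/1 group indicator. In the Python sketch below (made-up ratings, hypothetical group coding), the least-squares slope equals the difference in group means and the intercept equals the mean of the group coded 0.

```python
import statistics

# Hypothetical ratings for two groups (made-up numbers for illustration).
group_a = [6.2, 6.9, 6.5, 6.4, 6.7, 6.3]  # coded X = 0
group_b = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0]  # coded X = 1

x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

# Least-squares slope and intercept for a single predictor.
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

mean_difference = statistics.mean(group_b) - statistics.mean(group_a)
print(f"intercept = {intercept:.3f} (mean of group A)")
print(f"slope = {slope:.3f}, mean difference = {mean_difference:.3f}")
```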

Probability Distribution
- Unconditional (or marginal) distribution:
o Expected value of Y is the same for each observation
- Conditional distribution:
o Expected value of Y for each observation is defined by
that observation’s characteristics
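
The contrast between the two can be sketched in Python with made-up data: the marginal mean is a single number for everyone, while the conditional means differ by a characteristic (here a hypothetical `group` label).

```python
import statistics
from collections import defaultdict

# Hypothetical (group, income) observations, made up for illustration.
observations = [
    ("student", 900), ("student", 1100), ("student", 1000),
    ("employee", 2800), ("employee", 3200), ("employee", 3000),
]

# Unconditional (marginal): one expected value shared by every observation.
marginal_mean = statistics.mean(income for _, income in observations)

# Conditional: the expected value depends on each observation's characteristics.
by_group = defaultdict(list)
for group, income in observations:
    by_group[group].append(income)
conditional_means = {group: statistics.mean(v) for group, v in by_group.items()}

print(f"marginal mean: {marginal_mean}")
for group, mean_income in conditional_means.items():
    print(f"mean given group = {group}: {mean_income}")
```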

Simple Linear Regression
- The best fit line: ^Y = ^B0 + ^B1*X, so that Y = ^B0 + ^B1*X + e
o ^B0 » intercept » expected value of Y when X = 0
o ^B1 » slope » expected change in Y for a one-unit increase in X
o e » residual (estimation error) » (Y - ^Y)
- Regression coefficients
o Find the best fit line
o Most popular criterion: minimise the Residual Sum of Squares, RSS = Σ (Y - ^Y)²
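
A minimal Python sketch of fitting the best fit line by least squares, using the standard closed-form estimates for a single predictor. The data are made up, and the names `b0`, `b1`, and `rss` are illustrative stand-ins for ^B0, ^B1, and the RSS.

```python
# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates for one predictor.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

def rss(intercept, slope):
    """Residual sum of squares: sum of (Y - Y_hat)^2 over all observations."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

best_rss = rss(b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RSS = {best_rss:.4f}")

# Shifting the intercept or the slope away from the estimates increases the RSS.
print(rss(b0 + 0.5, b1) > best_rss, rss(b0, b1 - 0.2) > best_rss)
```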

Uploaded by: JHessels, Tilburg University