Statistics DSS (25spring)
Notes & Study Guide
Number in title refers to the corresponding Module
(order of modules is adjusted for clearer structure)
Entries marked with the sign are given quiz & exam sample questions
To pass or get a good score in the final exam, it is strongly
recommended to thoroughly engage with the material
and gain a deep understanding of the concepts and
terms.
Any questions, please email to:
Version: 202503201719
By: Alice
Statistical Reasoning (1)
Statistical reasoning: the foundation of all good statistical analyses is a deliberate, careful, and thorough consideration of uncertainty.
The purpose of statistics: to systematize the way that we account for uncertainty when making data-based decisions.
No need to memorize any formulas (unless the teacher specifically says so). In general, the larger the test statistic, the better.
(Week 1 Basic 3, 02:51) Statistics is all about what we can see as humans and how we decide based on what we see; it is not about how the real world actually is.
Probability distribution (a re-scaled frequency distribution): quantifies how likely it is to observe each possible value of some probabilistic entity (e.g. height, the outcome variable).
Statistical testing: distill information into a simple statistic to make a judgement; we weight the estimated effect by the precision of the estimate.
Wald test: T = (estimated effect) / (standard error of the estimate), i.e. the estimated effect weighted by its precision.
Nil-null: a null hypothesis of no effect.
t-test: a way to summarize the comparison of two variables' distributions.
Very important: the possible values of the test statistic. The t-statistic also has a sampling distribution that quantifies the possible t-values we could get if we repeatedly drew samples from the variables' distributions and re-computed a t-statistic each time.
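A minimal Python sketch of this idea (not from the course materials; numpy/scipy, the sample sizes, and the seed are my own assumptions): repeatedly draw two samples from the same population (the nil-null) and re-compute the two-sample t-statistic each time; the collection of t-values traces out its sampling distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_reps, n_per_group = 5000, 30

t_values = []
for _ in range(n_reps):
    # Under the nil-null, both groups are drawn from the same population.
    x = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    y = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    t_stat, _ = stats.ttest_ind(x, y)
    t_values.append(t_stat)

# The spread of these t-values approximates the theoretical t distribution
# with df = 2 * n_per_group - 2 = 58.
print(np.mean(t_values), np.std(t_values))
```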
Directional hypothesis: a hypothesis that specifies the direction of the effect (tested with a one-tailed p-value).
P value: NOT a test statistic.
P(t = t̂ | H₀) = 0 (the probability of observing any individual point on a continuous distribution is exactly zero). (Week 1 Part 1 Quiz 2)
CAN NOT say:
- There is a 0.032 probability that the true mean difference is greater than zero.
- There is a 0.032 probability that the null hypothesis is false.
- There is a 0.032 probability that the observed result is due to chance alone.
- There is a 0.032 probability of replicating the observed effect in the future.
- There is a 0.032 probability of observing t̂, if the null hypothesis is true.
How do we interpret the p value then? There is a 0.032 probability of observing a test statistic at least as large as t̂, if the null hypothesis is true: P(t ≥ t̂ | H₀).
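A minimal sketch of this computation, assuming scipy is available; t_hat and df are made-up example values, not the numbers behind the 0.032 above.

```python
from scipy import stats

t_hat, df = 1.92, 58                  # hypothetical observed t-statistic and degrees of freedom
p_one_sided = stats.t.sf(t_hat, df)   # survival function: P(t >= t_hat | H0)
print(round(p_one_sided, 3))
```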
What do we want to know? The inverse question: what is the probability that the null hypothesis is true, given a t-statistic larger than or equal to the estimated one?
*not possible with null-hypothesis testing; possible only with Bayesian statistics
In what scenario do we use statistical testing? In experimental contexts. The real world, in contrast, has messy observational data with no control for confounding factors; for that, we need statistical modeling.
Statistical modeling: build a mathematical representation of the (interesting aspects of the) data distribution -> learn the important features of a distribution (without a prior hypothesis).
Data Science Cycle (4 essential steps): define a problem – collect data – process data – clean data (slide week 1 design p4)
Collecting your own data is NOT always preferred over secondary data (week 1 part 2 quiz 2)
EDA (Exploratory data analysis): 1) more a mindset than a set of techniques/steps; 2) contrasts with strict empiricist hypothesis testing; 3) can be used to generate hypotheses for CDA; 4) sanity-checks hypotheses: if they fail, reject them (you can't modify hypotheses based on these sanity checks and still test the new hypotheses with the same data); 5) if you don't care about testing hypotheses, focus on EDA.
CDA (Confirmatory data analysis): 1) if the data are well understood, proceed directly to CDA.
*CDA and EDA can NOT stand alone
Outliers (2)
What is a univariate outlier? Extreme values with respect to the distribution of a variable's other observations.
- illegal values: data entry errors (the most common cause)
- legal values: extreme values (e.g. a person 3 meters tall)
We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings.
What are the methods to diagnose potential outliers?
1. Internally studentized residuals (Z-score method): for each observation Xₙ, compute Tₙ = (Xₙ − X̄) / SD(X).
Tₙ follows a Student's t distribution with df = N − 1, so we can do a formal test for "outlier" status; assuming a large sample, if |Tₙ| > C (C is usually 2 or 3), we label Xₙ as an outlier.
This means any point that is not an outlier should not be too far from the mean; we define "how far" in terms of SD.
Cons:
- C (the cut-point) can only be meaningfully chosen when X is normally distributed
- Both X̄ and SD are highly sensitive to outliers
2. Externally studentized residuals: like internally studentized residuals, but adjust X̄ and SD to remove the influence of the observation we are evaluating (the outlier itself would otherwise affect the mean and SD): use the deletion mean and deletion SD.
Pros:
- T(n) is immune to the influence of the n-th observation.
Cons:
- X is still required to be normally distributed
- can still be sensitive to other outliers that are not the n-th observation
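The same toy data under the deletion (externally studentized) version; the loop, data, and cut-point are illustrative assumptions.

```python
import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 9.8])
C = 3
flags = []
for i in range(len(x)):
    others = np.delete(x, i)                             # deletion sample: drop the point being evaluated
    t_ext = (x[i] - others.mean()) / others.std(ddof=1)  # deletion mean and deletion SD
    flags.append(abs(t_ext) > C)
print(flags)
# The extreme value is now flagged, because it no longer inflates its own denominator;
# with several outliers, though, the remaining ones can still distort the deletion SD.
```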
3. Median absolute deviation (MAD) method: replace the mean of X with Med(X) and the SD with the median absolute deviation (MAD):
MAD = b · Med(|Xₙ − Med(X)|), with b = 1 / Q(0.75) = 1 / 0.6745 ≈ 1.4826, where Q(0.75) is the 0.75 quantile of the standard normal distribution.
Pros: immune to the influence of (at most 50%) outliers.
Cons:
- does not allow for formal statistical tests
- X is required to follow a known parametric distribution (needed to compute b)
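A sketch of the MAD rule on the same illustrative data; the cut-point C = 3 is an assumption.

```python
import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 9.8])
C = 3
b = 1 / 0.6745                                 # consistency constant for a normal distribution
mad = b * np.median(np.abs(x - np.median(x)))  # MAD = b * Med(|X_n - Med(X)|)
print(np.abs(x - np.median(x)) / mad > C)      # robust distance from the median, in MAD units
```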
4. Tukey's boxplot method:
IQR = Q₃ − Q₁
Fences: F = {Q₁ − C · IQR, Q₃ + C · IQR}
Pros: does not require normally distributed X; not sensitive to outliers.
Cons: does not allow for formal statistical tests.

C   | Fence type | Outlier type
1.5 | Inner      | Possible
3   | Outer      | Probable

Breakdown point: the minimum proportion of cases that must be replaced by infinity to cause the value of the statistic to go to infinity.

Statistic       | Mean | Deletion mean | Median | Boxplot method
Breakdown point | 1/N  | 2/N           | N/2    | 25%
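A sketch of the fence computation on illustrative data, using numpy's percentile function for the quartiles.

```python
import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 9.8])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
for C, label in [(1.5, "possible"), (3.0, "probable")]:
    lower, upper = q1 - C * iqr, q3 + C * iqr   # inner (C = 1.5) and outer (C = 3) fences
    print(label, x[(x < lower) | (x > upper)])  # values outside the fences
```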
Multivariate outliers: e.g. a person in the 95th percentile for height & the 5th percentile for weight.
How do we detect them?
Distance metrics: quantify the similarity of two vectors (here, the similarity between an observation and the mean vector).
- Mahalanobis distance: the multivariate generalization of the internally studentized residual; measures how far an observation is from the center of the data cloud, relative to the size of the cloud. Cons: it is computed using all observations, so it is also sensitive to outliers.
- Robust Mahalanobis distance: the Minimum Covariance Determinant (MCD) method estimates the center and covariance using only a good subset of the data; using fewer observations makes it less influenced by outliers.
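A sketch comparing classical and robust (MCD) distances, assuming scikit-learn's EmpiricalCovariance and MinCovDet estimators (not necessarily the tools used in the course); the toy data are illustrative.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(1)
# Bulk of the data: a correlated height/weight-like cloud (standardized units).
bulk = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
# One multivariate outlier: high on the first variable, low on the second,
# so extreme on the combination rather than on either margin alone.
X = np.vstack([bulk, [[1.65, -1.65]]])

d2_classic = EmpiricalCovariance().fit(X).mahalanobis(X)     # uses all observations
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)  # center/covariance from a "good" subset
print(d2_classic[-1], d2_robust[-1])  # squared distances of the added point under each estimate
```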