Summary

Summary and Study Guide for Statistics and Methodology (25 Spring)

Rating: -
Sold: 1
Pages: 16
Uploaded on: 24-03-2025
Written in: 2024/2025

see details in preview pages


Content preview

Statistics DSS (25spring)
Notes & Study Guide




The number in each title refers to the corresponding module
(the order of the modules is adjusted for a clearer structure).

With sign: given quiz & exam sample questions

To pass or get a good score in the final exam, it is strongly
recommended to engage thoroughly with the material
and gain a deep understanding of the concepts and
terms.

Any questions, please email to:

Version: 202503201719
By: Alice

Statistical reasoning (1)

Statistical reasoning: the foundation of all good statistical analyses is a deliberate, careful, and thorough consideration of uncertainty.
The purpose of statistics: to systematize the way that we account for uncertainty when making data-based decisions.
No need to memorize any formulas, unless the teacher specifically says so.
In general: the larger the test statistic, the better.
(Week 1 Basic 3, 02:51) Statistics is all about what we can see as humans and how we decide based on what we see; it is not about how the real world actually is.

Probability distribution: quantifies how likely it is to observe each possible value of some probabilistic entity (e.g. height, the outcome variable). It is a re-scaled frequency distribution.
Statistical testing: distill the information into a simple statistic to make a judgement; we weight the estimated effect by the precision of the estimate.


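Wald test: $T = \frac{\text{Estimate} - \text{Null-hypothesized value}}{\text{Variability (standard error of the estimate)}}$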
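The idea of weighting the estimated effect by its precision can be sketched in a few lines. This is a minimal illustration assuming the usual form of the Wald statistic (estimate minus null-hypothesized value, divided by the standard error); the function name `wald_statistic` and the numbers are hypothetical and not taken from the course material.

```python
def wald_statistic(estimate: float, null_value: float, standard_error: float) -> float:
    """Estimated effect, measured from the null value, in units of its standard error."""
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    return (estimate - null_value) / standard_error


# Hypothetical example: an estimated mean difference of 2.5 under a nil-null of 0.
precise = wald_statistic(estimate=2.5, null_value=0.0, standard_error=0.5)
imprecise = wald_statistic(estimate=2.5, null_value=0.0, standard_error=2.0)
```

The same estimated effect gives a larger statistic when it is estimated more precisely (smaller standard error).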

Nil-null: a null hypothesis of no effect.

Very important: the possible values of the test statistic.

t-test: a way to summarize the comparison of two variables' distributions.
The t-statistic also has a sampling distribution that quantifies the possible t-values we could get if we repeatedly drew samples from the variables' distributions and re-computed a t-statistic each time.

Directional hypothesis

P value: NOT a test statistic.
$P(t = \hat{t} \mid H_0) = 0$ (the probability of observing any individual point on a continuous distribution is exactly zero.) (Week 1 Part 1 Quiz 2)

We CAN NOT say:
- There is a 0.032 probability that the true mean difference is greater than zero.
- There is a 0.032 probability that the null hypothesis is false.
- There is a 0.032 probability that the observed result is due to chance alone.
- There is a 0.032 probability of replicating the observed effect in the future.
- There is a 0.032 probability of observing $\hat{t}$, if the null hypothesis is true.

How do we interpret the p value then?
There is a 0.032 probability of observing a test statistic at least as large as $\hat{t}$, if the null hypothesis is true: $P(t \geq \hat{t} \mid H_0)$
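The "repeatedly draw samples and re-compute the statistic" idea behind the sampling distribution, and the reading of the p value as the probability of a statistic at least as large as the estimated one under the null, can be approximated by simulation. The sketch below makes simplifying assumptions that are not from the notes: a one-sample t-statistic, a standard normal population under the null, and a one-sided comparison. All names are hypothetical.

```python
import math
import random


def one_sample_t(sample, null_mean=0.0):
    """t-statistic for a single sample against a hypothesized mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - null_mean) / math.sqrt(var / n)


def simulated_p_value(t_hat, n, n_reps=5000, seed=1):
    """Approximate P(t >= t_hat | H0) by re-sampling under the null hypothesis."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        # Draw a fresh sample from a world in which the null is true.
        null_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if one_sample_t(null_sample) >= t_hat:
            hits += 1
    return hits / n_reps


# Hypothetical usage: how often does the null alone produce t >= 2.0 with n = 20?
p = simulated_p_value(t_hat=2.0, n=20)
```

Note that the simulation never says how probable the null hypothesis is; it only describes what the test statistic does when the null is true.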

What do we want to know?
The inverse question: what is the probability that the null hypothesis is true, given that a t-statistic is larger than or equal to the estimated one?
*Not possible with null-hypothesis testing; possible only with Bayesian statistics.

In what scenario do we use statistical testing?
In experimental contexts. The real world, in contrast, has messy observational data with no control for confounding factors.

We need statistical modeling.

Statistical modeling: build a mathematical representation of the (interesting aspects of) a data distribution
-> learn the important features of a distribution (without a prior hypothesis)
Data Science Cycle (4 essential steps): define a problem -> collect data -> process data -> clean data (slide week 1 design p4)
Collecting own data is NOT always preferred over secondary data (week 1 part 2 quiz 2)

EDA (exploratory data analysis):
1) more a mindset than a set of techniques/steps;
2) contrasts with strict empiricist hypothesis testing;
3) can be used to generate hypotheses for CDA;
4) sanity-check hypotheses; if they fail, reject them (you can't modify hypotheses based on these sanity checks and still test the new hypotheses with the same data);
5) if you don't care about testing hypotheses, focus on EDA.
CDA (confirmatory data analysis):
1) if the data are well understood, proceed directly to CDA.
*CDA and EDA can NOT stand alone

Outliers (2)

What is a univariate outlier?
Extreme values with respect to the distribution of a variable's other observations
- illegal value: data entry errors (most common cause)
- legal value: extreme values (e.g. a person 3 meters tall)

We choose to view an outlier as arising from a different population than the one to which we want to generalize our findings.
What are the methods to diagnose potential outliers?

1. Internally studentized residuals (Z-score method)
For each observation $X_n$: $T_n = \frac{X_n - \bar{X}}{SD_X}$
This means any point that is not an outlier should not be too far from the mean; we define how far it is in terms of SD.
$T_n$ follows a Student's t distribution with df = N - 1.
We can do a formal test for "outlier" status, assuming a large sample.
If $|T_n| > C$ (C is usually 2 or 3), we label $X_n$ as an outlier.
Cons:
- C (cut-point) can only be meaningfully chosen when X is normally distributed
- Both $\bar{X}$ and $SD_X$ are highly sensitive to outliers
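A minimal sketch of the Z-score method, assuming the residual is the distance from the sample mean in units of the sample SD. The function names and the height data (in meters, with one 3-meter value) are hypothetical.

```python
import statistics


def internally_studentized(values):
    """Distance of each observation from the mean, in units of the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


def flag_outliers(values, cut=3.0):
    """Observations whose absolute studentized residual exceeds the cut-point C."""
    scores = internally_studentized(values)
    return [x for x, t in zip(values, scores) if abs(t) > cut]


# Hypothetical heights in meters; the last value is the 3-meter person.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73, 1.66, 1.78, 1.71, 3.00]

flagged_at_2 = flag_outliers(heights, cut=2.0)
flagged_at_3 = flag_outliers(heights, cut=3.0)
```

With these numbers the 3-meter value is flagged at C = 2 but not at C = 3, because the outlier itself inflates the mean and the SD. That is the sensitivity listed under Cons.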

2. Externally studentized residuals
Like internally studentized residuals, but adjust $\bar{X}$ and $SD_X$ to remove the influence of the observation we are evaluating (deletion mean, deletion SD).
Reason: the outlier itself will affect the mean and SD.
Pros:
- T(n) is immune to the influence of the n-th observation.
Cons:
- X is still required to be normally distributed
- can still be sensitive to other outliers that are not the n-th observation
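The deletion idea can be sketched by leaving each observation out before computing the mean and SD it is compared against. This is an illustration under that assumption, not necessarily the exact formula used in the course; the names and the height data are hypothetical.

```python
import statistics


def externally_studentized(values):
    """Score each observation against the mean and SD of all OTHER observations."""
    values = list(values)
    scores = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]       # drop the observation being evaluated
        deletion_mean = statistics.mean(rest)
        deletion_sd = statistics.stdev(rest)
        scores.append((x - deletion_mean) / deletion_sd)
    return scores


# Hypothetical heights in meters; the last value is the 3-meter person.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73, 1.66, 1.78, 1.71, 3.00]
scores = externally_studentized(heights)
```

Here the 3-meter value gets a very large score because it no longer inflates its own reference mean and SD. The scores of the other observations are still computed with the outlier included, which is the second Con above.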

3. Median absolute deviation method
Replace the mean of X with Med(X) and the SD with the median absolute deviation (MAD).
$MAD = b \cdot Med(|X - Med(X)|)$, with $b = 1 / Q_{0.75} = 1 / 0.6745$ (normal distribution)
Pros: immune to the influence of outliers (up to 50% of the data).
Cons:
- does not allow for formal statistical tests
- X is required to follow a parametric distribution (needed to compute b)
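A minimal sketch of the MAD method, assuming normally distributed X so that b = 1 / 0.6745. The function names, the cut-point of 3, and the height data are hypothetical.

```python
import statistics

B_NORMAL = 1 / 0.6745  # scaling constant b under the normality assumption


def mad(values, b=B_NORMAL):
    """Median absolute deviation, scaled by b."""
    med = statistics.median(values)
    return b * statistics.median([abs(x - med) for x in values])


def mad_scores(values, b=B_NORMAL):
    """Distance of each observation from the median, in units of the MAD."""
    med = statistics.median(values)
    scale = mad(values, b)
    if scale == 0:
        raise ValueError("MAD is zero; scores are undefined for this sample")
    return [(x - med) / scale for x in values]


# Hypothetical heights in meters; the last value is the 3-meter person.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73, 1.66, 1.78, 1.71, 3.00]
flagged = [x for x, t in zip(heights, mad_scores(heights)) if abs(t) > 3]
```

Because the median and the MAD barely move when one value is extreme, the 3-meter value stands out clearly here.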

4. Tukey's boxplot method
$IQR = Q_3 - Q_1$
$F = \{Q_1 - C \cdot IQR,\; Q_3 + C \cdot IQR\}$
Pros: does not require normally distributed X; not sensitive to outliers
Cons: does not allow for formal statistical tests
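The fences can be computed directly from the quartiles. The sketch below uses `statistics.quantiles` with its default method; quartile definitions differ between software packages, so the exact fence values are an assumption of this example. The function names and the height data are hypothetical.

```python
import statistics


def tukey_fences(values, c=1.5):
    """Lower and upper fence: Q1 - C * IQR and Q3 + C * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - c * iqr, q3 + c * iqr


def tukey_outliers(values, c=1.5):
    """Observations that fall outside the fences for the given C."""
    lower, upper = tukey_fences(values, c)
    return [x for x in values if x < lower or x > upper]


# Hypothetical heights in meters; the last value is the 3-meter person.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.73, 1.66, 1.78, 1.71, 3.00]
possible = tukey_outliers(heights, c=1.5)   # outside the inner fence
probable = tukey_outliers(heights, c=3.0)   # outside the outer fence
```

C = 1.5 gives the inner fence and C = 3 the outer fence; a value outside the outer fence is also outside the inner one.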

C     Fence type   Outlier type
1.5   Inner        Possible
3     Outer        Probable

Breakdown point: the minimum proportion of cases that must be replaced by infinity to cause the value of the statistic to go to infinity.

Statistic        Breakdown point
Mean             1/N
Deletion mean    2/N
Median           50% (N/2 cases)
Boxplot method   25%
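The definition can be checked numerically by replacing cases with infinity and watching when a statistic breaks. A minimal sketch with five hypothetical values; the helper name `replace_with_inf` is made up for this example.

```python
import math
import statistics


def replace_with_inf(values, k):
    """Copy of `values` with the last k cases replaced by +infinity."""
    values = list(values)
    return values[:len(values) - k] + [math.inf] * k


def mean(values):
    return sum(values) / len(values)


data = [1.0, 2.0, 3.0, 4.0, 5.0]

mean_one_replaced = mean(replace_with_inf(data, 1))
median_one_replaced = statistics.median(replace_with_inf(data, 1))
median_two_replaced = statistics.median(replace_with_inf(data, 2))
median_three_replaced = statistics.median(replace_with_inf(data, 3))
```

A single replaced case is enough to send the mean to infinity, while the median of these five values stays finite until more than half of the cases are replaced.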
Multivariate outliers: e.g. a person in the 95th percentile for height & the 5th percentile for weight
How do we detect them?
Distance metrics: quantify the similarity of two vectors (here: the similarity between an observation & the mean vector)
- Mahalanobis distance: multivariate generalization of the internally studentized residual.
  How far away an observation is from the center of the data cloud, relative to the size of the cloud.
  Cons: it is computed using all observations, so it is also sensitive to outliers.
- Robust Mahalanobis distance: minimum covariance determinant (MCD) method, using only a good subset of the data to estimate the mean vector and covariance matrix.
  Uses fewer observations, and is therefore less influenced by outliers.
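For two variables the (non-robust) Mahalanobis distance can be written out by hand, because the 2x2 covariance matrix has a simple inverse. The sketch below assumes the sample mean vector and sample covariance matrix are estimated from all observations; it does not implement the MCD variant. The function name and the height/weight pairs are hypothetical.

```python
import math
import statistics


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the mean vector."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2          # determinant of the covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse covariance * (dx, dy)^T, written out for 2x2
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical (height in cm, weight in kg) pairs; the last person is tall but light.
people = [(160, 54), (165, 61), (170, 64), (175, 71), (180, 74),
          (185, 81), (168, 62), (178, 74), (172, 68), (184, 56)]
distances = mahalanobis_2d(people)
most_unusual = people[distances.index(max(distances))]
```

The tall-but-light person is the farthest from the center of the data cloud because the combination goes against the positive relation between height and weight, which is what the distance takes into account.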
Seller: AliceOuterspace (Tilburg University)