Statistics and Methodology
▪ Foundation of Statistics
Statistical reasoning: systematize the way we evaluate the uncertainty of data-based decisions
  ☺ Protects us from overstating our findings
Statistical testing: quantify and control for uncertainty
  → Output = test statistic
  → Objective reference → p-value
Concepts:
Variability: how spread out a dataset is
Probability distribution: re-scaled frequency distribution
  - y-axis = probability density
  - Area under graph = 1
  - Marginal/unconditional → only one variable, constant mean (e.g. 0)
  - Conditional → two variables; the distribution of y (and its mean) depends on the value of x
Sampling distribution: a mathematical function that describes all of the possible values that a statistic can take, and how likely each one is
  - One kind of probability distribution
  - Population = possible values of the test statistic (a random variable, not a parameter) over infinite repeated sampling
P-value (frequentist): probability of observing a test statistic at least as extreme as the one obtained, in the corresponding sampling distribution, if H0 is true
  - One-sided → we do not care about the other direction at all; the entire Type I error rate sits in one tail
  - (Need to decide one-sided/two-sided before testing)
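The sampling distribution and the p-value can be made concrete with a small simulation. The sketch below is only an illustration under assumed values that are not from these notes: H0 says the data are standard normal, the sample size is 25, and the observed sample mean is 0.5. The function names are hypothetical helpers.

```python
import random
import statistics

def simulate_sampling_distribution(n, reps, seed=0):
    """Approximate the sampling distribution of the sample mean under H0
    (assumed here: standard normal data) by repeated sampling."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]

def p_value(null_stats, observed, two_sided=True):
    """Share of simulated statistics at least as extreme as the observed one."""
    if two_sided:
        extreme = sum(abs(s) >= abs(observed) for s in null_stats)
    else:
        # One-sided (upper tail): the other direction is ignored entirely.
        extreme = sum(s >= observed for s in null_stats)
    return extreme / len(null_stats)

null_means = simulate_sampling_distribution(n=25, reps=20000)
observed_mean = 0.5  # hypothetical observed test statistic

p_two = p_value(null_means, observed_mean, two_sided=True)
p_one = p_value(null_means, observed_mean, two_sided=False)
print(f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```

With these assumed numbers the standard error is 0.2, so the analytic two-sided p-value is about 0.012 and the one-sided one about 0.006; the simulated values should land close to those.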
Statistical modelling: mathematical representation describing only the important features of a distribution → ☺ lets us control for confounds
  Inference: learn about the relationship between variables
  Prediction: guess the value of an outcome
▪ Data Science Cycle
1. Define Problem
Research design: here we design the analysis (of observational data), not experiments (which produce experimental data)
- Operationalize research questions (vague → analyzable)
  ☺ Statistically rigorous → can be answered in a statistical way
  ☺ Quantifiable → clear outcome variable
  ☺ A set of hypotheses (if possible)
- Designing analysis
  ? Supervised vs unsupervised
  ? Inference vs prediction → causal inference is more costly than correlation
  ? Probabilistic answers vs binary decisions
  ? Extrinsic limitations (e.g. time, resources, ethical issues)
2. Data Collection
  ? Required variables → measured / constructed
  ? Sensitive data → proxies
  ? Rare data → preferential sampling
  ? Experimental data vs observational data
  ? Sample size → power analysis
  ? Secondary data source → Access? Quality? Processing required?
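As an illustration of the "sample size → power analysis" item, the sketch below computes an approximate per-group sample size for a two-sided comparison of two means. It assumes a normal approximation and a standardized effect size (Cohen's d); the function name and the default alpha and power values are illustrative choices, not something prescribed by these notes.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical small, medium, and large standardized effects
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because this uses the normal rather than the t distribution, it slightly understates the n that an exact t-test power calculation would give, so it is best read as a lower bound for planning.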
3. Data Processing
4. Data Cleaning → analyzable format, legal values, outliers & missing data well handled
Missing data = empty cells where observed values should have been
▪ Missing data pattern → unique combination of observed & missing items
- Number of possible patterns = 2^P, where P = no. of variables
- "No missing values" is also one pattern
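A minimal sketch of tabulating missing data patterns, assuming missing cells are stored as None. The toy dataset and the helper name are hypothetical.

```python
from collections import Counter

def missing_patterns(rows):
    """Count each pattern of observed (1) and missing (0) values across rows."""
    return Counter(tuple(int(v is not None) for v in row) for row in rows)

# Hypothetical toy data with P = 3 variables; None marks a missing cell
rows = [
    (1.2, 3.4, 5.0),
    (None, 2.2, 4.1),
    (0.7, None, None),
    (1.9, 2.8, 3.3),
    (None, 1.5, 2.0),
]

patterns = missing_patterns(rows)
n_vars = len(rows[0])
max_patterns = 2 ** n_vars  # upper bound on the number of distinct patterns

for pattern, count in patterns.most_common():
    print(pattern, count)
print(f"{len(patterns)} of {max_patterns} possible patterns occur")
```

The all-ones tuple is the "no missing values" pattern; in real data usually only a small share of the 2^P possible patterns actually occurs.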
▪ Non-response rates
Percent missing: computed for each variable
  → screen out “hopeless” variables
Attrition rate: for longitudinal data (monotone pattern); proportion of participants that drop out at a given time point
Percent of complete cases: relevant for list-wise deletion (which is a bad method)
Covariance coverage: % of cases available to examine a pairwise relationship
  → instances with observed values for both required variables
Fraction of missing information: measure of how well we treat missing values; the share of uncertainty in an estimate that is due to the missing data
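Three of these rates (percent missing, percent of complete cases, covariance coverage) can be computed directly from the data. The sketch below assumes missing cells are stored as None and uses a hypothetical toy dataset with made-up variable names; attrition rate and fraction of missing information are not covered because they need longitudinal data or a fitted model.

```python
from itertools import combinations

def percent_missing(rows, names):
    """Percent of missing cells, computed separately for each variable."""
    n = len(rows)
    return {
        name: 100 * sum(row[j] is None for row in rows) / n
        for j, name in enumerate(names)
    }

def percent_complete_cases(rows):
    """Percent of rows with no missing cell (what list-wise deletion keeps)."""
    complete = sum(all(v is not None for v in row) for row in rows)
    return 100 * complete / len(rows)

def covariance_coverage(rows, names):
    """Percent of rows in which both variables of each pair are observed."""
    n = len(rows)
    return {
        (names[i], names[j]): 100
        * sum(r[i] is not None and r[j] is not None for r in rows) / n
        for i, j in combinations(range(len(names)), 2)
    }

# Hypothetical toy data; None marks a missing cell
names = ("x", "y", "z")
rows = [
    (1.2, 3.4, 5.0),
    (None, 2.2, 4.1),
    (0.7, None, None),
    (1.9, 2.8, 3.3),
    (None, 1.5, 2.0),
]

pm = percent_missing(rows, names)
pcc = percent_complete_cases(rows)
cc = covariance_coverage(rows, names)
print(pm, pcc, cc, sep="\n")
```

In this toy example the pair (y, z) keeps far more cases than the complete-case set does, which is the kind of gap covariance coverage is meant to expose.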