DATA (GEA 1000) CHEAT SHEET 2025-2026
UNIVERSITY OF SINGAPORE
Research Targets
Population: Entire group we wish to know something about
Sample: A proportion of the population selected in the study
Sampling frame: “Source Material” from which sample is drawn
Census: An attempt to reach out to the entire population of interest
Major Biases
• Selection bias refers to the researcher’s biased selection of participants
• Non-response bias refers to participants’ non-participation in the
research
Probability Sampling Methods
• Simple random sampling (SRS): A sample of size n is chosen from the
sampling frame such that every unit has an equal chance of being selected
• Systematic sampling: The xth unit is chosen from every block of n/k units,
where x and k are chosen integers and n is the size of the sampling frame
• Stratified sampling: The population is divided into groups (strata) and
SRS is applied to each stratum to form the sample
• Cluster sampling: The population is divided into clusters and a fixed
number of clusters is chosen using SRS
Symmetry Rules
Here NA and NB denote "not A" and "not B".
1. rate(A | B) > rate(A | NB) ⟺ rate(B | A) > rate(B | NA)
2. rate(A | B) < rate(A | NB) ⟺ rate(B | A) < rate(B | NA)
3. rate(A | B) = rate(A | NB) ⟺ rate(B | A) = rate(B | NA)
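The symmetry rules can be checked numerically on a 2×2 table of counts. The sketch below is only an illustration: the counts are made up, and the helper names `rates`, `sign` and `symmetry_holds` are hypothetical, not part of the course material. It compares the direction of rate(A | B) versus rate(A | NB) with the direction of rate(B | A) versus rate(B | NA).

```python
from fractions import Fraction


def rates(a_b, a_nb, na_b, na_nb):
    """Return the four conditional rates from a 2x2 table of counts.

    a_b   = count with A and B        a_nb  = count with A and not B
    na_b  = count with not A and B    na_nb = count with not A and not B
    """
    return {
        "A|B": Fraction(a_b, a_b + na_b),
        "A|NB": Fraction(a_nb, a_nb + na_nb),
        "B|A": Fraction(a_b, a_b + a_nb),
        "B|NA": Fraction(na_b, na_b + na_nb),
    }


def sign(x):
    """Return 1, 0 or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def symmetry_holds(a_b, a_nb, na_b, na_nb):
    """True if both comparisons point in the same direction."""
    r = rates(a_b, a_nb, na_b, na_nb)
    return sign(r["A|B"] - r["A|NB"]) == sign(r["B|A"] - r["B|NA"])


# Made-up counts: rate(A | B) = 30/40 and rate(A | NB) = 20/60.
print(symmetry_holds(30, 20, 10, 40))  # True
```

Exact fractions are used so that the equality case (rule 3) is not disturbed by floating-point rounding.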
Non-Probability Sampling Methods
Convenience sampling: Subjects are chosen based on ease of availability
Volunteer sampling: Subjects volunteer themselves into a sample
Generalisability Criteria
1. Sampling frame ≥ population
2. Probability sampling method implemented (selection bias ↓)
3. Large sample size (variability and random error ↓)
4. Minimise non-response rate
Basic Rule on Rates
rate(A | B) ≤ rate(A) ≤ rate(A | NB) or vice versa. This means:
The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B)
If rate(B) = 50%, then rate(A) = 0.5[rate(A | B) + rate(A | NB)]
If rate(A | B) = rate(A | NB), then rate(A) = rate(A | B) = rate(A | NB)
Simpson’s Paradox
A phenomenon in which a trend appears in more than half of the
groups of data but disappears or reverses when the groups are combined
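A small numerical sketch may help make Simpson's Paradox concrete. The counts below are illustrative only, and the names `data`, `rate` and `overall` are hypothetical. Treatment X has the higher success rate in each subgroup, yet the lower rate once the subgroups are combined. The last loop also checks the basic rule on rates: each overall rate lies between the two subgroup rates.

```python
from fractions import Fraction

# Illustrative counts: (successes, total) per treatment within each subgroup.
data = {
    "small": {"X": (81, 87), "Y": (234, 270)},
    "large": {"X": (192, 263), "Y": (55, 80)},
}


def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)


def overall(treatment):
    """Success rate of a treatment after combining all subgroups."""
    successes = sum(data[g][treatment][0] for g in data)
    total = sum(data[g][treatment][1] for g in data)
    return rate(successes, total)


# Trend within every subgroup: X beats Y.
x_wins_in_each_group = all(
    rate(*data[g]["X"]) > rate(*data[g]["Y"]) for g in data
)
# Trend after combining: the comparison reverses.
x_wins_overall = overall("X") > overall("Y")

print(x_wins_in_each_group)  # True
print(x_wins_overall)        # False

# Basic rule on rates: the overall rate sits between the subgroup rates.
for t in ("X", "Y"):
    group_rates = [rate(*data[g][t]) for g in data]
    assert min(group_rates) <= overall(t) <= max(group_rates)
```

The reversal happens because X is used mostly in the subgroup with the lower success rates, so subgroup membership acts as a confounder.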
Variable Types
Categorical: Variables that take on mutually exclusive categories
Numerical: Variables with numerical values where arithmetic can be
performed meaningfully
Variable Sub-Types
Ordinal: Categorical variable where there is some natural ordering
Nominal: Categorical variable where there is no intrinsic ordering
Discrete: Numerical variable with gaps in the set of possible values
Continuous: Numerical variable that can take any value in a given range
Random: Numerical variable with probabilities assigned to each value
Properties of Mean (x̄) and Median (r)
1. Adding c to all data points changes x̄ to x̄ + c and r to r + c
2. Multiplying all data points by c changes x̄ to cx̄ and r to cr
Confounders
• A third variable that is associated with both the independent and
dependent variables
• When a confounder is present, segregate the data by the
confounding variable. This method is called slicing
Outliers
• An outlier is an observation that falls well above or below the
overall bulk of the data
• A general rule is that outliers should not be removed
unnecessarily
• x is an outlier if x > Q3 + 1.5·IQR or x < Q1 - 1.5·IQR
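The 1.5·IQR outlier rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's prescribed method: the function name `find_outliers` and the sample data are made up, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is an assumption. Other quartile conventions can give slightly different fences.

```python
import statistics


def find_outliers(data):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Assumption: quartiles use the 'inclusive' interpolation method;
    a different quartile convention may shift the fences slightly.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


# Made-up data with one value far above the bulk.
print(find_outliers([2, 4, 5, 6, 7, 8, 9, 30]))  # [30]
```

Flagging a point this way only marks it for inspection; whether to remove it is a separate decision.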
Properties of Standard Deviation (s_x) and IQR
1. s_x and IQR are never negative; s_x is 0 only when all data points are identical, while IQR can also be 0 whenever Q1 = Q3
2. Adding c to all data points does not change s_x or IQR
3. Multiplying all data points by c changes s_x to |c|s_x and IQR to |c|IQR
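The shift and scale properties of the mean, median, standard deviation and IQR can be verified on a small data set. The sketch below uses made-up data and a hypothetical helper `summary`. It assumes the sample standard deviation (`statistics.stdev`) and the "inclusive" quartile method, neither of which is specified in the sheet. A negative c is chosen to show why the absolute value |c| appears.

```python
import math
import statistics


def summary(data):
    """Mean, median, sample SD and IQR (quartile method is an assumption)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
        "iqr": q3 - q1,
    }


data = [2, 4, 4, 5, 10]  # made-up data
c = -3

base = summary(data)
shifted = summary([x + c for x in data])
scaled = summary([x * c for x in data])

checks = {
    # Adding c moves the centre but leaves the spread unchanged.
    "mean shifts by c": math.isclose(shifted["mean"], base["mean"] + c),
    "median shifts by c": math.isclose(shifted["median"], base["median"] + c),
    "sd unchanged by shift": math.isclose(shifted["sd"], base["sd"]),
    "iqr unchanged by shift": math.isclose(shifted["iqr"], base["iqr"]),
    # Multiplying by c scales the centre by c and the spread by |c|.
    "mean scales by c": math.isclose(scaled["mean"], c * base["mean"]),
    "median scales by c": math.isclose(scaled["median"], c * base["median"]),
    "sd scales by |c|": math.isclose(scaled["sd"], abs(c) * base["sd"]),
    "iqr scales by |c|": math.isclose(scaled["iqr"], abs(c) * base["iqr"]),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```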
Study Designs
Observational study: Individuals are observed and variables are
measured without any manipulation
Experimental study: The independent variable is intentionally
manipulated to observe its effect on the dependent variable
Analysing Histograms
Blinding