Lecture 1: Confidence Intervals and Hypothesis Testing
Lecture 2: Sample Size Calculations & Wilcoxon Rank Tests
Lecture 3: One and Two Proportions
Lecture 4: Chi-Square Test & Correlation
Lecture 5: Linear Models: Simple Linear Regression
Lecture 6: Multiple Linear Regression 1
Lecture 7: Multiple Linear Regression 2
Lecture 8: One-Way Analysis of Variance, Pairwise Comparisons, Non-Parametric F-Test
Lecture 9: Two-Way ANOVA aka Factorial ANOVA
Lecture 10: Block Design & Relative Efficiency (RE)
Lecture 11: Quantitative and Categorical x-Variables: ANCOVA / General Linear Models
Lecture 1: Confidence Intervals and Hypothesis Testing
What is a confidence interval? A confidence interval for a population parameter gives a range of plausible values for that parameter based on the sample: any value inside the interval is consistent with the observed data.
Frequentist interpretation: A 1−α (for example, 95%) confidence interval procedure means: if we repeated the exact sampling and interval-construction process many times (say 100 times), then about 100×(1−α) of those intervals would contain the true population parameter.
So for a 95% CI: “We are 95% confident that the true parameter is inside this interval.” This is not the same as saying there is a 95% probability that the particular interval you computed contains the parameter; the probability statement refers to the procedure over repeated samples.
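A quick way to see this repeated-sampling interpretation is to simulate it. The sketch below (a made-up normal population with mean 10 and sd 2; all numbers are purely illustrative) draws many samples in R, builds a 95% t-interval from each, and counts how often the true mean is captured:

set.seed(1)                                    # for reproducibility
covered <- replicate(10000, {
  x <- rnorm(20, mean = 10, sd = 2)            # one sample of size 20
  ci <- t.test(x, conf.level = 0.95)$conf.int  # the 95% CI from this sample
  ci[1] <= 10 && 10 <= ci[2]                   # does it contain the true mean?
})
mean(covered)                                  # proportion of hits, close to 0.95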
General formula for a two-sided t-based CI for a mean or difference of means
For many t-procedures the two-sided 100(1−α)% confidence interval has the form:
$$\text{estimate} \pm t_{df}(\alpha/2) \times \text{standard error}$$
estimate = the point estimate (e.g., $\bar{x}$ for a single mean, or $\bar{x}_1 - \bar{x}_2$ for a difference of means).
$t_{df}(\alpha/2)$ = critical value from the Student’s t distribution with the appropriate degrees of freedom, for the two-tailed α-level.
standard error = depends on the problem (see formulas below).
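As a concrete sketch of this formula (with made-up data), a 95% t-interval for a single mean can be computed by hand in R and checked against t.test():

x  <- c(5.1, 4.9, 6.2, 5.8, 5.5, 4.7)  # hypothetical sample
n  <- length(x)
se <- sd(x) / sqrt(n)                  # standard error of the mean
tc <- qt(1 - 0.05 / 2, df = n - 1)     # critical value t_df(alpha/2)
mean(x) + c(-1, 1) * tc * se           # estimate +/- t * SE
t.test(x)$conf.int                     # same interval from t.test()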
Factors that make a CI narrower (more precise): larger sample size n, smaller variability in the data (smaller s), and a lower confidence level (smaller 1−α), though lowering the confidence level means the interval captures the parameter less often.
The t distribution and degrees of freedom: The t distribution is similar to the normal distribution but has heavier tails; it is used when the population standard deviation σ is unknown and must be estimated from the data. As the sample size (or degrees of freedom) grows, the t distribution approaches the normal distribution. The degrees of freedom determine the exact shape of the t distribution.
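This convergence is easy to check numerically: the two-tailed 95% critical value qt(0.975, df) shrinks toward the normal value qnorm(0.975) ≈ 1.96 as df grows:

sapply(c(2, 5, 10, 30, 100, 1000), function(df) qt(0.975, df))
# approx. 4.30 2.57 2.23 2.04 1.98 1.96
qnorm(0.975)  # 1.959964, the normal limit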
Degrees of freedom (df) quantify how well the standard deviation s is estimated; more df → closer to normal. Typical df:
o One-sample mean or paired differences: $df = n - 1$.
o Two-sample pooled t (equal variances assumed): $df = n_1 + n_2 - 2$.
o Welch’s (unequal variances): a complicated approximation (the Welch–Satterthwaite formula), typically non-integer. See the formula below.
Intuitively: df reflect how much independent information you had to estimate variability.
Standard errors: formulas you must know
1. One-sample mean: $SE(\bar{x}) = \dfrac{s}{\sqrt{n}}$
where s is the sample standard deviation and n is the sample size.
2. Paired t (differences)
o Convert paired observations to differences: $d_i = x_{i,\text{after}} - x_{i,\text{before}}$.
o Use the one-sample formulas on the differences: $\bar{d} = \frac{1}{n}\sum d_i$, $SE(\bar{d}) = \dfrac{s_d}{\sqrt{n}}$.
o $df = n - 1$, where n is the number of pairs.
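A minimal paired-t sketch in R (the before/after numbers are made up); note that the paired test is identical to a one-sample t-test on the differences:

before <- c(140, 152, 138, 147, 160, 155)  # hypothetical 'before' measurements
after  <- c(135, 150, 132, 144, 154, 151)  # hypothetical 'after' measurements
t.test(after, before, paired = TRUE)       # paired t-test, df = n - 1 = 5
t.test(after - before)                     # same result: one-sample t on differences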
3. Two-sample t with equal variances (pooled)
o Pool the sample variances to get a pooled variance: $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, and $s_p = \sqrt{s_p^2}$.
o Standard error of the difference of means: $SE(\bar{x}_1 - \bar{x}_2) = s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$.
o $df = n_1 + n_2 - 2$.
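In R the pooled test is requested with var.equal = TRUE (the data below are hypothetical):

g1 <- c(12.1, 11.4, 13.0, 12.7, 11.9)       # hypothetical group 1
g2 <- c(10.2, 11.1, 9.8, 10.9, 10.4, 11.3)  # hypothetical group 2
t.test(g1, g2, var.equal = TRUE)            # pooled t: df = n1 + n2 - 2 = 9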
4. Two-sample t without equal variances (Welch’s t)
o Do not pool the variances. Use: $SE = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$.
o Approximate the degrees of freedom using the Welch–Satterthwaite formula:
$$df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$
(This yields a positive real number; statistical software uses this.)
o Welch’s test is the default in R and is safer when the variances differ.
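The Welch SE and df can be reproduced by hand and checked against R’s default t.test() (reusing the hypothetical g1 and g2 from above):

s1 <- var(g1); s2 <- var(g2)
n1 <- length(g1); n2 <- length(g2)
se <- sqrt(s1 / n1 + s2 / n2)            # Welch standard error
df <- (s1 / n1 + s2 / n2)^2 /
  ((s1 / n1)^2 / (n1 - 1) + (s2 / n2)^2 / (n2 - 1))  # Welch–Satterthwaite
c(se = se, df = df)                      # df is typically non-integer
t.test(g1, g2)                           # Welch is the default; its df matches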
Sampling distribution of the difference between two sample means
We consider two independent samples:
Sample 1: size $n_1$, sample mean $\bar{y}_1$, population variance $\sigma_1^2$
Sample 2: size $n_2$, sample mean $\bar{y}_2$, population variance $\sigma_2^2$
We are interested in the statistic $\bar{y}_1 - \bar{y}_2$, which is an estimator of the population difference $\mu_1 - \mu_2$.
The sampling distribution of $\bar{y}_1 - \bar{y}_2$ is approximately normal for large samples because of the Central Limit Theorem (CLT): each sample mean is approximately normal when the sample size is large or the population is normal, and the difference of two normally distributed variables is also normally distributed. So $\bar{y}_1 - \bar{y}_2$ is approximately normally distributed.
The expected value (mean) of $\bar{y}_1 - \bar{y}_2$ is: $\mu_{\bar{y}_1 - \bar{y}_2} = \mu_1 - \mu_2$
This makes intuitive sense because, on average, a sample mean estimates its population mean. Therefore, the difference of two sample means estimates the difference of two population means.
The standard error of the sampling distribution is:
$$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Why this formula? The variance of a sample mean is $\sigma^2 / n$. Since the samples are independent, the variances add; taking the square root then gives the standard error. This is the general formula, used when the variances are not assumed equal.
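A short simulation makes the variance-addition argument concrete: the empirical variance of $\bar{y}_1 - \bar{y}_2$ over many repeated samples matches $\sigma_1^2/n_1 + \sigma_2^2/n_2$ (the population values below are made up):

set.seed(2)
m1 <- 15; m2 <- 20; sd1 <- 3; sd2 <- 2   # hypothetical sizes and population sds
diffs <- replicate(50000,
  mean(rnorm(m1, 0, sd1)) - mean(rnorm(m2, 0, sd2)))
var(diffs)                # empirical variance of the difference of means
sd1^2 / m1 + sd2^2 / m2   # theoretical value: 9/15 + 4/20 = 0.8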
When we assume the two population variances are equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$, the formula simplifies. In that case:
$$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
But we do not know $\sigma^2$; it is a population value, so we must estimate it using sample data. That is where the pooled standard deviation comes in: since we assume both populations have the same variance, the best estimate of that common variance is a pooled (combined) estimate.
Definition shown in the slide:
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Meaning:
We take each sample’s variance $s_1^2$, $s_2^2$
Weight them by their degrees of freedom $n_i - 1$
Average them
Then take the square root
This is a more accurate estimate of a shared variance than using either sample alone.
Degrees of freedom for the pooled variance: $df = n_1 + n_2 - 2$
This matches how many independent pieces of information were used in estimating the common
variance.
Once you have $s_p$, the standard error of the sample difference becomes:
$$SE(\bar{y}_1 - \bar{y}_2) = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
This is the formula used for a pooled t-test or CI for two means with equal variances.
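Putting the pieces together by hand (again using the hypothetical g1 and g2 from the earlier examples), the pooled sd, SE, and resulting 95% interval agree with t.test(..., var.equal = TRUE):

n1 <- length(g1); n2 <- length(g2)
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)  # pooled variance
sp  <- sqrt(sp2)                               # pooled standard deviation
se  <- sp * sqrt(1 / n1 + 1 / n2)              # SE of the difference of means
tc  <- qt(0.975, df = n1 + n2 - 2)             # two-tailed critical value
(mean(g1) - mean(g2)) + c(-1, 1) * tc * se     # 95% CI for mu1 - mu2
t.test(g1, g2, var.equal = TRUE)$conf.int      # matches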
Confidence interval for $\mu_1 - \mu_2$ (equal variances)
The slide shows the formula:
$$(\bar{y}_1 - \bar{y}_2) \pm t_{n_1 + n_2 - 2}(\alpha/2) \times s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Where: