STA3022F Summaries
Chapter 1 - Chapter 10
2023
CH 1: DATA
Data types:
- Numerical variables: measurements that can be recorded on a quantitative scale
where the intervals between two values on the scale have some consistent
meaning
• Ex. Height, age, number of children
• Can further classify numerical variables as continuous if they can take on any
intermediate value on the scale (e.g. height) or discrete if the values a variable
can take on are limited in some way, often to the set of whole numbers (e.g.
number of children).
- Categorical variables: measurements of individuals in terms of groups or
categories where the gaps between categories have no intrinsic meaning.
- Ratio-scaled numerical variables are those that have a natural zero point (like age,
height, and income). Called ratio-scaled because ratios between values are
meaningful and are not sensitive to the units of measurement.
- Interval-scaled variables are still numeric but do not have a natural zero point (IQ
and temperature in degrees Celsius are of this type). Interval-scaled variables
therefore have an arbitrary zero point and an arbitrary scale
- Ordinal categorical variables are those where the categories can be ordered even
if the gaps between them cannot be interpreted (such as level of education, which
can be ordered: none, primary-school, high-school, undergraduate degree,
postgraduate degree)
- Nominal categorical variables cannot be ordered in any meaningful way (such as
race or language group)
- Likert/rating scales:
• measurement scale usually ranging from some negatively worded statement
(e.g. “strongly disagree”, “terrible”) to some positively worded statement (e.g.
“strongly agree”, “excellent”).
• categorical because the numbers are only being used as labels for the written
descriptions, and a gap of one unit cannot be consistently interpreted.
Standardising Data
- Data measured on different scales can cause issues in multivariate analysis, as
variables measured on larger scales are given too much influence.
- Steps:
• Calculate the mean and standard deviation of each variable in the data matrix
(i.e. these are the column means and the column standard deviations).
• Subtract the column mean from each element in the data matrix.
• Divide the resulting “element minus mean” by its column standard deviation.
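The steps above can be sketched in Python with NumPy. This is a minimal illustration, not the course's own code: the function name `standardise` and the small data matrix are hypothetical, and the sample standard deviation (divisor n - 1, `ddof=1`) is an assumption, since the notes do not say which divisor is used.

```python
import numpy as np

def standardise(X):
    """Return a column-standardised copy of the data matrix X (n rows, p columns)."""
    X = np.asarray(X, dtype=float)
    col_means = X.mean(axis=0)         # step 1: column means
    col_sds = X.std(axis=0, ddof=1)    # step 1: column standard deviations (assumed divisor n - 1)
    centred = X - col_means            # step 2: subtract the column mean from each element
    return centred / col_sds           # step 3: divide by the column standard deviation

# Hypothetical data: height (cm) and income for four individuals.
X = np.array([[150.0, 10000.0],
              [160.0, 30000.0],
              [170.0, 20000.0],
              [180.0, 60000.0]])
Z = standardise(X)
```

After standardising, every column of `Z` has mean 0 and standard deviation 1, so income no longer dominates height simply because it is recorded in larger numbers.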
Singular Value Decomposition
- D matrix: diagonal matrix with 0’s on the off-diagonals.
• Number of diagonal entries = min(n,p)
• Values in D are singular values (>= 0)
• Singular values are arranged in decreasing order down the diagonal
- SVD is the basis for approximating multivariate data by dimension reduction.
- Huygens’ Principle: the best-fitting approximation necessarily passes through the
centroid, so we centre the data matrix X before doing the approximation (unless
X is already standardised).
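The properties of D and the idea of approximating by dimension reduction can be checked numerically. The sketch below assumes the usual notation X = U D V' (the names `U`, `d`, `Vt` and the small data matrix are illustrative and not taken from the notes) and uses `numpy.linalg.svd`, which returns the singular values as a vector rather than as a diagonal matrix.

```python
import numpy as np

# Hypothetical data matrix with n = 4 rows and p = 3 columns.
X = np.array([[2.0, 0.0, 1.0],
              [4.0, 1.0, 3.0],
              [6.0, 1.0, 2.0],
              [8.0, 2.0, 6.0]])

# Centre the columns first so the approximation passes through the centroid.
Xc = X - X.mean(axis=0)

# Thin SVD: d holds the min(n, p) singular values, largest first.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rank-r approximation: keep only the r largest singular values.
r = 2
X_r = U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]

# Squared error of the approximation and the squared discarded singular values.
squared_error = np.sum((Xc - X_r) ** 2)
discarded = np.sum(d[r:] ** 2)
```

Here `squared_error` and `discarded` agree, which is why small trailing singular values mean little is lost by dropping those dimensions.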
CH 2: PRINCIPAL COMPONENT ANALYSIS
- Main aim: Dimension Reduction
- New uncorrelated variables will be denoted by Y1,…,Yr and these will be linear
combinations of the original variables X1,…,Xp
- Each principal component Yi is a linear combination of the X variables (usually the
original ones in standardised form), chosen so that the first axis (i.e. the first
principal component) lies in the direction containing the most variation.
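A minimal sketch of these ideas, assuming the principal components are obtained from the SVD of the standardised data matrix (one common route; the notes do not prescribe a method here). The simulated data, the seed, and the names `Z`, `Y` and `var_Y` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations on p = 3 variables, two of them correlated.
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(50, 1)),
               2.0 * base + 0.5 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])

# Standardise the original variables (assumed divisor n - 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rows of Vt hold the coefficients of each linear combination.
U, d, Vt = np.linalg.svd(Z, full_matrices=False)

# Principal component scores: column i of Y is the new variable Y_(i+1).
Y = Z @ Vt.T

# Variance of each principal component, largest first.
var_Y = d ** 2 / (Z.shape[0] - 1)
```

The covariance matrix of `Y` is diagonal (the new variables are uncorrelated), and the first entry of `var_Y` is the largest, matching the statement that the first axis lies in the direction containing the most variation.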