Statistics GSS – GEO2-2428
Course aims
1. Understand the theoretical and mathematical basis of statistical methods
2. Determine the appropriate statistical analysis method for a research question
3. Conduct the statistical analysis in R
4. Interpret the findings of the statistical analysis
5. Report the results of statistical analyses in a clear and accurate way
Lecture 1 – Introduction to Statistics GSS – 5/2/2025
Assessment →
Exam 1 (35%) → 28/2/2025
Assignment 1 (15%) → 6/3/2025
Exam 2 (35%) → 4/4/2025
Assignment 2 (15%) → 9/4/2025
Lecture 2 – Descriptive statistics and theory estimates – 5/2/2025
Data variables
- Data variables = different types of data
Experimental setups include:
- Response (dependent): what is under observation (Y)
- Explanatory (independent): what is under control (X)
- In an XY-plot, the response is typically on the y-axis and the explanatory variable on the x-axis
Why is understanding data types so important?
- The hardest part of any statistical work is choosing the right statistical analysis.
The choice depends on the nature of your data and the particular question you
are trying to answer
Types of data: dimensions and units are important!
Numeric vs categorical data
Numeric data is recorded as a quantifiable number.
- It can be continuous – able to take any value within a range, including fractional
values with many decimals → e.g., time, length, weight, area, etc.
- It can also be discrete – whole number values → e.g., data collection day,
number of individuals, count of an occurrence, etc.
Categorical data is recorded as a qualitative characteristic.
- Ordinal – categories with an ordered relation → e.g., small, medium, large;
none, low, moderate, high
- Nominal – categories without an ordered relation → e.g., municipality, color,
species
- Binomial (binary) – categories with only two possibilities → e.g., yes/no
Organizing our data: how to construct a data frame
- In a data frame, data for each variable should be organized into a column
- The number of rows should be equal to the number of observations (n)
- Data frames provide a clear tabular format (matrix) that data analysis tools such as
Excel and RStudio can readily interpret
- Proper data input = proper plotting and statistics
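The layout rules above can be sketched in a few lines of Python (used here only for illustration; the variable names and values are hypothetical and not taken from the course material). Each variable is stored as one column, and a small check confirms that every column has one value per observation:

```python
# Hypothetical example data: each key is a variable (column),
# each list position is one observation (row).
data_frame = {
    "site": ["A", "B", "C", "D"],                          # categorical, nominal
    "soil_moisture": ["low", "high", "moderate", "low"],   # categorical, ordinal
    "plant_count": [12, 30, 21, 9],                        # numeric, discrete
    "mean_height_cm": [14.2, 22.8, 18.5, 11.9],            # numeric, continuous
}


def n_observations(columns):
    """Return the number of rows, checking every column has the same length."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns must have one value per observation")
    return lengths.pop()


n = n_observations(data_frame)

# Row i is the i-th observation across all variables.
rows = [dict(zip(data_frame, values)) for values in zip(*data_frame.values())]
print(n, rows[0])
```

If one column were shorter than the others, `n_observations` would raise an error, which is the kind of input problem that otherwise surfaces later as a broken plot or statistic.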
What comes next in a statistical analysis
- Descriptive statistics: what does our data look like?
- Inferential statistics: what can we infer from that?
Descriptive vs inferential statistics
Descriptive statistics describe data using:
- Graphs, e.g., boxplots, histograms, scatterplots
- Tables
- Summary calculations, e.g., median, mean/average, standard deviation
Inferential statistics make general conclusions by analyzing trends within a sample
and comparing them to standard models to (try to) understand:
- How does a sample relate to generalized findings and vice-versa?
- Are any differences more than a coincidence (i.e., are they statistically significant)?
- How can past and current data help to project future outcomes?
Why is central tendency important?
- Mode: most often recorded value
- Median: middle value
- Mean: average value
- In a normal distribution: mode = mean = median
- Central limit theorem: for large enough sample sizes, the distribution of sample
means is approximately ‘normal’ around the center value
- The range from Q1 to Q3 (one quartile either side of the median) contains 50% of
the observations
- In a normal distribution, ±1 standard deviation from the mean contains approx.
68% of the observations
- Data is often not ‘normal’
- Right skew: mode < median < mean
- Left skew: mean < median < mode
- The first step in stats is to check how ‘normally’ your data is spread around its
middle
- Mean = average = (sum of observations)/(total number of observations)
- Median = middle value when the values are ordered from smallest to largest. If
there is no single middle value (even number of observations), take the average
of the two middle values
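As a minimal sketch of the three measures of central tendency, the Python snippet below uses the standard `statistics` module on a small, made-up sample. The data are an assumption chosen to be right-skewed, so the ordering mode < median < mean can be checked directly:

```python
import statistics

# Hypothetical right-skewed sample (one large value pulls the mean up).
values = [2, 3, 3, 3, 4, 5, 6, 8, 20]

mode = statistics.mode(values)      # most frequently recorded value
median = statistics.median(values)  # middle value of the sorted data
mean = statistics.mean(values)      # sum of observations / number of observations

print(mode, median, mean)

# Right skew shows up as mode < median < mean.
print(mode < median < mean)

# Even number of observations: the median is the average of the two middle values.
even_median = statistics.median([1, 3, 5, 7])
print(even_median)
```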
Dispersion: deviation from the mean
- Deviation (dev) = how much a data point differs from the mean
1. Sum of squares
- $SS_x = \sum (x_i - \bar{x})^2$, where $\bar{x}$ is the average (mean) of $x$
2. Degrees of freedom and the variance
- $df = n - 1$. The mean is a set value used for comparison, which uses up one
degree of freedom (hence −1); df is the maximum number of values that are free to
vary once the mean is fixed
- $S^2 = var_x = SS_x / df$
- A variance of 0 means that none of the data points diverge from the mean: there
is no variation
3. Standard deviation
- Variance is a squared metric (var or $S^2$). To return to the original units of the
data, take its square root: the standard deviation (sd or S)
- $sd_x = \sqrt{var_x}$
- The standard deviation tells us how spread out our data is around the mean
4. Coefficient of variation
- The ratio of the standard deviation to the mean, expressed as a percentage
- Coefficient of variation = $CV = (sd_x / \bar{x}) \times 100$, where $\bar{x}$ is the average
- Tells how spread out the data is relative to the sample mean
- High CV = large spread/variation relative to the mean, less central tendency, flatter
bell curve
- Low CV = small spread/variation relative to the mean, more central tendency, steeper
bell curve
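The four steps above build on each other, which a short worked computation may make clearer. The sketch below is in Python with hypothetical measurements (not course data) and follows the notes' order: sum of squares, degrees of freedom and variance, standard deviation, coefficient of variation.

```python
import math

# Hypothetical sample of five measurements.
x = [4.0, 6.0, 8.0, 10.0, 12.0]

n = len(x)
mean = sum(x) / n

ss = sum((xi - mean) ** 2 for xi in x)  # 1. sum of squares
df = n - 1                              # 2. degrees of freedom
var = ss / df                           #    variance, S^2
sd = math.sqrt(var)                     # 3. standard deviation
cv = sd / mean * 100                    # 4. coefficient of variation, in %

print(ss, df, var, round(sd, 3), round(cv, 1))
```

For this sample the deviations from the mean of 8 are -4, -2, 0, 2 and 4, so the sum of squares is 40 and the variance is 40/4 = 10.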
Data quartiles
1. First quartile, Q1
- $Q_1 = X_{(n+1)/4}$, i.e., the value at position $(n+1)/4$ in the ordered data
2. Second quartile, Q2 (the median)
- $Q_2 = X_{(n+1)/2}$
3. Third quartile, Q3
- $Q_3 = X_{3(n+1)/4}$
Interquartile range (IQR)
- Measures spread of the middle 50% of the data
- IQR = Q3 – Q1
- Large IQR = more dispersed mid-range
- Small IQR = more clustered mid-range
Outliers
- Values that fall outside the range between the lower (Min) and upper (Max) limits derived from the quartiles
- Min = Q1 – 1.5 * IQR
- Max = Q3 + 1.5 * IQR
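The quartile, IQR and outlier rules can be tried out on a small data set. The Python sketch below uses made-up values and assumes the position-based quartile rule, with quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 in the ordered data. This is what `statistics.quantiles` computes with `method="exclusive"`; other software may use a different quartile convention and give slightly different numbers.

```python
import statistics

# Hypothetical sample; n = 11, so (n + 1) / 4 = 3 is a whole position.
data = sorted([7, 2, 15, 4, 9, 40, 4, 12, 5, 18, 10])

# method="exclusive" places Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2, 3(n+1)/4
# and interpolates linearly when a position falls between two values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr  # "Min" in the notes
upper_limit = q3 + 1.5 * iqr  # "Max" in the notes

outliers = [v for v in data if v < lower_limit or v > upper_limit]
print(q1, q2, q3, iqr, lower_limit, upper_limit, outliers)
```

Here the third, sixth and ninth ordered values are the quartiles (4, 9 and 15), the IQR is 11, and only the value 40 lies beyond the upper limit of 31.5.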
Statistical Toolbox Part 1
Measures of central tendency
- Mean (average)
- Median
Measures of dispersion (spread)
- The sum of squares (SS)
- Degrees of freedom (df)
- Variance ($S^2$, var)
- Standard deviation (sd)
- Coefficient of variation (CV)
- Inter-quartile range (IQR)
Descriptive statistics in research
Very useful for:
- Data cleaning
- Data preparation
- Providing (initial) insights into the dataset
Where to find/include in a report:
- Methods: data cleaning, preparation, and characterization
- Results: show/use (in part) descriptive statistics
Population vs. sample
- Population = universe of units
- Sample = segment of population selected for research
Why a sample?
- Resources
- Data availability
- The main reason is efficiency, and the disadvantage is uncertainty
Population vs. sample: standard notation
Population parameter | Sample statistic
Size = N (number of observations) | Size = n (number of observations)
Average/mean = $\mu$ | Mean = m or $\bar{y}$
Standard deviation = $\sigma = \sqrt{\sum (x_i - \mu)^2 / N}$ | Standard deviation = s, sd, or dev = $\sqrt{\sum (y_i - \bar{y})^2 / (n - 1)}$
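The only difference between the two standard deviation formulas is the divisor: N for a population, n − 1 for a sample. The Python sketch below applies both to the same hypothetical values (the function names are illustrative) and cross-checks them against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics

# Hypothetical data, treated first as a full population, then as a sample.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def sd_population(values):
    """Population standard deviation: divide the sum of squares by N."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def sd_sample(values):
    """Sample standard deviation: divide the sum of squares by n - 1."""
    y_bar = sum(values) / len(values)
    return math.sqrt(sum((v - y_bar) ** 2 for v in values) / (len(values) - 1))


sigma = sd_population(y)
s = sd_sample(y)

# The standard library implements the same two formulas.
assert math.isclose(sigma, statistics.pstdev(y))
assert math.isclose(s, statistics.stdev(y))
print(sigma, round(s, 3))
```

The sample version is always slightly larger than the population version for the same values, because dividing by n − 1 compensates for estimating the mean from the data.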
Ensuring adequate sample size: why is it important?
- Central limit theorem: with samples of at least about 30 observations, the
distribution of the sample mean is generally close to normal
- More observations = higher n = higher df → more certainty in dataset/results +
stronger statistical inference
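A small simulation can make the sample-size argument concrete. The Python sketch below is an illustration under stated assumptions: the population is artificial (exponentially distributed, so clearly skewed), and the sample sizes 5 and 30 and the number of repeats are arbitrary choices. It draws many random samples and looks at how their means behave.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(20_000)]


def sample_means(pop, n, repeats=2_000):
    """Draw `repeats` random samples of size n and return their means."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(repeats)]


means_small = sample_means(population, 5)
means_large = sample_means(population, 30)

centre = statistics.fmean(means_large)
spread = statistics.stdev(means_large)

# Share of sample means within one standard deviation of their centre;
# for a normal distribution this is about 0.68.
within_one_sd = sum(abs(m - centre) <= spread for m in means_large) / len(means_large)

print(round(statistics.fmean(population), 2), round(centre, 2))
print(round(statistics.stdev(means_small), 2), round(spread, 2))
print(round(within_one_sd, 2))
```

The means of the larger samples should cluster more tightly around the population mean than those of the smaller samples, which is the sense in which a higher n gives more certainty.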
Randomization
- The process of assigning participants to treatment and control groups in such a way
that each participant has an equal chance of being assigned to any group
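One simple way to give every participant an equal chance is to shuffle the participant list and split it in half. The Python sketch below shows this; the function name `randomize`, the participant IDs and the seed are hypothetical, and real studies may use other schemes such as blocked or stratified randomization.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the input order is untouched
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical participant IDs.
participants = [f"P{i:02d}" for i in range(1, 11)]
groups = randomize(participants, seed=42)
print(groups)
```

Fixing the seed makes the assignment reproducible for reporting; leaving it as `None` gives a fresh random assignment on each run.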
Hypothesis testing