Lecture 1
• Data
• Numerical (quantitative)
• Continuous (14,271…)
• Discrete (1,2…)
• Categorical (qualitative, can be coded numerical)
• Coded
• Verbal label
• Measurement level
• Data matrix/frame
• Columns: variables
• Rows: subjects/cases
• Cells: observations
• WARNINGS
• Coding has no impact on variable type
• Add metadata to the dataset: a codebook with all variable descriptions and coding
schemes
• Missing data is often coded (NA), deleted, or imputed (replaced by an estimated value)
• Outliers are observations that behave substantially differently from the
bulk of the data; they heavily influence the mean and other outcomes
• Check whether your results change with versus without the outliers,
and report and interpret this
• Option 1: delete the outliers
• Option 2: censor them at the 99th percentile or at some other fixed value
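The outlier check described above can be sketched in a few lines of Python. This is only an illustration: the sample, the cut-off of 100 used for deletion, and the helper `percentile` (linear interpolation, one of several common percentile definitions) are all assumptions, not part of the lecture material.

```python
import statistics

def percentile(values, p):
    """p-th percentile by linear interpolation (one of several common definitions)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])

# Hypothetical sample: nine ordinary values and one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]

# Option 1: delete the suspected outlier (the cut-off 100 is an arbitrary choice).
deleted = [x for x in data if x < 100]

# Option 2: censor (cap) values above a high percentile.
# The notes mention 99%; with only ten values the 90th percentile is used here
# so that the effect is visible.
cap = percentile(data, 90)
censored = [min(x, cap) for x in data]

# Compare the results with and without the outlier treatment.
for label, sample in [("all data", data), ("deleted", deleted), ("censored", censored)]:
    print(f"{label:>8}: mean = {statistics.mean(sample):.2f}, "
          f"median = {statistics.median(sample):.2f}")
```

In this sample the median barely moves across the three versions, while the mean changes heavily, which is exactly the kind of difference that should be reported and interpreted.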
• Measures of centre
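• The mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Excel function: AVERAGE
• The median: the middle observation of the sorted data (the average of the two middle observations when $n$ is even)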
• The minimum and maximum observation
• The mid-range:
average of the minimum and maximum observation (sensitive measure to
outliers)
• The 25th and 75th (and other) percentiles: 𝑝th percentile for us will be the
• Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$
Excel function: GEOMEAN
• 𝒌% trimmed mean: take the sample mean discarding the 𝑘% highest and
lowest observations (to protect yourself from outliers, but use more info than
the median)
* We might prefer the median, because it is a more easily understood and
recognized measure of central tendency. Use the geometric mean only in the
context of growth rates.
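As a rough sketch, the measures of centre listed above can be computed with Python's standard `statistics` module. The sample, the growth factors, and the helper `trimmed_mean` are made up for illustration; the trimming rule here simply drops `int(n * k / 100)` observations from each tail, which is one possible convention.

```python
import statistics

def trimmed_mean(values, k_percent):
    """Mean after discarding the k% lowest and the k% highest observations."""
    ordered = sorted(values)
    cut = int(len(ordered) * k_percent / 100)  # observations dropped per tail
    return statistics.mean(ordered[cut:len(ordered) - cut])

# Hypothetical sample with one large value.
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 57]

mean = statistics.mean(data)
median = statistics.median(data)
mid_range = (min(data) + max(data)) / 2
trimmed = trimmed_mean(data, 10)

# Geometric mean in its natural setting: averaging growth factors.
growth_factors = [1.10, 1.20, 0.95]  # hypothetical: +10%, +20%, -5%
average_growth = statistics.geometric_mean(growth_factors) - 1

print("mean:", mean)
print("median:", median)
print("mid-range:", mid_range)
print("10% trimmed mean:", trimmed)
print(f"average growth rate: {average_growth:.4f}")
```

The large value pulls the mean and especially the mid-range upwards, while the median and the trimmed mean stay close to the bulk of the data.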
• Measures of variability or spread
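• The sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$, where $\bar{x}$ is the sample mean
(Excel function: VAR; VARP gives the population version, which divides by $n$)
and the standard deviation $s = \sqrt{s^2}$ (Excel function: STDEV)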
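• Range: maximum minus minimum observation
• Interquartile range: 75th percentile minus 25th percentile
• Mean Absolute Deviation: $\frac{1}{n}\sum_{i=1}^{n}|x_i-\bar{x}|$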
• Frequency:
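The spread measures above can be illustrated with a short Python sketch. The sample is hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the percentile definition used in the lecture or in Excel.

```python
import statistics

# Hypothetical sample.
data = [4, 7, 8, 10, 11, 12, 18]

mean = statistics.mean(data)
sample_var = statistics.variance(data)   # divides by n - 1
sample_sd = statistics.stdev(data)       # square root of the sample variance
pop_var = statistics.pvariance(data)     # divides by n

value_range = max(data) - min(data)

# Quartiles with the module's default method; other definitions exist.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Mean absolute deviation around the sample mean.
mad = sum(abs(x - mean) for x in data) / len(data)

print(f"sample variance: {sample_var:.3f}, population variance: {pop_var:.3f}")
print(f"standard deviation: {sample_sd:.3f}")
print(f"range: {value_range}, IQR: {iqr}, MAD: {mad:.3f}")
```

The sample variance is always a little larger than the population version because it divides by n - 1 instead of n.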
• Skewness
measure of asymmetry
• Kurtosis
measure of tail flatness/fatness
large: more chance of outliers/huge outcomes
Mainly used as a benchmark for normality: a normal distribution has Kurt = 3 (and skewness 0, reflecting symmetry)
NB: Excess Kurtosis = Kurtosis –3
So 𝐾𝑢𝑟𝑡≈3 is the same as 𝐸𝑥𝑐𝑒𝑠𝑠𝐾𝑢𝑟𝑡≈0
Beware: R gives kurtosis, while Excel gives excess kurtosis
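To make the difference between kurtosis and excess kurtosis concrete, here is a minimal sketch based on plain moment formulas (third and fourth central moments divided by powers of the second). This is an assumption about the definition: Excel and some R packages apply small-sample corrections, so their numbers will not match these exactly.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the sample."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based kurtosis: about 3 for normally distributed data."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    """Kurtosis minus 3, so that the normal benchmark becomes 0."""
    return kurtosis(values) - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 20]   # hypothetical sample with a long right tail

print("skewness (symmetric):   ", skewness(symmetric))
print("skewness (right-skewed):", round(skewness(right_skewed), 3))
print("kurtosis (symmetric):   ", round(kurtosis(symmetric), 3))
print("excess kurtosis:        ", round(excess_kurtosis(symmetric), 3))
```

When comparing software output, first check which of the two versions is reported, since they differ by exactly 3.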
Lecture 2
• Probability P(A)
- 0 ≤ P(A) ≤ 1
a computed value above 1 or below 0 cannot be a probability and signals a calculation error
- A is an event; A’ denotes the complement of A (A does not happen), so P(A’) = 1 - P(A)
- Odds for A: P(A) / (1 - P(A))
Odds against A: (1 - P(A)) / P(A)
- 𝑃 (𝐴 ∪ 𝐵): probability of either A or B or both happening
𝑃 (𝐴 ∩ 𝐵): means probability of both A and B happening jointly
- General law of addition: 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) – 𝑃 (𝐴∩𝐵)
- Conditional probability: 𝑃(𝐴|𝐵) = 𝑃(𝐴∩𝐵)/𝑃(𝐵); the order matters, since 𝑃(𝐴|𝐵) generally differs from 𝑃(𝐵|𝐴)
- General law of multiplication: 𝑃(𝐴∩𝐵) = 𝑃(𝐴|𝐵)𝑃(𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴)
- Disjoint addition: 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
Disjoint: events 𝐴 and 𝐵 in the sample space that have no overlap (no A∩B)
- Marginal probabilities 𝑃(𝐴) and 𝑃(𝐵): P(A) = P(A∩B) + P(A∩B’) and P(B) = P(A∩B) + P(A’∩B)
both are computed from the joint probabilities
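A small numerical sketch may help to see how the rules so far fit together. The four joint probabilities below are invented for illustration; everything else follows from the addition law, the definition of conditional probability, and the multiplication law.

```python
# Hypothetical joint probabilities for two events A and B (they sum to 1).
p_a_and_b = 0.10          # A and B
p_a_and_not_b = 0.20      # A and not B
p_not_a_and_b = 0.30      # not A and B
p_not_a_and_not_b = 0.40  # neither A nor B

# Marginal probabilities from the joint probabilities.
p_a = p_a_and_b + p_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# General law of addition.
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probabilities: the order matters.
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Odds for and against A.
odds_for_a = p_a / (1 - p_a)
odds_against_a = (1 - p_a) / p_a

print(f"P(A) = {p_a:.2f}, P(B) = {p_b:.2f}, P(A or B) = {p_a_or_b:.2f}")
print(f"P(A given B) = {p_a_given_b:.3f}, P(B given A) = {p_b_given_a:.3f}")
print(f"odds for A = {odds_for_a:.3f}, odds against A = {odds_against_a:.3f}")
```

Multiplying either conditional probability by the probability of its conditioning event returns the same joint probability, which is the general law of multiplication.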
- Bayes theorem: 𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴)𝑃(𝐴) / 𝑃(𝐵)
- Independence condition 1: 𝑃(𝐴|𝐵) = 𝑃(𝐴)
Independence condition 2: 𝑃(𝐴∩𝐵) = 𝑃(𝐴)𝑃(𝐵)
*(independent events can happen individually; one has no connection to the other event's chances of happening)
- Permutation: n! / (n - r)!
Combination: n! / (r! (n - r)!)
Factorial: n! = n × (n - 1) × … × 2 × 1
- Mutually exclusive events: two events that cannot occur at the same time. If, on the other hand, each event is unaffected by the other events, they are called independent events.
*Independent events cannot be mutually exclusive, and the other way around (for events with nonzero probability)
- False positive: 𝑃(𝐴|𝑊’), alarm although there is no weapon
- False negative: 𝑃(𝐴’|𝑊), no alarm although there is a weapon
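The alarm and weapon setting lends itself to a short Bayes calculation. The sketch below assumes made-up numbers for the share of weapon carriers and for the false positive and false negative rates; the function name `posterior` is likewise only illustrative.

```python
def posterior(prior, p_alarm_given_weapon, p_alarm_given_no_weapon):
    """P(weapon given alarm) via Bayes theorem."""
    # Total probability of an alarm: weapon carriers plus false alarms.
    p_alarm = (p_alarm_given_weapon * prior
               + p_alarm_given_no_weapon * (1 - prior))
    return p_alarm_given_weapon * prior / p_alarm

# Hypothetical inputs.
p_weapon = 0.001        # P(W): share of people carrying a weapon
false_negative = 0.05   # P(no alarm given weapon)
false_positive = 0.02   # P(alarm given no weapon)

p_weapon_given_alarm = posterior(
    prior=p_weapon,
    p_alarm_given_weapon=1 - false_negative,
    p_alarm_given_no_weapon=false_positive,
)

print(f"P(weapon given alarm) = {p_weapon_given_alarm:.4f}")
```

With these numbers most alarms are false alarms, even though the detector itself looks accurate, because weapons are rare to begin with. This also shows why the order in a conditional probability matters.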
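• Sample correlation coefficient: $r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}}$
Its range is -1 ≤ r ≤ +1.
The relation can be positive or negative and strong or weak; r measures the linear relation only, so a nonlinear (e.g. exponential) relation can have a low r
Excel function: CORREL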
* The correlation for (𝑋, 𝑌) does not change if we replace the data by (𝑎𝑋, 𝑏𝑌) if 𝑎 >
0 and 𝑏 > 0.
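The scaling property above can be checked numerically. The following sketch computes the sample correlation coefficient from its definition (sum of cross-deviations divided by the square root of the product of the sums of squared deviations); the data and the scaling constants are hypothetical.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient computed from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)

# Rescaling with positive constants leaves r unchanged.
r_scaled = correlation([3 * v for v in x], [0.5 * v for v in y])

# A negative constant on one variable flips the sign of r.
r_flipped = correlation([-3 * v for v in x], [0.5 * v for v in y])

print(f"r = {r:.4f}, after positive rescaling = {r_scaled:.4f}, "
      f"after a sign flip = {r_flipped:.4f}")
```

This is why the condition 𝑎 > 0 and 𝑏 > 0 matters: changing units (for example euros to cents) does not affect r, but reversing the direction of one variable does.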