Lecture notes
Lecture 1: Measurement, scaling and norms
Psychological construct (not observable = latent variable) -> observable behaviour
(operational definition)
Example: degree of depression -> response to a test item
Measuring psychological attributes
- Observable behaviour is sensitive to psychological construct
- Determined in systematic way (response to test item)
- For the purpose of making comparisons
1. Intra-individual differences (within one person, over time)
2. Inter-individual differences (between different people)
Scaling
= the way numerical values are assigned to psychological attributes
- More practical: how is a test score or category determined from the
observations?
- 4 scales of measurement (nominal, ordinal, interval, and ratio)
Interpretation of test scores
- Test: systematic behavioural sample
- Scaling: assigning quantitative test score
- Norming: interpretation of test scores
Norming
- Distribution of test scores
- Standard score Z -> Z = (X - Xbar) / sX
1. Number of standard deviations from the mean
2. Positive and negative values
3. Mean = 0, SD = 1
- Converted standard score TX
TX = 10 x ZX + 50 (mean = 50, SD = 10)
- Percentile rank PX: percentage of scores at or below a specific test score ->
for X = 10, PX is the % of people with score ≤ 10
PX = (number of scores ≤ X) / N x 100 -> look up the cumulative percentage in the output table
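A minimal Python sketch of these norming formulas (the data set and variable names are invented for illustration):

```python
import numpy as np

scores = np.array([4, 7, 10, 10, 12, 15, 18, 20])  # hypothetical raw scores X

z = (scores - scores.mean()) / scores.std()  # Z = (X - Xbar) / sX
t = 10 * z + 50                              # TX = 10 x ZX + 50
px = (scores <= 10).mean() * 100             # percentile rank of X = 10

print(z.round(2))  # standard scores: mean 0, SD 1
print(t.round(1))  # T scores: mean 50, SD 10
print(px)          # 50.0 -> half the group scores 10 or lower
```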
Lecture 2: Reliability
2.1
Reliability = the extent to which differences in test scores are a function of real individual
differences (true scores), and the extent to which a test is free of random errors
Validity = the extent to which the test measures what it’s intended to measure, and the
extent to which the test is free of systematic errors
Classical test theory:
= for every subject, the observed score is the sum of the true score and random error: Xo = Xt + Xe
True score: not directly observable (= latent variable -> must be estimated)
Error: difference between observed and true score (Xe = Xo - Xt). Can be positive or
negative and is also a latent variable
Assumptions:
1. µe = 0 -> mean error in the population is zero (no systematic over- or underestimation of
true scores for the population as a whole)
2. ret = 0 -> errors are completely uncorrelated with true scores (no systematic over- or
underestimation of true scores in subpopulations)
3. reiej = 0 -> errors are completely uncorrelated with each other (the error of subject 1 says
nothing about the error of subject 2, etc.)
Variance:
Variance of Xo as a composite variable: Xo = Xt + Xe
Because of assumption 2, the variance of observed scores equals true score variance plus
error variance: s²o = s²t + s²e
Reliability coefficient
= (Rxx) proportion of variance of observed scores explained by true scores: Rxx = s²t / s²o
If the CTT assumptions are valid, then in all cases 0 ≤ Rxx ≤ 1
A proportion of explained variance is a squared correlation; therefore an alternative definition of
reliability is the squared correlation of observed scores with true scores: Rxx = r²ot
Unsquared correlation rot is called reliability index (not used often)
Standard error of measurement (two ways):
se = √(s²o - s²t), or equivalently se = so x √(1 - Rxx)
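The CTT identities above can be checked with a small simulation; a sketch assuming only NumPy, with all distributions invented:

```python
# Simulate CTT: Xo = Xt + Xe with uncorrelated true scores and errors,
# then check Rxx = s²t / s²o = r²ot and the SEM formula.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = rng.normal(100, 15, n)   # true scores (latent)
e = rng.normal(0, 5, n)      # random errors: mean 0, independent of t
o = t + e                    # observed scores

rxx = t.var() / o.var()                  # reliability coefficient
r_ot = np.corrcoef(o, t)[0, 1]           # reliability index
sem = o.std() * np.sqrt(1 - rxx)         # standard error of measurement

print(round(rxx, 3), round(r_ot**2, 3))  # both ~ 225/250 = 0.9
print(round(sem, 2))                     # ~ 5, the error SD
```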
2.2 Estimating reliability
Parallel measurements
1. alternate forms: two different tests for the same construct
2. test-retest: same test at two different times
3. split-half: two parallel half-tests
Alternate forms:
Requirement: the two measurements must be parallel; they should:
- measure exactly the same true scores
- have identical error variances
Consequences:
- identical observed variances: s²o1 = s²o2
- identical correlations with the true score
Problems:
1. Are tests really parallel? Never certain
Partial solutions: domain sampling & checking the consequences of parallelism
2. Carry-over effects (taking test 1 can influence results of test 2 -> can lead to
correlation between tests being too high -> overestimation of reliability)
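In the ideal case (truly parallel forms, no carry-over), the correlation between the two forms equals the reliability. A simulation sketch of that claim (data invented):

```python
# Two parallel forms: same true scores, independent errors with equal variance.
# Their correlation estimates the reliability Rxx = s²t / (s²t + s²e).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
t = rng.normal(100, 15, n)    # shared true scores
x1 = t + rng.normal(0, 5, n)  # form 1 = true score + its own error
x2 = t + rng.normal(0, 5, n)  # form 2 = same true score + a new error

r12 = np.corrcoef(x1, x2)[0, 1]
print(round(r12, 3))          # ~ 225/250 = 0.9 = Rxx
```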
Test-retest
= parallelism is more plausible with test-retest than with alternate forms. Shouldn't a test be parallel to
itself? Yes, but..
- People change: lower rxy -> underestimation of reliability -> keep the time between
test and retest short!
- Carry-over effects: perhaps even stronger than with parallel tests -> can lead to
correlated errors or a change in error variance (over- or underestimation) -> keep the
time between test and retest long!
Split-half (from half-test to total test)
= correlation between (parallel) half-tests -> reliability of the half-tests, but we want reliability for the
whole test, so..
Spearman-Brown formula
= gives the effect on reliability of lengthening (or shortening) the test:
Rnn = (n x rxx) / (1 + (n - 1) x rxx)
“What would be the reliability of the lengthened test if a test with known reliability is made
n times as long?”
n = lengthening factor (for split-half: n = 2, the number of test halves)
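A tiny sketch of the Spearman-Brown formula as code (the function name is mine, not from the lecture):

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test with reliability r is made n times as long."""
    return (n * r) / (1 + (n - 1) * r)

# Split-half: the half-tests correlate .60, so the full test (n = 2) is more reliable.
print(round(spearman_brown(0.60, 2), 3))  # 0.75
```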
Problems:
- Parallelism
- Many splits are possible
Limited solutions:
- Most parallel half-tests
- Parallel item-pairs
- Evaluation of solution (split-half is not used often)