Topics in assessment & selection (Van Iddekinge et al., 2023)
- Reliability and validity of selection methods
- Building, developing, and validating methods
- Fairness and test bias in selection methods
- Utility and decision-making in selection
- Applicant reactions
- Gamification
- AI in selection
Course aim: solve ‘the supreme problem’
“Psychologists should help in the supreme problem of diagnosing each individual, and
steering him toward his fittest place” (Hall, 1917, p. 11)
- What constructs (e.g., cognitive ability, personality) can predict important outcomes
such as job performance? → Can we predict behaviour?
- What methods should be used to measure these constructs?
- How do we ensure these methods are fair and unbiased?
- How do we use the methods to make decisions?
What is a good measure?
Measures must meet a lot of criteria to be useful
- COTAN = Commissie Testaangelegenheden Nederland → Checks the key criteria
Principles of test construction:
- Quality of test material
- Quality of a manual
- Standardization and norms
- Reliability
- Construct validity
- Criterion validity (tomorrow’s lecture)
Reliability
Reliability: “The degree to which measures are free from error and yield consistent results”
- In classical test theory (CTT), X = T + E (observed score = true score + error)
Example: My friends and I went fishing. I caught a big one.
- We wanted to know the unknown fish’s true weight (true score, T)
- But we could only obtain the fish’s observed weight (observed score, X)
- We took multiple measurements that were not identical due to error (E)
Errors in personnel selection → environment, examiner (rater), method (instrument) etc.
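A minimal sketch (not from the lecture; all numbers are made up) of the CTT idea in the fishing example: each observed weight is the true weight plus random error, and the average of repeated measurements lands close to the true score.
```python
# Minimal sketch of X = T + E, assuming normally distributed error (made-up numbers).
import numpy as np

rng = np.random.default_rng(seed=1)

true_weight = 4.2                        # T: the fish's unknown true weight (hypothetical, in kg)
errors = rng.normal(0, 0.3, size=10)     # E: random measurement error on each weighing
observed = true_weight + errors          # X = T + E for ten repeated weighings

print("observed weights:", np.round(observed, 2))
print("mean of the observations:", round(observed.mean(), 2))   # close to the true 4.2
```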
Types of reliability
- Test-retest: consistency in scores over time
- Parallel forms: Equivalence of two versions of the same test → correlation
Internal consistency: how well the items in a test measure the same underlying concept.
- Split-half approach → dividing the test into two halves and correlating the scores from these halves.
- Coefficient alpha ⍺ (average of all possible split-halves)
Inter-rater reliability (IRR): the degree to which raters give consistent scores for the same thing.
- Consistency (r)
- Agreement (Kappa)
- Intraclass coefficients (ICCs, Shrout & Fleiss, 1979)
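A minimal sketch (hypothetical ratings, not from the lecture) of the consistency vs. agreement distinction for IRR: if one rater is systematically one point stricter, the correlation (consistency) is perfect while Cohen's kappa (agreement) is poor.
```python
# Consistency (Pearson r) vs. agreement (Cohen's kappa) for two hypothetical raters.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([2, 3, 4, 5, 3, 4])
rater_b = rater_a - 1                    # same rank order, but systematically 1 point lower

r, _ = pearsonr(rater_a, rater_b)
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"consistency r = {r:.2f}")        # 1.00: the raters order candidates identically
print(f"agreement kappa = {kappa:.2f}")  # ≈ -0.29: they never give the exact same score
```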
Alpha coefficient ⍺
Is based on:
- A single administration of a test
- (Co-)variances of the items → variation within and relationships between test items.
- Number of items → Higher ⍺ if there are more items
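A minimal sketch (hypothetical item scores, not from the lecture) of coefficient alpha from a single administration, using the standard formula ⍺ = k/(k−1) · (1 − Σ item variances / variance of the total score).
```python
# Coefficient alpha from a respondents-by-items score matrix (made-up data).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total (sum) score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# 5 hypothetical respondents answering 4 correlated items
scores = np.array([[4, 4, 5, 4],
                   [2, 3, 2, 2],
                   [5, 5, 4, 5],
                   [3, 3, 3, 2],
                   [4, 5, 4, 4]])
print(round(cronbach_alpha(scores), 2))
```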
Interpretation of alpha coefficient ⍺
Reliability is a characteristic of a measurement, not a method (e.g., questionnaire)
If individual diagnosis, then COTAN standards:
- rₓₓ < .80 is insufficient → Too much measurement error for individual statements.
- .80 ≤ rₓₓ < .90 is sufficient → Reliable enough for individual interpretation.
- rₓₓ ≥ .90 is good → Highly reliable measurement, suitable for individual diagnosis.
→ rₓₓ = reliability coefficient, indicating proportion of true-score variance in observed scores.
In research, rₓₓ = .60 or .70 can sometimes be used with caution.
What alpha is not
- A measure of uni-dimensionality (Schmitt, 1996).
- An indicator of the extent to which we measure what we want to measure.
Intraclass correlation coefficient (ICC)
The intraclass correlation coefficient (ICC) is a correlation coefficient that assesses the
consistency between measures of the same class. → Between raters.
How reliable are the ratings from multiple raters?
- 3 clinical psychologists rate the behavior of children with special needs
- 5 court judges estimate the likelihood of a defendant to recommit a crime
- 4 consultants rate candidates’ behavior in an interview
ICC differs by study design (Shrout & Fleiss, 1979)
- 6 ICC types: ICC (1, A), ICC (2, A), ICC (3, A), ICC (1, B), ICC (2, B), ICC (3, B)
- Each subject is rated by a different, randomly selected rater → 1
- A random sample of k raters rates all subjects → 2
- The same fixed set of raters rates all subjects → 3
- Are the ratings of all raters averaged at the end? → B (sometimes called k); if not → A
To determine A or B, check how ratings are used in practice
- A reliability study may use multiple raters, but in practice, only one rater may be
available (only one person conducts an interview) → A
- Panel interview: Multiple interviewers rate independently, and the total score is the
average across interviewers → B
→ Often raters do not want to rate individually; they want to consult each other.
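A minimal sketch (hypothetical ratings, not from the lecture) of the type-2 ICCs from Shrout & Fleiss (1979), computed from the two-way ANOVA mean squares; the "A" version applies when a single rater's score is used in practice, the "B" version when the k ratings are averaged.
```python
# ICC (2, A) and ICC (2, B) for a design where a random sample of k raters rates all subjects.
import numpy as np

def icc_type2(ratings: np.ndarray):
    """ratings: subjects x raters matrix."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # single rating used
    icc_b = (msr - mse) / (msr + (msc - mse) / n)                      # average of k ratings used
    return icc_a, icc_b

# Hypothetical panel interview: 4 consultants rate 6 candidates
ratings = np.array([[7, 6, 8, 7],
                    [5, 4, 6, 5],
                    [8, 7, 9, 8],
                    [4, 4, 5, 4],
                    [6, 5, 7, 6],
                    [7, 7, 8, 7]])
single, averaged = icc_type2(ratings)
print(f"ICC (2, A), single rater:    {single:.2f}")
print(f"ICC (2, B), averaged rating: {averaged:.2f}")
```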
ICC interpretation (Koo & Li, 2016)
- Below 0.50: poor
- Between 0.50 and 0.75: moderate
- Between 0.75 and 0.90: good
- Above 0.90: excellent
Confidence intervals (CI)
- We are interested in the uncertainty of the true score.
- With repeated sampling, the CI contains the true score a given percentage of the time (e.g., 95% of the time for a 95% CI).
CI = X ± z * SEM
- X = Test score
- z = critical value from the standard normal distribution (e.g., 1.96 for a 95% CI)
- SEM = standard error of measurement = σ * √(1 - rₓₓ), where:
- σ is the standard deviation of observed test scores
- rₓₓ = reliability of the test
As reliability decreases, the confidence interval widens.
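A minimal sketch (made-up numbers, not from the lecture) of the SEM and a 95% CI around one test score, on an IQ-style scale with σ = 15.
```python
# CI = X ± z * SEM, with SEM = σ * sqrt(1 - rxx) (hypothetical numbers).
import math

x = 110        # observed test score
sd = 15        # standard deviation of observed scores (σ)
rxx = 0.90     # reliability coefficient
z = 1.96       # z value for a 95% CI

sem = sd * math.sqrt(1 - rxx)               # ≈ 4.74
lower, upper = x - z * sem, x + z * sem
print(f"SEM = {sem:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")   # ≈ [100.7, 119.3]
# With rxx = .70 instead, SEM ≈ 8.2 and the interval widens to roughly [93.9, 126.1].
```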
Validity
Validity: “The extent to which a test measures what it should measure”.
Types of validity
- Face validity = “does it look like a measure relevant for job performance?”
- Content validity = “Does a measure represent all facets of a given construct?”
- Construct validity = how well a test measures the underlying theoretical concept.
- Convergent validity
- Discriminant/divergent validity
- Criterion-related validity = how well test scores relate to an external criterion or outcome.
- Concurrent validity
- Predictive validity
Construct validity
→ To what extent is the test a good measure of the underlying theoretical concept?
Internal structure
- Number of dimensions (factors) → factor analysis!
- Expected group differences (e.g., people high on neuroticism attend therapy more often than people low on neuroticism).
External structure
- Convergent validity: correlation between two measures of constructs that
theoretically should be correlated → E.g.: workaholism and health problems.
- Divergent validity: no correlation between two measures of constructs that
theoretically should not be correlated → E.g.: cognitive ability and agreeableness.
Factor analysis (FA)
Factor analysis (FA) is useful for revealing (exploratory FA) or verifying (confirmatory FA) the
underlying dimensions of a newly developed measure.
- Does our scale measure separate subdimensions or is it unidimensional?
In this course, we will only cover exploratory FA
- Summarize data by grouping together variables that are correlated.
- Typically used in the early stages of research, to consolidate variables.
Types of exploratory FA:
- Principal Components (PC): All variance in observed variables is analyzed. Variables
‘cause’ components
- Factor analysis (FA): Only shared variance is analyzed. Error variance is eliminated.
Factors ‘cause’ variables/items.
Imagine three items: “I can delay gratification”, “I avoid eating ice cream even if I would like
it”, “I don’t go clubbing if I have an exam the next day”. What could be an underlying factor
that ‘causes’ item responses? → Self control
→ With PC, the arrows would be reversed: the items ‘cause’ the component.
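A minimal sketch (simulated data, not from the lecture) of this idea with scikit-learn: three "self-control" items are generated from one latent factor plus noise, then summarized with both PC (all variance) and FA (shared variance only).
```python
# PC vs. FA on three simulated self-control items driven by one latent factor.
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(seed=2)
n = 500
self_control = rng.normal(size=n)                         # latent factor
items = np.column_stack([
    0.8 * self_control + rng.normal(scale=0.5, size=n),   # "I can delay gratification"
    0.7 * self_control + rng.normal(scale=0.6, size=n),   # "I avoid eating ice cream ..."
    0.6 * self_control + rng.normal(scale=0.7, size=n),   # "I don't go clubbing ..."
])

pc = PCA(n_components=1).fit(items)              # analyzes all variance in the items
fa = FactorAnalysis(n_components=1).fit(items)   # analyzes only the shared variance

print("PC loadings:", np.round(pc.components_, 2))
print("FA loadings:", np.round(fa.components_, 2))
print("variance explained by the first component:",
      round(pc.explained_variance_ratio_[0], 2))
```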
We look for variables in a correlation matrix that ‘cluster together’
- So, matrices with correlation coefficients r close to 0 are problematic: no clusters!
How do we check whether our correlation matrix is appropriate for factor analysis?
- Bartlett’s test of sphericity: tests whether all correlations are zero (i.e., whether the
correlation matrix is an identity matrix), but it is notoriously sensitive to N (→ with large N it
is almost always significant, so it is not very informative).
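A minimal sketch (simulated data, not from the lecture) of Bartlett's test computed by hand: the statistic is −(N − 1 − (2p + 5)/6) · ln|R| with p(p − 1)/2 degrees of freedom, where R is the correlation matrix of the p variables.
```python
# Bartlett's test of sphericity: does the correlation matrix differ from an identity matrix?
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data: np.ndarray):
    """data: respondents x variables matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    df = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, df)

# Three simulated items loading on one latent factor (made-up data)
rng = np.random.default_rng(seed=3)
latent = rng.normal(size=(300, 1))
data = latent @ np.array([[0.8, 0.7, 0.6]]) + rng.normal(scale=0.5, size=(300, 3))

stat, p_value = bartlett_sphericity(data)
print(f"chi2 = {stat:.1f}, p = {p_value:.4f}")   # with a large N this is almost always significant
```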