Test Theory
WEEK 1
Lecture 1 Introduction and basic knowledge of statistics
Why are we here?
• Psychology is one of the most exciting sciences • We study one of the most complex systems in the world
• It is an empirical science, so research revolves around observations
• As an empirical science it has one big problem: – Everything we find interesting is not directly observable!
Tests as the saviors of Psychology
• We study not-directly observable (= latent) properties
• With psychological tests we hope to measure them→we can barely see anything without these tools
• Psychology without tests is like astronomy without telescopes: – We can barely ‘see’ anything without these tools
• Psychologist without tests ≈ doctor without instruments: – Not enough information to provide proper treatment
Test Theory
• Developing & ensuring high-quality psychological tests is essential
– Without high-quality tests Psychology is barely a science
– Without high-quality tests Psychology can contribute little to society
• An entire discipline within Psychology is devoted to researching and improving the quality of tests and how we can
evaluate this quality – Test Theory (also called Psychometrics)
Measuring narcissism with one question?
Narcissus went to the forest, saw his own reflection and fell in love with himself; he was so obsessed that he couldn't leave the
pond in which he saw himself mirrored
To measure narcissism→just ask ‘Are you a narcissist?’ (narcissists are so self-obsessed that they see nothing wrong with saying they are
narcissists/that they love themselves)→BUT no single question, however good, can measure a construct completely
Test yourself: How perfectionistic are you?
Online test for perfectionism→nine-item test (often=3 points, sometimes=2, rarely=1)
Answers: sometimes, often, often, sometimes, sometimes, often, sometimes, sometimes, often→sum score 22
Offered interpretation of test score:
• 22-27: Category RED - Red alert!→according to this website perfectionism is impairing your everyday life and wellbeing
• 15-21: Category ORANGE - Watch out! • 9-14: Category GREEN - Everything is fine!
→It is a psychological test and we got results but it’s not high quality→Why?
Introduction
• When assessing individuals you generally use a test with a lot of items (9 in this case, usually more)
• Items as indicators for the construct (perfectionism)
1. Answers are assigned scores (item scores)→each item says something about where we stand on the perfectionism scale
2. Item scores are transformed to test scores (generally sum scores) →scores are combined
3. Test scores are interpreted
But what can you say about someone’s perfectionism on the basis of the test scores? To what extent does the test say
something meaningful about the construct we want to measure?
• Is it possible to interpret the test scores in a meaningful way?
• Does the test in fact measure perfectionism? (and not conscientiousness) How do you find out?
• Are the questions/instructions of good quality? • Are there enough questions in the test?
• One person has a score of 14 and another person has a score of 15. Is this difference large enough to conclude that they
differ in their perfectionism? Is the test precise/is the power high enough to say this?
Triangle of positions towards psychological testing
3 extreme attitudes towards psychological testing:
Disinterested→everything related to measuring psychological constructs with tests is completely uninteresting
Believer→Directly believes the test score and its interpretation without a second thought
Skeptic→Believes that it is impossible to use psychological tests to say something about psychological constructs
Best→somewhere in between believer and skeptic→want to measure things while being aware it is tricky
• Tests are used a lot for measurement in Psychology, generally the most convenient way to collect data
• If you want to measure well and accurately you need a good test. Otherwise sloppy science!→questionable research
• Unfortunately there are no clear-cut rules for creating a good test. It requires constant thought, good knowledge of the
property you want to measure, proper use of statistical methods.
• There are tools: This is what Test Theory deals with. How can I get a good test, evaluate its quality and improve it?
Introduction: Examples of use of Psychological tests
1. Clinical psychologist: psychological disorders → assigning new prisoners to facilities
2. Educational psychologist: learning abilities → placement of children in the correct type of secondary education (CITO)
3. Social psychologist: affection → scientific research
4. Occupational psychologist (HRM): intelligence → filling job vacancy for manager
5. Cultural psychologist: individualism/collectivism → explaining cultural differences
6. Developmental psychologist: attitude concerning upbringing → advice to parents with problem children
7. Teacher: student-mastery of domain of test theory→passing/failing (mastered/not mastered the domain of test theory)
Introduction: Test Theory very important course
Science: • Psychological research is about non-directly observable properties
• Tests are always necessary to measure these properties
• (Almost) all psychological theories created because of test theory (Big Five, intelligence)
• Guaranteed relevance for your bachelor thesis research! Deal with real data from psychological tests
Practice: • Most professions in social sciences make use of test results
• Often decision making based on psychological tests • You need to be able to assess the value of tests
• Critically evaluate and improve self-developed tests
• Knowledge of possibilities and limitations of psychological tests of great importance in almost all professions
Test Theory is very much an academic course • Not a cookbook course, always keep thinking critically
• Combination of statistics, substantive theory, experience, and creativity
Basic knowledge of statistics
Average, Variance, Standard deviation, z-scores, covariance, correlation
Respondent 1 has a realization of 6 on the variable X, range 1-10
Variable Y→only respondent 4 got that answer right (0=wrong, 1=right)
Average and deviation score
Average→the average performance on a variable (notation is arbitrary; labeling doesn’t matter as long as it’s used
consistently). N=sample size. Summation sign (Sigma)=sum over what follows: fill in i with numbers, starting at 1 in this
case, and look at Xi (i=1→X=6, i=2→X=9)
Deviation score→subtract the average from every score, recentering the scores so that the new average is zero→the score tells
whether you scored below or above the mean (x=-2→2 units below the mean, but is scoring 2 points lower bad or really bad? It’s not a
standardized value yet, so we cannot interpret it without context→z-score)
Average Variance z-score
→Average and deviation score (= centering) To standardize things:
Variance→square the deviation score of every person (x²) and average the squares
We use N and not N-1 because in this course we are talking about descriptive statistics (we focus on the sample we are
studying, not on the whole population). Drawing conclusions about the population→inferential statistics
Standard deviation→square root of the variance; can be interpreted as the average deviation from the mean (not fully correct)
Standardized scores (or z-scores)→deviation score divided by the standard deviation; after calculating them we know how many
standard deviations these people scored below or above the mean on these 3 variables
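A small worked sketch of the univariate statistics above. The data are hypothetical: only the first two scores (6 and 9) come from the lecture example, the other three are made up so that the numbers come out round. The denominator is N, following the descriptive-statistics convention of these notes.

```python
import math

# Hypothetical scores of N = 5 respondents on variable X (range 1-10).
# Only the first two values (6 and 9) come from the lecture example.
X = [6, 9, 3, 7, 5]
N = len(X)

mean = sum(X) / N                               # average
deviations = [xi - mean for xi in X]            # deviation scores, average 0
variance = sum(d ** 2 for d in deviations) / N  # divide by N, not N - 1
sd = math.sqrt(variance)                        # standard deviation
z = [d / sd for d in deviations]                # standardized scores

print(mean, variance, sd)  # 6.0 4.0 2.0
print(z)                   # [0.0, 1.5, -1.5, 0.5, -0.5]
```

Respondent 2 has a deviation score of 3, which only becomes interpretable as a z-score: 1.5 standard deviations above the mean.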
Bivariate statistics→Covariance and correlation→Association between variables
Covariance→measure of shared variance between 2 variables (the lecture uses the letter s for all variance-related quantities, the
book uses c). It is unstandardized, so its value has no direct meaning: e.g. a covariance of 10 tells the direction of the link,
but not how strong it is. We study it because we need it for the correlation
Correlation→standardized covariance; 0=no (linear) link at all, what I score on X tells me nothing about my score on Y→because it is
standardized we can compare correlations
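As a sketch, the statistics discussed so far can be written out as formulas. The notation is an assumption: s for variance-related quantities as in the lecture, a bar for the average, lowercase x and y for deviation scores, and N in the denominator because the notes treat these as descriptive statistics.

```latex
\begin{align*}
\bar{X} &= \frac{1}{N}\sum_{i=1}^{N} X_i \\
x_i &= X_i - \bar{X} \\
s_X^2 &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \\
s_X &= \sqrt{s_X^2} \\
z_i &= \frac{x_i}{s_X} \\
s_{XY} &= \frac{1}{N}\sum_{i=1}^{N} x_i \, y_i \\
r_{XY} &= \frac{s_{XY}}{s_X \, s_Y}
\end{align*}
```

Because the correlation divides the covariance by both standard deviations, it always lies between -1 and 1, which is what makes correlations comparable across pairs of variables.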
Variance-covariance matrix
→Variances on the diagonal elements (top left to bottom right)
→Covariances on the off-diagonal elements (everything that is not on that diagonal)
Correlation matrix
Ones on the diagonal elements, correlations on the off-diagonal elements
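A minimal sketch of how both matrices can be built, assuming hypothetical scores of five respondents on two variables X and Y (the data and function names are made up for illustration). As elsewhere in these notes, the denominator is N.

```python
import math


def mean(v):
    return sum(v) / len(v)


def covariance(a, b):
    """Average cross-product of deviation scores (denominator N)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def cov_matrix(variables):
    """Variances on the diagonal, covariances off the diagonal."""
    return [[covariance(a, b) for b in variables] for a in variables]


def corr_matrix(variables):
    """Ones on the diagonal, correlations off the diagonal."""
    s = cov_matrix(variables)
    k = len(variables)
    return [[s[i][j] / math.sqrt(s[i][i] * s[j][j]) for j in range(k)]
            for i in range(k)]


# Hypothetical data: 5 respondents, 2 variables.
X = [6, 9, 3, 7, 5]
Y = [4, 8, 2, 5, 6]

S = cov_matrix([X, Y])
R = corr_matrix([X, Y])
print(S)  # [[4.0, 3.4], [3.4, 4.0]]
print(R)  # [[1.0, 0.85], [0.85, 1.0]]
```

The covariance of a variable with itself is its variance, which is why the variances appear on the diagonal, and both matrices are symmetric. A variable with zero variance would cause a division by zero in corr_matrix; the sketch does not handle that case.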
Lecture 2: Properties of tests and items
What is a psychological test?
• Cronbach (1960): ‘a systematic procedure for comparing the behavior of two or more people’
• This ‘procedure’ can take on many forms:
– Multiple-choice aptitude test – Personality test with open-ended questions
– Systematic behavioral observation – Rorschach inkblot test
• Three crucial properties:
– Aimed at measuring behavior (observable)
– Systematic (objective: the result doesn't depend on personal perception; someone else giving the test should not arrive at a different outcome)
– Comparison of different people∗ (comparative, always you in relation to someone else, always comparing data)
∗Or of people over time! → compare me and future me (longitudinal studies)
Types of tests
• Tests for maximum performance vs. ‘typical’ performance:
– Maximum performance tests for measuring skills/aptitude (see what you’re capable of by showing your maximum capacity)
→you can do better or worse on each item
– Typical performance tests for measuring personality traits, attitudes, disorders (negative definition: anything that is not a
maximum performance test; there’s no right or wrong answer, only what is correct for you, not in general→trying to get a picture of what
is typical for you)
– Big differences in the approach of test development
– Few differences in the statistical analysis of test scores: once we have the data it doesn’t matter much whether the test was a
maximum or a typical performance test, the same strategies can be applied. The distinction concerns the content side; on a
statistical level it makes little difference.
• Two types of maximum performance tests: ‘Power’ and ‘speed’ tests
– Power tests measure skill without time pressure (most common)→measuring ability, showing the maximum you’re
capable of, e.g. our exams at university→Expect more skilled people to give more correct answers (=higher ability)
Items have a certain difficulty
– Speed tests measure performance under severe time pressure, also trying to get maximum performance
All items are easy→Question difficulty is trivial (most people can answer them correctly when given enough time)
→ More skilled people answer more questions within the time limit
Example: Bourdon dot concentration test (speed test)
Go through a table as quickly as you can and circle all the groups that have 4 dots. This is easy given enough time, but
difficult with limited time. Many jobs require performing well under time pressure and quickly processing visual stimuli (making
split-second decisions, for example if you are a pilot)
• Norm-referenced or criterion-referenced tests: what do we do with the test scores once we obtain them?
– Norm-referenced tests compare people to the rest of the population (the norm is based on the population; you need to know
how you relate to the population)→Good norm data on this population are of great importance
Compare to a norm that we derive from the population
– Criterion-referenced tests compare people with an absolute standard, an external criterion not linked to the norm/standard
score of the population→In many situations we don’t want to know whether you are in the best 10%, but how an
individual relates to an absolute standard, rather than where they stand in a population→Test inferences not tied to performance
level in the population
E.g.: Exam Test Theory→how well you have mastered the material, not how well you scored compared to the others
The criterion itself may be based on psychological studies of a population, but you have to look at your goal: if
you want to see how you stand compared to others it is a norm-referenced test, if not it is a criterion-referenced test, even
though the criterion is based on normative data.
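A sketch of the two ways of interpreting the same raw score. The norm group, the cutoff and the function names are all hypothetical, chosen only to show that one score can look different under the two readings.

```python
def percentile_rank(score, norm_scores):
    """Norm-referenced: percentage of the norm group scoring lower."""
    lower = sum(1 for s in norm_scores if s < score)
    return 100 * lower / len(norm_scores)


def mastered(score, cutoff):
    """Criterion-referenced: compare with an absolute standard."""
    return score >= cutoff


# Hypothetical norm group and cutoff, made up for illustration.
norm_group = [12, 15, 18, 20, 21, 23, 25, 26, 28, 30]
CUTOFF = 22

score = 21
print(percentile_rank(score, norm_group))  # 40.0
print(mastered(score, CUTOFF))             # False
```

The norm-referenced reading needs good data on the population (norm_group), while the criterion-referenced reading only needs the standard (CUTOFF), even if that standard was once derived from normative data.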
What does a psychological test contain?
• Test material→physical object, code on a computer→the literal test
• Test forms→the way of responding, tool for registering the responses (separate from the test material)
• Test manual→documentation that provides crucial information about the test to know before using it; contains:
1. Precise test instructions→systematic/objective: everyone who uses the test does it in the same way, following clear instructions
on exactly what to do under precise conditions
2. Score-processing procedure→how to go from qualitative responses to quantitative scores
3. Norm tables→information about the population, used in norm-referenced tests to compare individuals to the norm
4. Discussion of scientific qualities→you can’t use an IQ test without knowing which framework of intelligence it uses and
whether that matches your purposes→most tests don’t have it, but high-quality tests need all of this
Example test material
Rakt test→measures intelligence in small kids (3 years old); standard testing doesn’t work for them→need to be creative
Many subtests to measure different forms of intelligence
E.g. many incomplete images; kids need to say what they represent (test material)
Example (fictitious) test form
Instructions say that the last answer given about what the picture shows counts as the final answer→every
correct answer scores 1, every incorrect answer scores 0. The step from answer to score is the assessment.
Item scores are determined such that they are indicative of the construct you want to measure: higher item scores =
‘higher’ on that attribute (this holds for maximum performance tests; typical performance tests can have contra-indicative items,
where agreeing with the statement should give a lower score than disagreeing→need to recode so that the score is indicative)
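A minimal sketch of recoding contra-indicative items, assuming a hypothetical four-item questionnaire answered on a 1-5 scale in which the second and fourth item are contra-indicative. The recode function and the data are illustrations, not taken from a real test manual.

```python
def recode(score, low=1, high=5):
    """Reverse a contra-indicative item score on a low..high scale."""
    return low + high - score


# Hypothetical responses on a 1-5 scale; the items at index 1 and 3
# are assumed to be contra-indicative.
responses = [4, 2, 5, 1]
contra = {1, 3}

item_scores = [recode(r) if i in contra else r
               for i, r in enumerate(responses)]
total = sum(item_scores)

print(item_scores, total)  # [4, 4, 5, 5] 18
```

After recoding, a higher score on every item means ‘higher’ on the attribute, so the item scores can be summed into a test score.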
Properties of the test score
• Test score is generally the sum of the item scores→ Most important outcome of the test that is used
• Test manual gives instructions on how to interpret the score (see lecture 3)
• With norm-referenced tests the norm table needs to be consulted (the score is not compared to an absolute standard such as ‘kid
will have problems later in life’)
• e.g.: 30% of boys aged 3 have a score lower than 3 (30th percentile)
Measurement level test score
• Test score is a number (but this doesn’t mean that what we measure is quantitative: a score of 4 is not necessarily twice a score of 2)
• Interpretation of this number depends on the level of measurement of the test score:
– Nominal (e.g. personality types) no better or worse, no order, just categories→qualitative
– Ordinal (e.g. short Likert scales) ordered categories, lower vs. higher (rate how often, 1-5)→qualitative
– Interval (e.g. long Likert scales?) equal intervals have the same interpretation→quantitative measures
– Ratio (e.g. Bourdon dot test?) interval+meaningful zero point (you can be twice as fast as someone else)
Test scores with interval level of measurement?
• Scores are only of interval (or ratio) level of measurement if they are ‘quantitative’:
– An increase of 1 score point always needs to reflect the same specific increase in the property you are measuring, in the
underlying construct
E.g. • Person A, B, and C, with introversion scores 10, 20, and 30
– Score difference between A and B and between B and C of equal size
– Not obvious that differences in introversion are comparable! The differences don’t necessarily have the same meaning.
Cannot simply go from an ordinal scale to a quantitative scale→skeptical of interval interpretation of Likert scales
• Test scores are (usually) the sum of item scores
• Item scores evidently ordinal