Writing and Evaluating Test Items
This section focuses on how to create a psychological measure that is valid and reliable.
The terms test and questionnaire are used interchangeably here. The questions we want to ask are called items, and together they make up the
questionnaire. Many things must be considered when creating a reliable and valid test.
The choice of format follows from the objectives and purpose of the test (what you want to measure or answer).
EXAMPLE 1: Intergroup contact
Do we want to measure intergroup contact in work settings, or social settings?
Do we want to measure extent/amount of intergroup contact, or do we want to measure
quality of intergroup contact?
EXAMPLE 2: Road rage
Are we aiming to measure the amount of road rage experienced in any given situation
on the road?
Or do we want to measure factors that predispose people to experiencing road rage?
Item writing guidelines
Define clearly what you want to measure
You can’t construct test/questionnaire items if you don’t know what it is you are trying to measure
Generate an item pool
Best items are selected from the pool after item analysis
Avoid long items
Long items slow processing and cause confusion: when an item is tedious to read, respondents skim over it and may
not properly grasp what you are trying to ask
E.g., When you are in a stressful situation would you be inclined to talk to friends and family about the things
that are making you feel stressed out the moment you feel overwhelmed?
Keep the reading difficulty appropriate
Use appropriate vocabulary
Avoid using jargon that would not be known to the sample
Take your sample’s level of education into account
Use clear & concise wording (grammatically correct)
Avoid double-barreled items, which ask about two things at once and slow processing (E.g., Is this tool interesting and useful?)
Avoid double-negatives (E.g., It is not unfortunate that…)
Mix positively and negatively worded items in the same test
Can help prevent response sets: asking the opposite of what you are trying to measure checks that respondents are
paying attention rather than answering every item the same way. E.g., I felt depressed vs. I felt hopeful about the future
Make sure your items are as culturally neutral as possible
MCQ items
Vary position of correct answer
All distractors (the incorrect alternatives) should be plausible
True/false questions
True and false statements should be about the same length
Include equal numbers of true and false statements
Make the content relevant to the purpose. E.g., A personality test would not ask what the capital of South
Africa is, but a measure of South African national identity might
Item formats
One of the first questions to ask is what kind of test we want to construct. What format do we want the
responses to take?
Dichotomous format
Polytomous format
The Likert format
The category format
Checklists and Q-sorts
Dichotomous format
Di = 2, therefore 2 alternatives. True/False and Yes/No are the most common. There is only one distractor; respondents choose between 2 options
Major advantage
Ease of administration and quick scoring
Requires absolute judgment (E.g., I often experience road rage.) Ask whether you need respondents to make an absolute judgment and whether it is useful
An alternative is: I experience road rage: 1. Never 2. Sometimes 3. Often 4. Very often. This is the Likert
format, which discriminates more finely but can be ambiguous (what does ‘often’ mean?)
Major disadvantages
Much less reliable: when items are marked for accuracy, it is easy to guess. More options are better, so that a respondent
who does not understand the item no longer has a 50% chance of being correct simply by guessing.
50% chance of getting an item correct
Smaller range of scores when it comes to doing analyses
Encourages memorization
Memorization can outperform understanding
Truth often comes in shades of grey, not black and white
Not always easy to set questions/items in this format
E.g., Would you describe yourself as empathic?
Polytomous format
More than 2 alternatives
E.g., MCQ items: Where is Egypt?
A. Next to Iran
B. At the top of Africa
C. In Europe
D. In South America
Distractors = incorrect alternatives
Test can retain its psychometric properties with as few as 3
Must ensure that distractors are as clearly written and as plausible as the correct answer
Avoid cute distractors
4 alternatives commonly used in educational settings
Psychometric theory suggests more distractors = more reliable
However, difficult to find many good distractors
3-4 good distractors seem to be ideal
Advantages similar to dichotomous format
Easy to administer and score
Requires absolute judgment
BUT: More reliable than dichotomous and less chance of guessing correctly
Correction for guessing:
Because guessing can lead to higher scores, corrected scores are sometimes used.
Corrected score = R – (W/(n – 1))
No. of right answers – (no. of wrong answers divided by (no. of alternatives for each item – 1))
R = number of correct answers
W = number of wrong answers
n = number of alternatives for each item
Omitted answers are excluded from this calculation
Example: Polytomous format, 4 alternatives
40 item MCQ test, person gets 27 correct (13 incorrect)
27 – (13/(4 – 1)) = 22,67
So 22,67/40 (57%) instead of 27/40 (68%)
Example: Dichotomous format, 2 alternatives
40 item test, person gets 30 correct, 7 incorrect, and omits 3
30 – (7/(2 – 1)) = 23
So 23/40 (57,5%) instead of 30/40 (75%)
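The correction for guessing can be sketched in Python as a quick check of the two worked examples. The function name corrected_score and its parameter names are illustrative assumptions, not part of any standard library. Omitted answers are simply left out of both counts.

```python
def corrected_score(right: int, wrong: int, n_alternatives: int) -> float:
    """Correction for guessing: R - W / (n - 1).

    Omitted answers count as neither right nor wrong, so they are not passed in.
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return right - wrong / (n_alternatives - 1)


# Worked examples from the notes (both tests have 40 items).
four_option = corrected_score(right=27, wrong=13, n_alternatives=4)
two_option = corrected_score(right=30, wrong=7, n_alternatives=2)

print(round(four_option, 2))  # 22.67
print(two_option)             # 23.0
```

Note that the penalty per wrong answer shrinks as the number of alternatives grows: a wrong answer costs a full mark with 2 alternatives but only a third of a mark with 4.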
Likert Format
Indicates degree of agreement (E.g., strongly disagree, disagree, neutral, agree, strongly agree)
A 6-point scale (or any even number of options) is used to avoid the neutral response (forces an opinion by not having
a neutral ‘middle’)
Reverse score negatively-worded items. E.g., suppose a higher score indicates that you like what is being measured. If a
respondent agrees with a negatively worded item, reverse score it by giving a 1 instead of a 5. So, when you need
to reverse score on a 5-point scale: 5 becomes 1 and 1 becomes 5, 4 becomes 2 and 2 becomes 4, and 3 stays 3.
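Reverse scoring amounts to mirroring a response around the scale midpoint, which a short Python sketch can make concrete. The function name reverse_score, the item names, and the responses below are hypothetical, chosen only for illustration; which items count as negatively worded depends on what the test is measuring.

```python
def reverse_score(response: int, scale_min: int = 1, scale_max: int = 5) -> int:
    """Mirror a response around the scale midpoint (5 -> 1, 4 -> 2, 3 -> 3 on a 1-5 scale)."""
    if not scale_min <= response <= scale_max:
        raise ValueError(f"response {response} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - response


# Hypothetical respondent: item name -> raw response on a 1-5 agreement scale.
raw = {"hopeful": 4, "depressed": 2, "energetic": 5}
negatively_worded = {"depressed"}  # assumed to be the reverse-keyed item

# Reverse only the negatively worded items, then sum to get the scale total.
scored = {
    item: reverse_score(value) if item in negatively_worded else value
    for item, value in raw.items()
}
total = sum(scored.values())

print(scored)  # {'hopeful': 4, 'depressed': 4, 'energetic': 5}
print(total)   # 13
```

The same rule works for other scale lengths: on a 6-point scale, pass scale_max=6 so that 6 becomes 1 and 1 becomes 6.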
Questions/issues:
How many responses is best?
Items should not be questions, but statements
E.g., not ‘Do you think that…’ but rather ‘I think that…’
The Category Format
On a scale of 1 to 10…
Why 10?
Research suggests 7 is best
Problems
Tendency to spread responses across all categories
Susceptible to the groupings of things being rated (context)
Element of randomness
Use when?
People are highly involved with a subject
E.g., asking people in townships to rate service delivery
More motivated to make a finer discrimination
Want to measure the amount of something
E.g., road rage experienced in a given situation
Make sure your endpoints are clearly defined
Visual analogue scale
Better than Likert, which gives ranked (ordinal) data: differences between points on a
visual analogue scale are more meaningful.
Checklists and Q-sorts
Checklists
Common in personality measures
A list of adjectives, check which ones describe you best
Q-sort
Place statements into piles and then rank in order of
importance/agreement within each category.
Piles indicate degree to which you think a statement
describes a person/yourself
Category format implicit here
More qualitative in nature (more subjective questions)
With many categories, the statements tend to be normally
distributed across the piles.