[LANGUAGE TESTING AND EVALUATION 10]
GÁBOR SZABÓ
Applying Item Response Theory in Language Test Item Bank Building

Table of contents

Acknowledgements
Introduction
1. Measurement Theory
1.1. General Concerns
1.2. Educational Measurement
1.3. Classical Test Theory
1.3.1. The True Score Model
1.3.2. Reliability
1.3.3. Validity
1.3.4. Traditional Item and Person Statistics
1.3.5. Summary
1.4. Item Response Theory (IRT)
1.4.1. Central Concepts
1.4.2. IRT Assumptions
1.4.3. IRT Models
1.4.3.1. The One-Parameter Logistic Model—The Rasch Model
1.4.3.2. The Two-Parameter Logistic Model
1.4.3.3. The Three-Parameter Logistic Model
1.4.4. Estimation of Item Difficulty and Person Ability
1.4.5. Model-Data Fit Statistics
1.4.6. Practical Applications of IRT Models
1.4.7. Problems with IRT
1.4.8. Summary
1.5. Applications of Measurement Theory in Language Testing
1.5.1. Classical Test Theory in Language Testing
1.5.2. Item Response Theory in Language Testing
2. Building a Language Testing Item Bank at UP
2.1. Background
2.2. Research Questions
2.3. Stages of Development
2.3.1. The Beginnings
2.3.1.1. Specifications
2.3.1.2. Structural Buildup
2.3.2. Modifications
2.3.3. Evaluation
2.4. Stages of Moderation
2.4.1. Piloting
2.4.2. Applying Classical Test Analysis
2.5. Building an Item Bank
2.5.1. Applying the Rasch Model
2.5.2. Anchoring and Item Calibration
2.5.3. Item and Person Fit
2.5.4. Data Management
2.5.5. Limitations
2.6. Taking Stock and Looking Ahead
2.6.1. Lessons to Learn
2.6.2. Into the Future
2.7. Using the UP Data in Further Empirical Research
2.7.1. Stating the problem
2.7.2. Research design
2.7.3. Results
2.7.4. Implications
2.7.5. Conclusion
Conclusions
References
Index
List of figures and tables
Appendices

Acknowledgements

I would like to thank Marianne Nikolov and József Horváth, my colleagues, for their support and invaluable comments on earlier drafts of this book. Their contribution is greatly appreciated. I also wish to thank Charles Alderson and Caroline Clapham of Lancaster University, whose expertise and readiness to advise me helped me tremendously with various research problems I encountered. I would like to express my gratitude to the British Council as well for providing funding for my study trips, without which carrying out my research would have been extremely difficult, perhaps even impossible. I am indebted to the Department of English Applied Linguistics at the University of Pécs and the "With Language Competence for a Unified Europe" Foundation for their generous support in making the publication of this book possible. I also thank my wife, Marianna, and my daughter, Dalma, for their patience and support, which have served as never-ending inspiration for me. Most of all, though, I thank God for giving me the ability, strength, and determination to complete this book. May this work serve His glory.

Introduction

If someone carried out a survey on how people—whether students or teachers—feel about language testing, the results would most probably indicate that testing is a necessary evil, but an evil all the same. It must be done for administrative purposes, but it hinders rather than facilitates the learning process.
While it is certainly true that testing may have some negative impact on certain learners' motivation, the information provided by efficient testing far outweighs the unfavorable side effects. To accept this assumption, one must understand that this information is not merely, and not even primarily, about who gets what grade or who passes and who fails, but rather about whether measuring second language performance can be done effectively enough to provide data about learners' level of proficiency as related to certain criteria. The significance of this information is easy to discern: by learning about candidates' strengths and weaknesses, teaching can be given a boost in terms of focusing attention on what appears to be a problematic area. Though some would argue that the above is true only in the case of diagnostic tests, testing in general and proficiency testing in particular do provide information which can be used for diagnostic purposes as well. In other words, whichever test type we choose for some specific purpose, there is always an opportunity to use the results as a starting point for practical work.

Another consideration to be borne in mind is the use research can make of tests and test results. Second language acquisition research applies several methods to elicit samples of language to be studied, but testing appears to be one of the most objective of all. By devising language tests focusing on areas research intends to concentrate on, it is possible to gain meaningful and reliable information concerning learners' performance and, indirectly, their competence as well. In other words, testing can be an objective means of assessing learners' current position on the continuum often referred to as interlanguage (cf. Selinker 1972). To utilize this potential, it is essential to carry out a detailed analysis of learners' test performance, in the course of which test results are quantified and made suitable for drawing conclusions from. Moreover, the results of the analyses are suitable not only for evaluating candidate performance but also for assessing item and test performance objectively, thus guaranteeing the efficiency of measurement.

Language tests are often administered for selection purposes, and such projects tend to operate on a long-term basis. Clearly, in this case it is desirable to carry out selection on the basis of the same criteria on different occasions. To achieve this, one needs to guarantee that different test versions represent the same level. The best way to accomplish this purpose is to establish a bank of test items whose characteristics are determined objectively, which makes it possible to assemble tests whose characteristics are known in advance.

To be able to set up an item bank of this kind, however, one needs to rely on modern test theory, which offers the theoretical background for procedures to obtain objective data on item characteristics. In this book I will present the theoretical background as well as the practical processes related to applications of classical as well as modern test theory in language testing, including the setting up of such an item bank. In the first part, I will discuss general concerns related to measurement theory. Then, I will describe psychological measurement specifically, of which educational measurement is a specific sub-field. Next, two test theories will be presented in detail with special regard to their applications in educational contexts.
In the following sections I will discuss how these two theories can be utilized in language testing, a specific field of educational measurement. In the second part of the book I will give a detailed description of an ongoing language testing project, which utilizes both theoretical approaches in order to guarantee quality as well as to establish an item bank. Based on over eleven years of research, this part offers a detailed account of the stages of the test construction, the moderation, and the item bank building process, including theoretical and practical considerations, implementation procedures, as well as various types of analyses. After the discussion of the results and consequences of statistical analyses, both the limitations and the future prospects of the project are outlined. Finally, conclusions will be drawn.

In today's age of quality and quality control, educational measurement is attracting more and more interest all over the world—even in countries without substantial resources to be devoted to the field—including the author's native Hungary (see e.g. Csapó 1998). Thus the importance of familiarity with different test theories and their fields of application is growing as well. Modern test theory, however, though it has appeared in some Hungarian sources (e.g. Bárdos 2002; Csapó 1993; Horváth 1991), has not been at the center of attention. The focus of this book, then, is how, despite its complexity, modern test theory—which is less well known in Hungary, especially in language assessment—can be used effectively for test construction and evaluation. Clearly, some aspects are controversial, and I do not intend to claim to be able to give all the answers. Still, I believe the book makes it clear that the modern theoretical approach to testing problems is fully justified, and that, in tandem with traditional theories, it offers a more comprehensive answer to many of the challenges language testers have to face. After all, a tester's life is full of challenges: new theories, new items, and new testees all the time. Yet these challenges provide never-ceasing opportunities to learn from our very own mistakes and improve the quality of future tests.

1. Measurement Theory

1.1 General Concerns

Measurement, as used in the general sense of the word, refers to obtaining information about specific characteristic features of certain things. To be able to measure, one needs to identify the object of measurement as well as the measuring device along with the measuring units (Thorndike and Hagen 1977:9). Measurement, as used in a psychological context, however, calls for more subtle definitions of the categories above. The object of measurement in this sense is called a psychological attribute or trait, which is defined in terms of observable behavior and thus can only be approached indirectly (Crocker and Algina 1986:4). The exact nature of such a trait is determined by the specific field of psychological measurement in question. In this sense, an attribute may be manifested, for instance, in the form of a personality trait, such as empathy, which, in turn, may be measured by means of various measuring devices or tests, the results of which may then serve as a basis for drawing conclusions or establishing links to other traits (cf. Larsen-Freeman and Long 1991:189-190).

Once the object of measurement has been defined, it is the nature of the measuring device that needs clarification next.
Crocker and Algina (1986:4) define a test as "... a standard procedure for obtaining a sample of behavior from a specified domain." Carroll's (1968:46) earlier definition claims that a test should "... elicit certain behavior from which one can make inferences about certain characteristics of an individual." The two definitions point to two crucial characteristics of tests, namely their standardized nature and their ability to offer a basis for drawing meaningful conclusions about the test taker. While these working definitions offer a theoretical framework for measurement, the actual process of test construction—as we shall see—is far more complex.

It seems doubtless that out of the three major concerns related to measurement in general, it is the issue of measuring units that is most problematic in psychological measurement. As a number of authors observe, in physical measurement the units of measurement (e.g. of length) are clearly and objectively definable, whereas in psychological measurement such units are never objective, and even a broad definition may provoke counter-suggestions (Stanley 1972:60-61; Guilford and Fruchter 1978:23-24; Crocker and Algina 1986:6-7). Thus it is rather problematic to answer questions like whether a score of zero on a test indicates a total lack of knowledge, or whether an increase in the score means a proportionate increase in ability as well (Crocker and Algina 1986:6-7).

Apart from the problem of measuring units, however, Crocker and Algina (1986:6-7) identify four more general problems in psychological measurement. First, they claim that there are numerous approaches to measuring any construct, none of which is universally accepted. In other words, two tests aimed at measuring the same construct may elicit quite different types of responses from the same examinee, owing to the fact that the tests focused on different types of behavior in their attempt to define the construct operationally. The consequence of this may well be different conclusions concerning the testee's ability.

The second problem is that of limited samples of behavior. Whatever the object of measurement may be, it is clearly impossible to confront an examinee with all the possible problems he/she may face concerning a particular ability. Thus, the actual measuring device has to elicit a sample of behavior which is representative enough for measurement purposes. Needless to say, ensuring this may be rather difficult. (The issue of content validation will be discussed in more detail in Section 1.3.3.)

The third problem concerns the error of measurement. It is a well-known fact that if a student takes the same test twice, his/her scores will most probably be different. This difference will most likely be manifest even if the test functions well, as the test takers are influenced by numerous factors outside the test, such as fatigue or guessing. Owing to such factors there will always be some kind of measurement error present, though minimizing it is a paramount concern. (The role of measurement error in establishing reliability of measurement will be further discussed in Section 1.3.2.)

The fourth concern relates to the connection between a particular construct and other constructs or observable phenomena. Defining a construct in terms of observable behavior alone is of little use. To be able to interpret the results of psychological measurement, it is necessary to determine how the construct measured relates to other elements of the theoretical framework.
To do this with empirical accuracy is what Crocker and Algina (1986:7) call "... the ultimate challenge in test development."

The problem points identified above appear to present a daunting task for test developers. Indeed, finding the practical solutions requires a sound theoretical background. Test theory as such intends to provide a basis for examining and—at least in part—for solving the problems enumerated earlier.

The majority of the results and procedures related to psychological measurement are applied in the field of education as well. Indeed, the object of the present volume is related to a specific field of educational measurement, too. Thus, it seems logical to examine next some aspects of educational measurement.

1.2 Educational Measurement

Measurement in general requires precision and accuracy, since the results are supposed to be meaningful enough to provide reliable information for various forms of decision making. This is all the more so in the case of educational measurement. This claim can be supported by arguing that, first, whenever mental attributes are measured with the purpose of decision making, the effectiveness of measurement—owing to its indirect nature—is always doubtful, yet always of crucial importance. In this sense, educational measurement has all the characteristics and all the problems of psychological measurement. Second, in educational measurement the most common purpose of measurement is decision making. Thus, masses of people are tested regularly; moreover, the decisions made on the basis of the results of measurement may have major consequences concerning the students involved. All this underlines the importance of examining how the general problems of psychological measurement identified earlier are manifested in the context of educational measurement.

At this point it seems appropriate to examine the distinction between the two major types of educational measurement: formative and summative assessment. Formative assessment relates to identifying what students have or have not learnt, and, therefore, with the information gained from and the decisions made on the basis of its results, it intends to aid the teaching-learning process. Summative assessment, on the other hand, is intended specifically for selection and certification purposes, thus providing evaluative information (Gipps and Murphy 1994:260). This distinction is important in many ways, but concerning the five problem areas identified earlier, it is only the issue of limited samples that offers qualitative differences between the two types of assessment. Since formative assessment is directly linked to the process of teaching and learning, it seems logical to assume that if one intends to make decisions on the grounds of the results, there has to be a series of tests administered to the population. Such continuous assessment then seems to eliminate the problem of limited samples. Yet it has to be noted that concerning the individual tests, the sample is limited by definition, unless each test of the series focuses on the very same points—an unlikely condition. Our conclusion then is that though formative and summative assessment differ in many ways, neither approach provides a solution to the measurement problems described earlier.

The real answer to at least part of the problems is provided by psychometrics, the science of using mathematical procedures in the field of psychological measurement.
By means of psychometrics, the problems of sampling, measurement error, and the difficulties of defining units on measurement scales can be turned into statistical problems, which can be solved by statistical means. The application of psychometrics in educational measurement has been advocated for decades now (e.g. Ebel 1972; Brown 1976), and even specific applications to particular fields of educational measurement—e.g. language testing (Henning 1987)—have been proposed. Some, however, have also questioned whether it is possible to apply methods used in psychological measurement in an educational context, as identifying a single trait may be even more difficult, and since typically it is multidimensional abilities that are measured, criteria should be fundamentally different (Brown 1980; Goldstein 1980). A counter-argument is presented by Choppin, who points out that educational measurement has an important role—among other things—in identifying individual students' problems (1981:213-215), and that measurement is unidimensional as it intends to quantify something, as opposed to operations (e.g. examinations), which may relate to several dimensions (1981:205-207).

Despite such concerns, it seems inevitable that psychometric procedures be used in educational measurement. The reason for this is quite simple. Since even in the most humanistic classroom there is a need for assessment, the job of the professional is to ensure that the tests used are the best possible. As we have seen, general measurement problems can only be handled successfully by psychometric means; thus eliminating psychometrics would make effective test construction and evaluation virtually impossible.

Needless to say, the application of psychometrics in educational measurement has obvious limitations. Statistics can only contribute in the case of objectively quantifiable data. Subjective assessment—e.g. of foreign language oral proficiency—offers little room for psychometric procedures, though even in this case monitoring inter-rater reliability necessitates certain statistical calculations. Ongoing informal assessment, however, can certainly not be placed within the boundaries of items and responses and, thus, is not subject to psychometric inquiries.

Obviously, psychometrics does not provide an answer to all measurement problems either. Indeed, the issue of relating a construct measured to other elements of the theoretical framework requires a fundamentally different solution. One such solution is presented by Marzano, Pickering, and McTighe (1993). Their Dimensions of Learning Model is made up of five interrelated yet distinct fields or "dimensions," namely Positive Attitudes and Perceptions About Learning, Acquiring and Integrating Knowledge, Extending and Refining Knowledge, Using Knowledge Meaningfully, and Productive Habits of Mind (Marzano et al. 1993:1-5). All the data gathered through assessment are placed and interpreted within this framework, which makes it possible to establish relationships between the various constructs defined within the various dimensions. Despite the theoretical appeal of the model, however, it must be noted that the authors do not present a model for the quantification of the content of these dimensions.
While the theoretical content of each dimension is described extensively, it still remains doubtful how—or, in fact, whether—it is possible to use assessment procedures by means of which quantifiable data can be interpreted in a concrete, meaningful way using the model's complex interrelations between dimensions.

So far I have given an overview of some crucial aspects of measurement theory in general and educational measurement in particular. It has been demonstrated that applying psychometric procedures is desirable, indeed necessary, for successful test construction and evaluation. In the following, we are going to examine and compare two statistical approaches to measurement theory, often labeled as "classical" and "modern" test theory (Gustafsson 1977; Crocker and Algina 1986). Though they are different in many ways, they should not be considered as rivals, but rather as complementary (Hulin, Drasgow and Parsons 1983:67). Following a chronological as well as a logical order, let us first take a look at classical test theory, which will, in turn, be followed by an examination of modern test theory.

1.3 Classical Test Theory

Classical test theory has its origins in the work of Spearman in the early part of the twentieth century (Crocker and Algina 1986:106). His concept of what is known today as the Classical True Score Model served as the starting point for developing various mathematical procedures for test data analysis (cf. Magnusson 1967; Lord and Novick 1968). In this section I am first going to examine the essence of the True Score Model, which will then be followed by an account of an essential component of the model: reliability. Then I am going to take a closer look at another essential field of analysis, namely validity. Finally, I will present the item and person statistics made possible in the framework of Classical Test Theory in order to show the scope as well as the limitations of traditional analyses in this regard.

1.3.1 The True Score Model

Spearman's original concept is based on a simple formula:

X = T + E

where X is a particular test taker's observed score, which is made up of the true score (T) and the error of measurement (E) (Crocker and Algina 1986:107). The true score is defined by Guilford and Fruchter (1978:409) as the score the examinee would achieve if the measuring instrument used were perfect and the conditions were ideal. What this would mean in practice is that the error of measurement, which has already been identified as a general measurement problem, would be entirely eliminated. Obviously, in practical terms this is not possible. Consequently, the operational definition of the true score can be grasped by imagining a candidate taking a particular test an indefinitely large number of times—without the repetitions having any effect—and then taking the average of the observed scores, which would then effectively be the true score (Bachman 2004:158-159; Crocker and Algina 1986:109; Hughes 1989:33). It follows from here that, according to the model, the value of the true score is assumed to be constant over all administrations of the test (Thorndike 1982a:4).
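The operational definition above lends itself to a short simulation. The sketch below is a minimal illustration of my own, not taken from the book: the true score of 65, the error standard deviation of 4, and the normal error distribution are all assumptions made purely for demonstration. It generates observed scores X = T + E for repeated administrations of the same test and shows that their average approaches the constant true score T.

import numpy as np

# Minimal sketch of the Classical True Score Model, X = T + E.
# The numbers are invented for illustration: a fixed true score of 65 points
# and normally distributed measurement error with a standard deviation of 4.
rng = np.random.default_rng(seed=42)

true_score = 65.0            # T: assumed constant over administrations
error_sd = 4.0               # spread of the measurement error E

# One observed score per hypothetical administration of the same test,
# with the repetitions assumed to have no effect on the candidate.
n_administrations = 10_000
observed = true_score + rng.normal(0.0, error_sd, size=n_administrations)

# The mean of the observed scores converges on the true score, which is
# exactly the operational definition given in the text above.
print(f"Mean of {n_administrations} observed scores: {observed.mean():.2f}")
print(f"True score: {true_score}")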
It is important to point out, however, that there is a major difference in terms of the meaning of the concept of true score between psychological and physical measurement. As Crocker and Algina point out, if a physician suspects liver disease and examines a patient, the patient has an absolute true score on this variable. Though errors of measurement may occur, and different laboratory tests may give different results, these are quite independent of the patient's true score on liver disease. In psychological measurement, however, the true score is dependent on the measurement process used (1986:109-110). In other words, it is in fact the difficulty of defining psychological constructs for measurement that lies beneath this problem. Liver disease is an objectively definable construct with several physically measurable characteristics. Intelligence (in Crocker and Algina's example) or any other psychological and educational variable, on the other hand, is problematic in this respect.

Another problem to raise concerning the true score and its estimation is related to the actual procedures applied. Even the operational definition delineated earlier requires an indefinitely large number of administrations of the same test to the same candidate; moreover, these repetitions are not supposed to influence performance in any way. Clearly, in practical terms even this definition offers little help in determining the value of the true score. Indeed, the actual value of the true score cannot be determined. Instead, based on standard deviation figures, it is the Standard Error of Measurement (SEM) that provides information about the true score. As Crocker and Algina explain,

    Just as the total group has a standard deviation, theoretically each examinee's personal distribution of possible observed scores around the examinee's true score has a standard deviation. When these individual error standard deviations are averaged for the group, the result is called the standard error of measurement. (1986:122)

The information provided by the SEM can be used to establish a confidence interval around a particular candidate's observed score. This means that SEM figures make it possible to establish the probability of the candidate's true score falling within one standard deviation from the observed score. This also means that one can never really be sure about a candidate's true score; moreover, since SEM figures are based on the average of several candidates' individual standard errors, a particular candidate's standard error can quite possibly be different from the mean (Crocker and Algina 1986:123-124).

The other component in the original formula is measurement error. As we have seen so far, even in the discussion of the concept of true score, it is measurement error that has a crucial role. The smaller the measurement error, i.e. the less the test allows the candidate to be influenced by factors outside his/her competence, the closer the observed score is to the true score. And this brings us to the essence of the Classical True Score Model, namely the issue of reliability. Crocker and Algina (1986) present a practical definition of reliability: "reliability is the degree to which individuals' deviation ... scores remain relatively consistent over repeated administration of the same test or alternate test forms" (105). Naturally, reliability is a paramount concern. If test scores fluctuate dramatically over repeated administrations, the results cannot be used for decision making or, in fact, for any other purpose typical of educational measurement. Thus, ensuring high reliability or, in other words, keeping measurement error at a minimum must be a major concern in the process of developing any educational measurement instrument.
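As a concrete, entirely hypothetical illustration of how an SEM figure is used, the sketch below computes an SEM from an assumed test standard deviation and an assumed reliability estimate using the standard classical formula SEM = SD × sqrt(1 − reliability), which this excerpt does not spell out, and then places bands of one and two standard errors around an observed score. Under the usual normality assumption these bands are expected to contain the true score roughly 68% and 95% of the time.

import math

# Hypothetical illustration of the Standard Error of Measurement (SEM).
# All numbers are invented; the formula SEM = SD * sqrt(1 - reliability) is
# a standard classical-test-theory estimate, not taken from this excerpt.
test_sd = 8.0          # standard deviation of observed scores on the test
reliability = 0.81     # an assumed reliability estimate for the test

sem = test_sd * math.sqrt(1.0 - reliability)

observed_score = 62.0
# Under a normality assumption, a band of one SEM around the observed score
# is expected to contain the candidate's true score roughly 68% of the time,
# and a band of two SEMs roughly 95% of the time.
print(f"SEM = {sem:.2f}")
print(f"68% band: {observed_score - sem:.1f} to {observed_score + sem:.1f}")
print(f"95% band: {observed_score - 2*sem:.1f} to {observed_score + 2*sem:.1f}")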
The next section will provide a technical overview of the concept of reliability along with different approaches to and procedures for estimating test reliability.

1.3.2 Reliability

Based on the model, it is obvious that the observed variance of scores is equal to the variance of true scores plus the variance of measurement error. Reliability, then, is to indicate what proportion of the variability in observed scores is attributable to true score variability. Hence, reliability is defined as the ratio of the variance of true scores to the variance of observed scores (Linn and Werts 1979:54). If this value is 1, then the value of measurement error is zero. In other words, we have a perfectly reliable test.
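In symbols (my own notation; the excerpt states these relationships only verbally), the decomposition of observed-score variance and the resulting definition of the reliability coefficient can be written as

\[
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2},
\]

where \(\sigma_X^2\), \(\sigma_T^2\), and \(\sigma_E^2\) are the observed-score, true-score, and error variances, and \(\rho_{XX'}\) is the reliability coefficient. When \(\sigma_E^2 = 0\), \(\rho_{XX'} = 1\), which corresponds to the perfectly reliable test described above.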
Real tests, however, can never evade measurement error; moreover, it is in fact impossible to tell to what extent true score and error variance contribute to the variance of the observed score, respectively. Therefore, it is necessary to devise a second measurement instrument for which the true scores of every individual candidate are to be the same as on the first one, but for which measurement errors are independent. Thus, assuming that the variances of measurement errors are the same, the tests' reliability can be estimated through a correlation of observed scores (Linn and Werts 1979:54-55). The construction and administration of this second test, however, presents numerous problems. The literature tends to identify three approaches to estimating reliability based on the correlation of results from two measurements: the test-retest method, the alternate forms method, and the split half/halves method (Ebel 1972; Linn and Werts 1979; Krzanowski and Woods 1984; Crocker and Algina 1986; Hughes 1989).

As the term implies, the test-retest method is based on repeated administrations of the same test (Hughes 1989:32). Crocker and Algina (1986:133) call the reliability figure obtained from this procedure the coefficient of stability. The difficulties are apparent in this case. For this approach to yield meaningful results, it must be ensured that the results of the second administration are not influenced by the effect of repetition, i.e. practice or memory. Thus, a suitably long period of time must elapse between the two administrations. If the interval is too long, however, effects of learning or forgetting—i.e. changes in true score—may influence test scores (Krzanowski and Woods 1984:6). How long "suitably long" is, however, depends on the kind of trait measured and may vary between one day and even two years (Crocker and Algina 1986:133-134). Moreover, there seems to exist no objective means to monitor possible changes in true score. Ebel (1972) also points out that the test-retest method cannot account for possible changes in scores owing to a different sampling of items from a usually large population of possible items (412).

An alternative to the test-retest procedure is the alternate forms method (Hughes 1989:32). Here the second set of results is produced by a test that is different from, yet equivalent to, the first test (Ebel 1972:412): a kind of "twin," which has the same characteristic features but which is still a different "entity." The reliability figure obtained by this procedure is called the coefficient of equivalence (Crocker and Algina 1986:132). Hughes points out an obvious problem, claiming that "... alternate forms are often simply not available" (1989:32). Indeed, the very same concern is voiced from a different angle by Linn and Werts when they caution that the claimed equivalence of alternate forms is often based on "strong assumptions" (1979:55). To ensure equivalence, Crocker and Algina (1986:132) suggest comparing the means, standard deviations, and standard errors of measurement for both tests.

The third method of estimating reliability, though based on the correlation of two measurements, requires only one test administration. Here a single test is divided into two parts of equal length, and the parts are scored separately. Thus only one test administration is needed, which, however, results in two sets of scores, yielded by the two halves of the same test; hence the name, split half/halves method (Hughes 1989:32-33). The reliability estimate resulting from simple correlation figures would most probably be an underestimate of the reliability of the whole test, however, as the figures calculated are based only on half of the original test (Crocker and Algina 1986:136-137; Hughes 1989:33). To avoid this problem, the Spearman-Brown prophecy formula is used to obtain the corrected reliability figure for the entire test (Guilford and Fruchter 1978:426; Hughes 1989:158-159). The reliability coefficient obtained this way is a measure of the test's internal consistency (Hughes 1989:32).

Obviously, the weakest point of this method is the assumption that the two halves of the test are of equal level of difficulty. Indeed, Hughes points out that "... this method is rather like the alternate forms method, except that the two 'forms' are only half the length" (1989:33). Thus, the problems discussed earlier in relation to the alternate forms method are not completely eliminated here either. Consequently, estimating reliability with the split half method is only possible if the test can be divided into two parts of equal length and difficulty. This point is underlined by Ebel (1972:414), who claims that reliability figures estimated by means of this method may well be influenced by how the test is divided, as certain divisions may produce more closely equivalent parts than others.
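To make the split-half procedure concrete, here is a minimal sketch of my own (the dichotomous response matrix is invented, not data from the project described in this book): it scores the odd- and even-numbered items as two halves, correlates the half-test scores, and applies the Spearman-Brown prophecy formula to estimate the reliability of the full-length test.

import numpy as np

# Minimal sketch of split-half reliability with the Spearman-Brown correction.
# The response matrix is invented for illustration: rows are candidates,
# columns are dichotomously scored items (1 = correct, 0 = incorrect).
responses = np.array([
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 0, 1],
])

# Split the test into odd- and even-numbered items (one common way of forming
# two halves of equal length) and score each half separately.
half_a = responses[:, 0::2].sum(axis=1)
half_b = responses[:, 1::2].sum(axis=1)

# Correlation between the two half-test scores: the reliability of a half-length test.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown prophecy formula for a test twice as long (i.e. the full test).
r_full = 2 * r_half / (1 + r_half)

print(f"Half-test correlation: {r_half:.3f}")
print(f"Spearman-Brown corrected full-test reliability: {r_full:.3f}")

The odd/even split used here is only one possible division; as Ebel's point above suggests, a different split of the same items could produce a somewhat different estimate.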
As for the actual estimation procedures, several are described in the literature, and sources vary concerning which are the most common ones. Here I am going to present five procedures, each of which appears to be worthy of attention in some respect, as they are different in form, yet they produce near-equivalent results, at least under certain conditions.

Chronologically, the first two of these procedures were presented together in 1937 by Kuder and Richardson (in Crocker and Algina 1986:139). The formulae, KR20 and KR21, can only be used with dichotomously scorable items (Alderson, Clapham and Wall 1995:88). The difference between the two is that KR21 applies a somewhat simplified mathematical formula, in which computing each item variance is not required, assuming equal difficulty for each item. If items differ in their difficulty levels, KR21 estimates will be lower than KR20 figures (Crocker and Algina 1986:139; Alderson et al. 1995:88). At this point it is worth considering whether KR21 figures should be used at all, as in real-life tests it seems virtually impossible that items do not vary in difficulty. What needs to be borne in mind, however, is that KR21 can be used even with a simple calculator, and though figures may not be quite accurate, they can be used "... as a lower-bound estimate of the internal consistency coefficient ..." (Crocker and Algina 1986:139). Consequently, in classroom progress testing, where sophisticated computer software is not available, KR21 may provide an admittedly somewhat inaccurate, yet practically available solution for teachers to estimate reliability.

A fundamentally different approach to estimating reliability is that of Hoyt (referred to in Crocker and Algina 1986:140), which is based on analysis of variance. The method treats persons and items as sources of variation and uses the mean square term for persons and the mean square term for the residual variance from the analysis of variance summary table (Crocker and Algina 1986:140). The results of this procedure are identical to those of KR20. It should be noted here that though test analysis software tends not to apply this procedure, analysis of variance is usually a standard component of statistical computer software packages (e.g. SPSS). Thus Hoyt's method may be a more practical option for users of such generally applicable software, especially because the results of this analysis are more accurate than those of KR21.

The fourth method presents yet another approach. This estimation procedure was originally developed by Rulon (1939), and it is virtually identical to a more complex version developed somewhat later by Guttman (in Bachman 1990:175). The major difference of this approach compared to the others is that it does not assume the equivalence of the two halves of the original test and thus does not include the computation of correlation figures between them. Instead, its basis is the ratio of the sum of the variances of the test's two halves to the variance of the entire test (Bachman 1990:175). Bachman points out here that as the formula used in this procedure is based on the variance of the entire test, unlike in the methods involving the Spearman-Brown prophecy formula, in Rulon's method there is no need for additional correction procedures for length (Bachman 1990:175). If the variances of the two halves of the test are equal, the two methods yield identical results. However, as standard deviation figures get more and more dissimilar between the two halves, procedures involving the Spearman-Brown correction formula will yield systematically higher results than the figures obtained from Rulon's or Guttman's formulae (Crocker and Algina 1986:138).

The last major method to be discussed here is generally referred to as Cronbach's alpha. This being a general formula, the procedure can be applied both in the case of dichotomously scored items and with items having a range of scoring weights (e.g. essay components scored from 0 to 9). This feature makes it the most useful of the models discussed here in practical terms, as most measuring instruments tend to include various item types, some of which may necessitate partial credit scoring. It is commonly applied in test analysis software; indeed, one of the most readily available of such programs—Assessment Systems Corporation's ITEMAN—uses Cronbach's alpha to estimate reliability (Alderson et al. 1995:101). When used with dichotomously scored items, the results are identical to KR20's in this case, too (Crocker and Algina 1986:138; Henning 1987:84).

Having examined various procedures for estimating reliability, what remains


Language Testing and Evaluation
Volume 10

Series editors: Rüdiger Grotjahn and Günther Sigott

Gábor Szabó

Applying Item Response Theory in Language Test Item Bank Building

ISSN 1612-815X
ISBN 978-3-631-56851-4
197 pages

PETER LANG
Internationaler Verlag der Wissenschaften
Frankfurt am Main • Berlin • Bern • Bruxelles • New York • Oxford • Wien
