Introduction to Classical Test Theory

2017-03-31 Sun Qianhui (孙千惠)

Qing Chun Sui Yue (青春岁月), 2017, Issue 3

Abstract: This paper gives an introduction to Classical Test Theory (CTT), including its history, its procedure, and its extensions. It also lists the shortcomings of CTT and the reasons for its decline.

Key words: CTT; theory introduction

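[Abstract] This paper introduces classical test theory and describes its history, its procedure of use, and its extensions. It also discusses the shortcomings of the theory and the reasons for its gradual decline.

[Key words] classical test theory; theory introduction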

1. Introduction

Classical Test Theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. Generally speaking, the aim of the theory is to understand and improve the reliability of psychological tests.

2. History

CTT emerged only after three ideas had been conceptualized: a recognition of the presence of errors in measurements, a conception of that error as a random variable, and a conception of correlation and how to index it. In 1904, Charles Spearman worked out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction; his finding is regarded as the beginning of the theory (Traub, 1997). Others who influenced the theory's framework include G. U. Yule, G. F. Kuder and M. W. Richardson (of the Kuder-Richardson formulas), and M. R. Novick. CTT as we know it today was codified by Novick (1966) and described in classic texts such as Lord & Novick (1968) and Allen & Yen (1979/2002).

Spearman laid the foundations of the theory in 1904, but it was only loosely used until 1966, when Novick brought it to the forefront of psychological theory (Novick, 1966). CTT can be described as true-score theory: it takes into account the previous scores of a test item or a test-taking population to predict a future score for the same item or population. Using previous scores, theorists can predict which test questions will be answered correctly and which populations tend to answer the questions successfully. Successful responses are then referred to as normative responses.

When considering a population, the entire population must be taken into account. For example, if all of the eleventh graders in the United States took the Advanced Placement Exam (APE) for English and the same overall score was obtained trial after trial, that score would be identified as the normative score for the population. Such a score says little about any single individual, who may score higher or lower than the normative score; nevertheless, CTT can support reliable conclusions about populations or about individuals, depending upon the purpose of the test.

CTT assumes that each person has a true score, T, that would be obtained if there were no errors in measurement. Unfortunately, test users never observe a person's true score, only an observed score, X, which is assumed to equal the true score T plus some error E, that is, $X = T + E$. The relations between the three variables X, T and E are used to describe the quality of test scores. The reliability of the observed test scores X, denoted $\rho^2_{XT}$, is defined as the ratio of the true score variance $\sigma^2_T$ to the observed score variance $\sigma^2_X$:

$$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X}$$

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

$$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$

This equation, which formulates a signal-to-noise ratio, has intuitive appeal: The reliability of test scores becomes higher as the proportion of error variance in the test scores becomes lower and vice versa. The reliability is equal to the proportion of the variance in test scores that we could explain if we knew true scores. The square root of the reliability is the correlation between true and observed scores.
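The relation between the variance ratio and the true-observed correlation can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes normally distributed true scores and errors that are independent of each other, and the sample size, means, and standard deviations are arbitrary illustrative values. With a true-score standard deviation of 10 and an error standard deviation of 5, the theoretical reliability is 100 / (100 + 25) = 0.8.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_reliability(n_people=20000, sd_true=10.0, sd_error=5.0, seed=0):
    """Simulate X = T + E and estimate reliability in two ways.

    All parameter values are illustrative assumptions.
    Returns (var(T) / var(X), squared correlation between T and X).
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, sd_true) for _ in range(n_people)]
    errors = [rng.gauss(0.0, sd_error) for _ in range(n_people)]  # independent of T
    observed = [t + e for t, e in zip(true_scores, errors)]

    variance_ratio = statistics.pvariance(true_scores) / statistics.pvariance(observed)
    squared_correlation = pearson(true_scores, observed) ** 2
    return variance_ratio, squared_correlation


ratio, r_squared = simulate_reliability()
print(f"var(T)/var(X) = {ratio:.3f}, corr(T, X)^2 = {r_squared:.3f}")
```

Both estimates should land close to 0.8 and close to each other. In practice true scores are never observed, so this kind of check is only possible in simulation.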

3. The process of CTT

The process consists of five steps: 1. formulate the question; 2. collect data; 3. analyze the data; 4. interpret the data; 5. draw a conclusion.

The data may be measured on one of the following scales: 1. nominal scale; 2. ordinal scale; 3. interval scale.

4. Item Discrimination

The more an item discriminates among individuals with different amounts of the underlying concept of interest, the higher the item-discrimination index. The extreme group method can be used to calculate the discrimination index using the following 3 steps. Step 1 is to partition respondents who have the highest and lowest overall scores on the overall scale, aggregated across all items, into upper and lower groups. Step 2 is to examine each item and determine the proportion of individual respondents in the sample who endorse or respond to each item in upper and lower groups. Step 3 is to subtract the pair of proportions noted in Step 2. The higher this item-discrimination index, the more the item discriminates. It is useful to compare the discrimination indexes of each of the items in the scale.
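The three steps of the extreme group method can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes dichotomously scored items (1 = endorsed or correct, 0 = otherwise), and the function name, the sample data, and the 27% group fraction (a common convention, not something stated in this paper) are assumptions.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Extreme-group discrimination index for one item.

    responses: list of rows, one per respondent, each a list of 0/1 item scores.
    item: column index of the item to examine.
    fraction: share of respondents placed in each extreme group (assumed value).
    """
    n_group = max(1, round(len(responses) * fraction))

    # Step 1: rank respondents by total score and take the upper and lower groups.
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:n_group], ranked[-n_group:]

    # Step 2: proportion endorsing the item in each group.
    p_upper = sum(row[item] for row in upper) / n_group
    p_lower = sum(row[item] for row in lower) / n_group

    # Step 3: subtract the two proportions.
    return p_upper - p_lower


# Hypothetical data: 10 respondents, 4 items.
data = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
    [0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0],
]
for i in range(4):
    print(f"item {i}: D = {discrimination_index(data, i):.2f}")
```

In this made-up data set the second item (index 1) is endorsed by everyone in the upper group and no one in the lower group, so it has the highest index; the last item discriminates least. Ties in total score at the group boundary are broken by input order here, which a real analysis would need to handle more carefully.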

5. Second language test

ESL students are the fastest-growing community of school-age children, and it is now common to have a non-native English speaker in the classroom. However, there is only one exam given to ESL students, the Test of English as a Foreign Language (TOEFL), which serves as an entrance exam for students applying to college. The TOEFL is a standardized, multiple-choice exam. Dudley (2006) argues that multiple true-false (MTF) exams can be just as reliable as, and a valid alternative to, multiple-choice tests, which can be confusing to students (p. 199).

Dudley (2006) took two tests that were multiple-choice in nature and converted them to a multiple true-false format. He notes that the findings support the MTF format (Dudley, 2006, p. 224). He also notes that the conclusions of the study provide sound empirical evidence that central factors such as item interdependence, reliability and concurrent validity are viable with MTF items that assess vocabulary and reading comprehension in the realm of norm-referenced testing (p. 224). Even though Dudley's (2006) focus was on undergraduate students, it is not a far reach to suggest that teachers in the K-12 sector could begin creating exams of an MTF nature, or converting existing multiple-choice exams to MTF, using CTT.

6. Reliability

Reliability is important in the development of patient-reported outcome (PRO) measures. Validity is limited by reliability: if responses are inconsistent (unreliable), the measure is necessarily invalid. Reliability refers to the proportion of variance in a measure that can be ascribed to a common characteristic shared by the individual items, whereas validity refers to whether that characteristic is actually the one intended.

Test-retest reliability, which can apply to both single-item and multi-item scales, reflects the reproducibility of scale scores on repeated administrations over a period during which the respondents' condition did not change. To compute test-retest reliability, the kappa statistic can be used for categorical responses, and the intraclass correlation coefficient can be used for continuous responses. Further, having multiple items in a scale increases its reliability. In multi-item scales, a common indicator of scale reliability is Cronbach's coefficient alpha, which is driven by the number of items and the correlations among items in the scale.
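For categorical responses, the kappa statistic mentioned above can be illustrated with Cohen's (unweighted) kappa, which compares the observed agreement between two administrations with the agreement expected by chance. The sketch below is an assumption about which kappa variant is meant, and the function name and the yes/no data are made up for illustration.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's unweighted kappa for two administrations of a categorical item.

    first, second: equal-length sequences of category labels, one pair per respondent.
    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(first)
    # Observed proportion of respondents giving the same answer both times.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal category frequencies.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical test and retest answers from 10 respondents.
test = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
retest = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(f"kappa = {cohens_kappa(test, retest):.3f}")  # observed 0.80, chance 0.52 -> 0.583
```

Here 8 of 10 answers agree, but 52% agreement would be expected by chance from the marginal frequencies alone, so kappa is noticeably lower than the raw agreement rate.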

The greater the proportion of shared variation, the more the items share in common and the more consistent they are in reflecting a common true score. The covariance-based formula for coefficient alpha expresses such reliability while adjusting for the number of items contributing to the prior calculations on the variances. The corresponding correlation-based formula, an alternative expression, represents coefficient alpha as the mean inter-item correlation among all pairs of items after adjustment for the number of items.
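The two expressions for coefficient alpha can be written out as follows. This sketch uses the standard textbook forms, which the paper describes in words but does not print: the covariance-based form k / (k - 1) * (1 - sum of item variances / variance of total scores) and the correlation-based (standardized) form k * r / (1 + (k - 1) * r), where k is the number of items and r is the mean inter-item correlation. Function names and data are illustrative assumptions.

```python
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(scores):
    """Covariance-based alpha. scores: rows are respondents, columns are items."""
    k = len(scores[0])
    item_variances = [statistics.variance(col) for col in zip(*scores)]
    total_variance = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def standardized_alpha(scores):
    """Correlation-based alpha from the mean inter-item correlation."""
    k = len(scores[0])
    items = list(zip(*scores))
    mean_r = statistics.fmean(pearson(a, b) for a, b in combinations(items, 2))
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical two-item scale answered by four respondents.
two_items = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(cronbach_alpha(two_items), standardized_alpha(two_items))  # both about 0.75
```

The two forms agree here because the two items have equal variances; when item variances differ, the standardized form generally gives a somewhat different value.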

7. Shortcomings

One of the best-known shortcomings of CTT is that examinee characteristics and test characteristics cannot be separated: each can only be interpreted in the context of the other. Another shortcoming lies in the definition of reliability in CTT, which states that reliability is "the correlation between test scores on parallel forms of a test". The problem is that the various reliability coefficients provide either lower-bound estimates of reliability or reliability estimates with unknown biases. A third shortcoming involves the standard error of measurement, which is assumed to be the same for all examinees. However, as Hambleton and his colleagues explain, scores on any test are unequally precise measures for examinees of different ability, which makes the assumption of equal errors of measurement for all examinees implausible (Hambleton, Swaminathan, & Rogers, 1991, p. 4). A fourth and final shortcoming of CTT is that it is test oriented rather than item oriented. In other words, CTT cannot help us predict how well an individual, or even a group of examinees, might do on a given test item.

What makes CTT effective is also its primary weakness: the normative scores used to predict future scores are specific to the samples previously studied. An individual may have received the highest score on the exam, yet is still grouped with the whole population of test-takers when the APE results are used to predict future success or effectiveness (Reid et al., 2007, p. 179). A second problem with CTT is that an entire testing instrument has to be completed before it yields useful, predictive information about a population or an individual; only the completed exam matters. Finally, as Reid et al. (2007) point out, "the instability of scores at extreme levels of an ability or trait, even within the normative sample" is a concern with CTT (p. 179).

8. Conclusion

Although CTT has many shortcomings, it remains a well-known theory that has contributed much to education and to measurement modeling. It is a practical way of tackling complex questions and problems by collecting data, analyzing the data, and drawing conclusions.

【References】

[1] American Psychiatric Association. Diagnostic and statistical manual of mental disorders (3rd ed., rev.)[M]. Washington, DC: Author, 1987.

[2] Bolton, B. Handbook of measurement and evaluation in rehabilitation (3rd ed.)[M]. Gaithersburg, MD: Aspen, 2001.

[3] Brown, J. D., & Hudson, T. Criterion-referenced language testing[M]. New York, NY: Cambridge University Press, 2002.

[4] Corkum, P., Andreou, P., Schachar, R., Tannock, R., & Cunningham, C. The Telephone Interview Probe[J]. Educational & Psychological Measurement, 2007, 67: 169-185.

[5] Cronbach, L. J. Note on the multiple true-false test exercise[J]. Journal of Educational Psychology, 1939, 30: 628-631.

[6] Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. Theory of generalizability: A liberalization of reliability theory[J]. The British Journal of Statistical Psychology, 1963, 16: 137-163.

[7] Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles[M]. New York: John Wiley, 1972.

[8] Dudley, A. Multiple dichotomous-scored items in second language testing: Investigating the multiple true-false item type under norm-referenced conditions[J]. Language Testing, 2006, 23: 198-228.

[9] Haladyna, T. M. Developing and validating multiple-choice test items (2nd ed.)[M]. Mahwah, NJ: Lawrence Erlbaum, 1999.

[10] Koppitz, E. M. Psychological evaluation of children's human-figure drawings[M]. New York: Grune & Stratton, 1968.

[11] Novick, M. R. The axioms and principal results of classical test theory[J]. Journal of Mathematical Psychology, 1966, 3: 1-18.

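【About the author】

Sun Qianhui (孙千惠; born 1992), female, Han ethnicity, holds a master's degree and is a college English teaching assistant at the Logistics University of the People's Armed Police Force in Tianjin. Research interests: foreign linguistics and applied linguistics.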