Classical Test Theory and Item Response Theory
Maneeratsami Pattanasombutsook, PhD.
Boromarajonani College of Nursing, Yala, Thailand.
Classical Test Theory (CTT) and Item Response Theory (IRT) are widely used statistical frameworks for measurement. CTT is approximately 100 years old and remains in common use because it is appropriate for certain situations. Although CTT has served the measurement community for most of the past century, the use of IRT has grown rapidly in recent decades. IRT is generally regarded as an improvement over CTT.
Classical Test Theory (CTT)
CTT is a theory about test scores that introduces three concepts: the test score (observed score), the true score, and the error score. Its basic formulation is a simple linear model linking the three:
O (observed score) = T (true score) + E (random error)
The CTT model assumes that
(1) true scores and error scores are uncorrelated,
(2) the average error score in the population of respondents is zero, and
(3) error scores on parallel tests are uncorrelated.
CTT assumes that measurement is imperfect. A person's observed score may differ from their true ability because the observed score is influenced by some degree of error. All potential sources of variation in the testing process, whether external conditions or conditions internal to the person, are treated as random error. The random error in observed scores is also assumed to be normally distributed and uncorrelated with the true scores. Given this equation, minimizing the error score, and thereby the difference between observed and true scores, is desirable because it brings observed scores closer to true scores.
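The linear model can be illustrated with a small simulation. The following is a minimal sketch, not part of the original formulation: the means, standard deviations, sample size, and function names are hypothetical, and the ratio of true-score variance to observed-score variance is used as the standard CTT definition of reliability.

```python
import random
import statistics

def simulate_ctt(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate O = T + E with hypothetical, illustrative values."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_people)]  # random error, mean 0
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

true_scores, errors, observed = simulate_ctt()

mean_error = statistics.fmean(errors)          # assumption (2): close to zero
r_true_error = pearson(true_scores, errors)    # assumption (1): close to zero
# Reliability as the share of observed-score variance due to true scores.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)

print(f"mean error:        {mean_error:.3f}")
print(f"corr(T, E):        {r_true_error:.3f}")
print(f"var(T) / var(O):   {reliability:.3f}")
```

With these illustrative settings the theoretical reliability is 100 / (100 + 25) = 0.8. Shrinking the error standard deviation moves the ratio toward 1, which is the sense in which reducing error brings observed scores closer to true scores.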
CTT models link test scores, rather than item scores, to true scores. Scores obtained from CTT applications are therefore entirely test dependent. In addition, the two item statistics (item difficulty and item discrimination) depend entirely on the sample of respondents who took the test, and reliability estimates likewise depend on the test scores of the particular samples used.
Advantages and implications of CTT
The main advantage of CTT is its relatively weak theoretical assumptions, which are easy to meet with real data and modest sample sizes, so CTT can be applied in many testing situations. CTT is useful for assessing the difficulty and discrimination of items and the precision with which an examination measures scores.
In application, the main purpose of CTT within psychometric testing is to understand and improve the reliability of psychological tests and assessments. Its main implications are as follows:
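The two item statistics can be computed directly from a scored response matrix. The sketch below uses a made-up 0/1 data set and hypothetical function names. It takes item difficulty to be the proportion of correct answers and item discrimination to be the corrected item-total correlation, which are common CTT choices but not the only ones.

```python
import statistics

# Hypothetical scored responses: rows are respondents, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def item_difficulty(data, item):
    """Proportion of respondents answering the item correctly (higher = easier)."""
    return statistics.fmean([row[item] for row in data])

def item_discrimination(data, item):
    """Corrected item-total correlation: item score vs. total score on the other items."""
    item_scores = [row[item] for row in data]
    rest_scores = [sum(row) - row[item] for row in data]
    return pearson(item_scores, rest_scores)

for i in range(len(responses[0])):
    print(i, round(item_difficulty(responses, i), 3),
          round(item_discrimination(responses, i), 3))

# Sample dependence: the same item looks easier in a higher-scoring subgroup.
high_group = [row for row in responses if sum(row) >= 3]
print(item_difficulty(responses, 2), item_difficulty(high_group, 2))
```

The last line shows the sample dependence described above: the third item has a difficulty of 0.5 in the full sample but 1.0 among the higher-scoring respondents.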
1) True scores in the population are assumed to be measured at the interval level and normally distributed.
2) Classical tests are built for the average respondent and do not measure respondents at the high or low extremes very well.
3) Statistics about test items depend on the respondent sample being representative of the population. They can be confidently generalized only to the population from which the sample was drawn, and generalization beyond that setting requires careful consideration.
4) The longer the test, the higher its reliability tends to be.
5) Researchers should not rely on reliability estimates from previous studies. Because the estimates are sample dependent, internal consistency should be estimated in every study using the sample obtained.
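Points 4 and 5 can be made concrete with two standard CTT formulas. The sketch below is illustrative only: the rating data and function names are hypothetical, Cronbach's alpha stands in for "internal consistency" (the text does not name a specific coefficient), and the Spearman-Brown formula assumes the added items are parallel to the existing ones.

```python
import statistics

def cronbach_alpha(data):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = len(data[0])
    item_vars = [statistics.variance([row[i] for row in data]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Hypothetical ratings on a 1-5 scale: rows are respondents, columns are items.
ratings = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 2, 1],
    [3, 4, 3, 3],
]

alpha = cronbach_alpha(ratings)
print(f"alpha in this sample: {alpha:.3f}")

# Doubling a test whose reliability is 0.70 (parallel items assumed).
print(f"predicted reliability at double length: {spearman_brown(0.70, 2):.3f}")
```

Because alpha is computed from the variances in the sample at hand, re-running `cronbach_alpha` on a different group of respondents will generally give a different value, which is why the estimate should be recomputed for each study.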
Item Response Theory (IRT)
Item response theory (IRT) refers to a family of mathematical models that establish a link between the properties of the items on an instrument, the individuals responding to those items, and the underlying trait being measured. IRT assumes that the latent construct (e.g. stress, knowledge, attitudes) and the items of a measure are located on the same unobservable continuum, and it focuses on establishing the individual’s position on that continuum. IRT models can be divided into two families: unidimensional and multidimensional. The models also vary in the number of item parameters (one-, two-, and three-parameter models), and there are non-parametric variants (Mokken scaling).
IRT Assumptions
The purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. IRT rests on the following assumptions:
1) Monotonicity – As the trait level increases, the probability of a correct response also increases.
2) Unidimensionality – The model assumes that there is one dominant latent trait being measured and that this trait is the driving force for the responses observed for each item in the measure.
3) Local Independence – Responses given to the separate items in a test are mutually independent given a certain level of ability.
4) Invariance – Item parameters can be estimated from any position on the item response curve. Accordingly, the parameters of an item can be estimated from any group of subjects who have answered the item.
Each item on a test has its own characteristic curve that describes the probability of getting that item right or wrong given the ability of the person.
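Local independence is what allows the probability of a whole response pattern to be written as a product of item probabilities, which in turn makes it possible to locate a person on the trait continuum. The sketch below assumes a two-parameter logistic curve with hypothetical discrimination and difficulty values, and it uses a simple grid search in place of the estimation routines found in IRT software.

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic curve: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def pattern_likelihood(theta, items, responses):
    """Likelihood of a 0/1 response pattern; the product form relies on local independence."""
    likelihood = 1.0
    for (a, b), u in zip(items, responses):
        p = irf_2pl(theta, a, b)
        likelihood *= p if u == 1 else (1.0 - p)
    return likelihood

def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of ability (a sketch, not production code)."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: pattern_likelihood(t, items, responses))

# Hypothetical (discrimination, difficulty) pairs for five items, easiest first.
items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

weaker = [1, 1, 0, 0, 0]     # answers only the two easiest items correctly
stronger = [1, 1, 1, 1, 0]   # misses only the hardest item

print("weaker pattern:  ", estimate_theta(items, weaker))
print("stronger pattern:", estimate_theta(items, stronger))
```

The respondent with more correct answers is placed higher on the continuum. Patterns that are all correct or all incorrect have no finite maximum likelihood estimate, so the grid search would simply return the edge of the grid for them.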
Item Response Function (IRF)
The IRF is the relation between respondents' differences on a construct and the probability of endorsing an item. A person's response to an item can be modeled by a mathematical item response function.
Item Characteristic Curve (ICC)
IRFs can be plotted as Item Characteristic Curves (ICCs), graphs that show the probability of endorsing the item as a function of respondent ability. Depending on the IRT model used, these curves indicate which items are more difficult and which items discriminate better on the attribute.
Item Information Function (IIF)
Each IRF can be transformed into an IIF. The information is an index representing the item's ability to differentiate among individuals.
Discrimination determines the height of the information function: tall, narrow IIFs indicate high discrimination, and short, wide IIFs indicate low discrimination.
Test Information Function
The test information function lets us judge the test as a whole and see in which part of the trait range it works best.
The IRT mathematical model is defined by its item parameters. The parameters that characterize items include difficulty (b), discrimination (a), a pseudo-guessing parameter (c), and an upper asymptote (d).
-Location (b): location on the difficulty range
"b" is the item difficulty, which determines the location of the IRF. It is an index of the respondent level for which the item is appropriate and typically ranges from -3 to +3, with 0 being an average respondent level.
-Discrimination (a): slope or correlation
"a" is the item's discrimination, which determines the steepness of the IRF. It is an index of how well the item differentiates low from high respondents and typically ranges from 0 to 2, where higher is better.
-Guessing (c)
"c" is a lower asymptote parameter for the IRF, typically close to 1/k, where k is the number of options. The inclusion of a "c" parameter suggests that respondents with a low trait level may still have a small probability of endorsing an item.
-Upper asymptote (d)
"d" is an upper asymptote parameter for the IRF. The inclusion of a "d" parameter suggests that respondents very high on the latent trait are not guaranteed to endorse the item.
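The four parameters combine in the logistic item response function P(θ) = c + (d - c) / (1 + exp(-a(θ - b))). The sketch below is a minimal illustration: the function name and item values are hypothetical, and no scaling constant (such as the 1.7 sometimes used to approximate the normal ogive) is applied.

```python
import math

def irf_4pl(theta, a, b, c=0.0, d=1.0):
    """Four-parameter logistic IRF.

    a: discrimination (slope), b: difficulty (location),
    c: lower asymptote (guessing), d: upper asymptote.
    With c = 0 and d = 1 this reduces to the two-parameter model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical four-option multiple-choice item, so c is set to 1/k = 0.25.
a, b, c = 1.2, 0.0, 0.25

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta = {theta:+.1f}  P = {irf_4pl(theta, a, b, c=c):.3f}")

# At theta = b the probability is halfway between the asymptotes: (c + d) / 2.
print(irf_4pl(b, a, b, c=c))
```

Setting `d` below 1 (for example `d=0.95`) caps the curve, so even respondents very high on the trait are not certain to endorse the item, while `c` keeps the curve from dropping below the guessing level for respondents very low on the trait.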
Advantages and Disadvantages of IRT

IRT provides flexibility in situations where different samples or test forms are used. Because the unit of analysis in IRT models is the item, they can be used to compare items from different measures, provided that the measures assess the same latent construct. They can also be used in differential item functioning analysis to assess why items that have been calibrated and tested still behave differently across groups. These properties make IRT the foundation for computerized adaptive testing.
IRT models are generally not sample or test dependent.
However, IRT rests on strict assumptions, typically requires large samples (a minimum of 200, and 1000 for complex models), and is more difficult to use than CTT: IRT scoring generally requires relatively complex estimation procedures, computer programs are not readily available, and the models are complex and difficult to understand.