#15
How does IRT information replace CTT reliability? Can this
be found on the audit tool (Table 45)?
This post relates my audit tool, Table 45, "Comparison of
Conditional Error of Measurement between Normal [CTT] Classroom Calculation and
the IRT Model", to a quote from Wikipedia
(Information). I am confident that the math is correct. I need to
clarify the concepts for which the math is making estimates.
Table 45
“One of the major contributions of item response theory is
the extension of the concept of reliability. Traditionally, reliability refers
to the precision of measurement (i.e., the degree to which measurement is free
of error). And traditionally, it is measured using a single index defined in various
ways, such as the ratio of true and observed score variance.”
See test reliability (a ratio): KR20, true/total variance = 0.29 (Table 45a).
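For anyone who wants to check a KR20 value against a spreadsheet, here is a minimal Python sketch, assuming a students-by-items matrix of 0/1 marks; the kr20 function name and the tiny demo matrix are mine for illustration, not the classroom data behind Table 45a.

```python
import numpy as np

def kr20(scores):
    """KR-20 reliability for a students x items matrix of 0/1 marks.

    KR-20 = (k / (k - 1)) * (1 - sum(p*q) / total-score variance),
    an estimate of the true-score / observed-score variance ratio.
    """
    scores = np.asarray(scores, dtype=float)
    n_students, k = scores.shape
    p = scores.mean(axis=0)            # item difficulty (proportion correct)
    q = 1.0 - p
    totals = scores.sum(axis=1)        # each student's raw score
    var_total = totals.var(ddof=0)     # population variance, like Excel =VAR.P
    return (k / (k - 1)) * (1.0 - (p * q).sum() / var_total)

# Tiny made-up example (a Guttman-like pattern, not the Table 45a data):
demo = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(round(kr20(demo), 2))            # prints 0.8 for this made-up matrix
```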
“This index is helpful in characterizing a test’s average
reliability, for example in order to compare two tests.”
The test reliabilities for CTT and IRT are also comparable on
Tables 45a and 45c: 0.29 and 0.27.
“But IRT makes it clear that precision is not uniform across
the entire range of test scores. Scores at the edges of the test’s range, for example,
generally have more error associated with them than scores closer to the middle
of the range.”
Table 46
Chart 82
See Table 45c (classroom data) and Table 46, columns 9-10 (dummy
data). For CTT the values are inverted (Chart 82,
classroom data, and Chart 89, dummy data).
Chart 89
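Why the two curves look inverted can be sketched under the simplifying assumption that every item is of about equal difficulty, so the test information at a raw score of X right out of k items is roughly k*p*q with p = X/k. This is only an illustration of the shape, not the Winsteps calculation behind Chart 82 or Chart 89; at 17 right out of 21 it reproduces the 3.24, 1.80, and 0.56 worked through below.

```python
import numpy as np

k = 21  # number of items, as in the dummy data

print("right  info  CTT error (counts)  IRT SE (measures)")
for right in range(1, k):               # skip 0 and 21, where p*q = 0
    p = right / k                       # proportion correct at this raw score
    info = k * p * (1 - p)              # test information, equal-item case
    print(f"{right:5d}  {info:4.2f}  {np.sqrt(info):18.2f}  {1/np.sqrt(info):17.2f}")

# The count-scale error peaks at the middle score and shrinks at the edges;
# the measure-scale SE does the opposite. The two columns are reciprocals.
```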
“Item response theory advances the concept of item and test
information. Information is also a function
of the model parameters. For example, according to Fisher information theory,
the item information supplied in the case of the 1PL for dichotomous response
data is simply the probability of a correct response multiplied by the
probability of an incorrect response, or,
. . .” [I = pq].
See Table 45c, p*q CELL INFORMATION (classroom data). Also
on Chart 89, the cell variance (CTT) and cell information (IRT) have identical
values (0.15) from Excel =VAR.P and from pq (Table 46, col 7, dummy data).
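The match is not a coincidence: for any column of 0/1 marks, the population variance (Excel =VAR.P) is exactly p*q. A minimal check, using a made-up mark column with 17 of 21 correct:

```python
import numpy as np

# A made-up 0/1 mark column for one cell/item: 17 of 21 students correct,
# matching the dummy-data proportion in Table 46.
marks = np.array([1] * 17 + [0] * 4)

p = marks.mean()                    # proportion correct
q = 1 - p
print(round(marks.var(), 4))        # population variance, like Excel =VAR.P
print(round(p * q, 4))              # item information for the 1PL, I = p*q
# Both lines print 0.1542 (about 0.15): the CTT cell variance and the
# IRT cell information are the same number.
```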
“The standard error of estimation (SE) is the reciprocal of
the test information at a given trait level . . .” [SE = 1/SQRT(information)].
Is the “test information … at a given trait level” the Score
Information (3.24, red, Chart 89, dummy data) for 17 right out of 21 items?
Then the reciprocal of 3.24 is 0.31, the error variance (green, Chart 89 and
Table 46, col 9) in measures on a logit scale. The IRT conditional error of
estimation (SE) would then be the square root: SQRT(0.31) = 0.56 in measures.
Inverting this yields the CTT CSEM: 1/0.56 = 1.80 in counts.
[[Or SQRT(SUM(p*q)) = SQRT(0.154 * 21) = SQRT(3.24) =
1.80 (in counts), and the reciprocal is 1/1.80 = 0.56 in measures; 0.154 is the
unrounded value behind the 0.15 shown in Table 46.]]
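The whole chain for 17 right out of 21 items can be verified in a few lines of arithmetic; this only replays the values above and is not the Winsteps estimation itself.

```python
import numpy as np

# Dummy-data check for 17 right out of 21 items (Chart 89 / Table 46 values).
right, k = 17, 21
p = right / k
item_info = p * (1 - p)                 # about 0.154, rounds to 0.15
score_info = k * item_info              # 3.24  score (test) information
error_var = 1 / score_info              # 0.31  error variance in measures
se_measures = np.sqrt(error_var)        # 0.56  SE on the logit scale
csem_counts = 1 / se_measures           # 1.80  = SQRT(score_info), in counts
print(round(score_info, 2), round(error_var, 2),
      round(se_measures, 2), round(csem_counts, 2))
```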
The IRT (CSEM) in Chart 89 is really the IRT standard error
of estimation (SE or SEE). On Table 45c, the CSEM (SQRT) is also the SE
(conditional error of estimation), obtained from the square root of the error
variance for that ability level (17 right, 1.73 measures, or 0.81 = 81%).
“Thus more information implies less error of measurement.”
See Table 45c, CSEM, green, and Table 46, col 9-10.
“In general, item information functions tend to look
bell-shaped. Highly discriminating items have tall, narrow information
functions; they contribute greatly but over a narrow range. Less discriminating
items provide less information but over a wider range.”
Chart 92
Table 47
The same generality applies to the item information
functions (IIFs) in Chart 92, but it is not very evident. The item with a
difficulty of 10 (IIF = 1.80, Table 47) is also highly discriminating. The two easiest items had negative discrimination; they show
an increase in information as student ability decreases toward zero measure. The generality applies best near the
average test raw score of 50%, or zero measure, which is not on the chart (no student got a score of 50% on this test).
This test had an average test score of 80%. This has spread the item information
function curves out (Chart 92). They are not centered on the raw score of 50%
or the zero measure location. However,
each peaks near the point where item difficulty in measures is close to student
ability in measures. This observation is critical in establishing the
value of IRT item analysis and how it is used. This makes sense in measures (a natural-log
scale of the ratio of right to wrong marks) but not in raw scores (counts).
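That peaking behavior can be sketched with the Rasch (1PL) item information, I(theta) = p(1 − p) with p = 1/(1 + exp(−(theta − b))); the three difficulties below are made up for illustration, not the Table 47 items.

```python
import numpy as np

def rasch_item_information(theta, b):
    """Rasch (1PL) item information: I(theta) = p * (1 - p),
    where p = 1 / (1 + exp(-(theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

theta = np.linspace(-4, 4, 801)          # ability in measures (logits)
for b in (-1.0, 0.0, 2.0):               # made-up item difficulties
    info = rasch_item_information(theta, b)
    peak = theta[np.argmax(info)]
    print(f"difficulty {b:+.1f}: peak information {info.max():.2f} at theta {peak:+.2f}")

# Each curve peaks (at 0.25 under this model) exactly where student ability
# equals item difficulty on the measure scale.
```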
“Plots of item information can be used to see how much
information an item contributes and to what portion of the scale score range.”
This is very evident in Table 47 and Chart 92.
“Because of local independence, item information functions
are additive.”
See Test SEM (in measures) = 0.64 on Table 45c, identical to the
Winsteps Table 17.1 MODEL S.E. MEAN.
“Thus, the test information function is simply the sum of
the information functions of the items on the exam. Using this property with a
large item bank, test information functions can be shaped to control
measurement error very precisely.”
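A minimal sketch of that additivity, again assuming Rasch item information and a handful of made-up item difficulties (not the Table 47 values): the test information at an ability level is the plain sum of the item informations, and the standard error is the reciprocal of its square root.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Made-up difficulties for a short test (not the Table 47 values).
difficulties = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

theta = 0.8                                   # one ability level in measures
p = rasch_p(theta, difficulties)
item_info = p * (1 - p)                       # information from each item
test_info = item_info.sum()                   # additivity: TIF = sum of IIFs
se = 1 / np.sqrt(test_info)                   # standard error at this theta
print(round(test_info, 2), round(se, 2))
```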
“Characterizing
the accuracy of test scores is perhaps the central issue in psychometric theory
and is a chief difference between IRT and CTT. IRT findings reveal that the CTT
concept of reliability is a simplification.”
At this point my audit tool, Table 45, falls silent. These
two mathematical models are only a means of estimating theoretical values;
they are not the theoretical values, nor are they the reasoning behind them. CTT
starts from observed values and projects into the general environment. IRT can start with the perfect Rasch model and select observations that fit the model. The
two models are looking in opposite directions. CTT uses a linear scale with the
origin at zero counts. IRT sets its log-ratio origin (zero) at the 50%
CTT point. For now, I must accept on authority the concept that CTT is a
simplification of IRT.
“In the place of reliability, IRT offers the test
information function which shows the degree of precision at different values of
theta, [student ability].”
I would word this, “In ADDITION to reliability” (Table 45a,
CTT = 0.29, and Table 45c, IRT = 0.27). Also, “IRT offers the ITEM information
function which shows the degree of precision at different values . . .”
“These results allow psychometricians to (potentially)
carefully shape the level of reliability for different ranges of ability by
including carefully chosen items. For example, in a certification situation in
which a test can only be passed or failed, where there is only a single
“cutscore,” and where the actual passing score is unimportant, a very
efficient test can be developed by selecting only items that have high
information near the cutscore. These items generally correspond to items whose
difficulty is about the same as that of the cutscore.”
The eleven item information functions in Table 47 and Chart 92 each peak near
the point where item difficulty in measures is close to student ability in
measures. The discovery or invention of this relationship is the key advantage
of IRT over CTT.
These data show that a test item need not have the commonly
recommended average score near 50% for usable results. Any cutscore from 50%
to 80% would produce usable results on this test, which had an average score of
80% and a cutscore (passing) of 70%.
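The item-selection idea in the certification quote above can be sketched with a hypothetical item bank and an assumed cutscore location on the logit scale; under the Rasch model the items with the most information at the cutscore are simply those whose difficulty is closest to it.

```python
import numpy as np

def rasch_info(theta, b):
    """Rasch item information at ability theta for difficulty b."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

# Hypothetical item bank of difficulties in measures (not the Table 47 items).
bank = np.array([-2.5, -1.5, -0.8, -0.2, 0.3, 0.7, 1.0, 1.4, 2.0, 3.0])
cutscore_theta = 0.9      # assumed cutscore location on the logit scale
n_pick = 5

# Pick the items with the most information at the cutscore; for the Rasch
# model these are the items whose difficulty is closest to it.
info_at_cut = rasch_info(cutscore_theta, bank)
picked = bank[np.argsort(info_at_cut)[::-1][:n_pick]]
print(np.sort(picked))                                     # chosen difficulties
print(round(rasch_info(cutscore_theta, picked).sum(), 2))  # test info at cut
```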
"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."
My understanding is that with CTT an item may be 50% difficult for the class without revealing how difficult it is for each student (no location). With IRT every item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).
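That location idea can be checked in one line under the Rasch model: whenever student ability equals item difficulty, the expected chance of a right answer is exactly 0.5, whatever the shared location happens to be. The locations below are made up.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Made-up locations on the logit scale: whenever ability equals difficulty,
# the item is exactly 50% difficult for that student.
for location in (-2.0, 0.0, 1.5):
    print(location, rasch_p(location, location))   # always 0.5
```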
"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."
My understanding is that with CTT an item may be 50% difficult for the class without reveiling how difficult it is for each student (no location). With IRT ever item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).
I do not know what part of IRT is invention and what part is discovery on the part of some ingenious people. Two basic parts had to be fitted together, information and measures, by way of an inversion. Then a story had to be created to market the finished product; the Rasch model and Winsteps (full and partial credit) are the limit of my experience.
The unfortunate choice of the name "partial credit", rather than knowledge (or skill) and judgment, may have been a factor in the Rasch partial credit model not becoming popular. The name "partial credit" falls into the realm of psychometrician tools. The name "Knowledge and Judgment" falls into the realm of classroom tools needed to guide the development of scholars and to obtain maximum information from paper standardized tests, where students individually customize their tests (accurately, honestly, and fairly), rather than CAT, where the test is tailored to fit the student using best-guess, dated, and questionable second-hand information.
IRT makes CAT possible. Please see "Adaptive Testing Evolves to Assess Common-Core Skills" for current marketing, use, and a list of comments, including two of mine. The exaggerated claims of test makers to assess and promote developing students with forced-choice, lower-level-of-thinking tests continue to be ignored in the marketing of these tests to assess Common Core skills. Increased precision of nonsense still takes precedence over an assessment that is compatible with and supports the classroom and scholarship.
Serious mastery: Knowledge Factor.
Student development: Knowledge and Judgment Scoring (Free Power Up Plus) and IRT Rasch partial credit (Free Ministep).
Ranking: Forced-choice on paper or CAT.