Wednesday, May 13, 2015

Information and Reliability

How does IRT information replace CTT reliability? Can this be found on the audit tool (Table 45)?

This post relates my audit tool, Table 45, Comparison of Conditional Error of Measurement between Normal [CTT] Classroom Calculation and the IRT Model to a quote from Wikipedia (Information). I am confident that the math is correct. I need to clarify the concepts for which the math is making estimates.

Table 45
“One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score variance.”

See test reliability (a ratio), KR20, True/Total Variance, 0.29 (Table 45a).

“This index is helpful in characterizing a test’s average reliability, for example in order to compare two tests.”

The test reliabilities for CTT and IRT are also comparable in Tables 45a and 45c: 0.29 and 0.27.
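As a rough illustration of the single-index reliability in the quote, here is a minimal sketch of the standard KR20 formula. The mark matrix is hypothetical (it is not the classroom data behind Table 45), and using the population variance is my assumption, chosen to match the Excel =VAR.P used elsewhere in this post.

```python
from statistics import pvariance


def kr20(marks):
    """KR20 for a list of student rows, each a list of 0/1 marks."""
    n_students = len(marks)
    n_items = len(marks[0])
    # Item difficulty p (proportion right) for each item column.
    p = [sum(row[j] for row in marks) / n_students for j in range(n_items)]
    # Sum of item variances p*q.
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of the student total scores (assumption: VAR.P).
    total_variance = pvariance([sum(row) for row in marks])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical marks: 5 students by 4 items.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(marks)
print(f"KR20 = {reliability:.2f}")
```

One number summarizes the whole test; it says nothing about how precision changes from one score to another.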

“But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test’s range, for example, generally have more error associated with them than scores closer to the middle of the range.”

Table 46
Chart 82
See Table 45c (classroom data) and Table 46, col 9-10 (dummy data). For CTT the values are inverted (Chart 82, classroom data and Chart 89, dummy data).
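The inversion can be sketched for a 21-item test using the same pq arithmetic applied later in this post. This is only an illustration under the assumption that every cell in a score row carries the same p (right/items); the function names are mine, not from the audit tool.

```python
import math

N_ITEMS = 21  # same test length as the dummy data example


def csem_counts(right):
    """CTT-style conditional error in counts: SQRT(SUM(p*q))."""
    p = right / N_ITEMS
    return math.sqrt(N_ITEMS * p * (1 - p))


def se_measures(right):
    """IRT-style error in measures: the reciprocal of the value above."""
    return 1 / csem_counts(right)


# Scores of 0 and 21 are skipped: p*q is zero there and the reciprocal is undefined.
for right in range(1, N_ITEMS):
    print(f"{right:2d} right  CSEM {csem_counts(right):.2f} counts"
          f"  SE {se_measures(right):.2f} measures")
```

The count-scale error is largest near the middle of the range, while the measure-scale error is largest at the edges.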

Chart 89
“Item response theory advances the concept of item and test information. Information is also a function of the model parameters. For example, according to Fisher information theory, the item information supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or,  . . .” [I = pq]. 

See Table 45c, p*q CELL INFORMATION (classroom data). Also on Chart 89, the cell variance (CTT) and cell information (IRT) have identical values (0.15) from Excel =VAR.P and from pq (Table 46, col 7, dummy data).
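The match between cell variance and cell information can be checked directly. A minimal sketch, assuming a row of 17 right marks and 4 wrong marks (the 17 out of 21 example used below); `pvariance` is Python's counterpart to Excel's =VAR.P.

```python
from statistics import pvariance

# One student's marks: 17 right (1) and 4 wrong (0) out of 21 items.
marks = [1] * 17 + [0] * 4

p = sum(marks) / len(marks)      # proportion right
cell_information = p * (1 - p)   # IRT: I = pq
cell_variance = pvariance(marks)  # CTT: population variance of the 0/1 marks

print(f"pq = {cell_information:.4f}, VAR.P = {cell_variance:.4f}")
```

For right/wrong marks the population variance reduces to p - p^2, which is pq, so the two values agree for any row of 0/1 marks.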

“The standard error of estimation (SE) is the reciprocal of the test information of at a given trait level, is the . . .” [1/SQRT(pq)].

Is the “test information … at a given trait level” the Score Information (3.24, red, Chart 89, dummy data) for 17 right out of 21 items? Then the reciprocal of 3.24 is 0.31, the error variance (green, Chart 89 and Table 46, col 9) in measures on a logit scale. And the IRT conditional error of estimation (SE) would be the square root: SQRT(0.31) = 0.56 in measures. And this inverted would yield the CTT CSEM: 1/0.56 = 1.80 in counts.

[[Or the SQRT(SUM(p*q)) = SQRT((0.15) * 21) = SQRT(3.24) = 1.80 (in counts) and the reciprocal is 1/1.80 = 0.56 in measures.]]
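The chain of values above can be reproduced in a few lines. This is a sketch of the arithmetic only, using the unrounded pq (about 0.1542) rather than the displayed 0.15; the variable names are mine.

```python
import math

n_items = 21
right = 17

p = right / n_items
q = 1 - p

score_information = n_items * p * q       # SUM(p*q) across the 21 cells
error_variance = 1 / score_information    # reciprocal of the information
se_measures = math.sqrt(error_variance)   # IRT SE, in measures
csem_counts = 1 / se_measures             # CTT CSEM, in counts

print(f"score information = {score_information:.2f}")
print(f"error variance    = {error_variance:.2f}")
print(f"SE (measures)     = {se_measures:.2f}")
print(f"CSEM (counts)     = {csem_counts:.2f}")
```

Note that 0.15 * 21 is 3.15; the 3.24 comes from carrying the extra decimal places of pq.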

The IRT (CSEM) in Chart 89 is really the IRT standard error of estimation (SE or SEE). On Table 45c, the CSEM (SQRT) is also the SE (conditional error of estimation) obtained from the square root of the error variance for that ability level (17 right, 1.73 measures, or 0.81 or 81%).

“Thus more information implies less error of measurement.”

See Table 45c, CSEM, green, and Table 46, col 9-10.

“In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range.”

Chart 92
Table 47
The same generality applies to the item information functions (IIFs) in Chart 92, but it is not very evident. The item with a difficulty of 10 (IIF = 1.80, Table 47) is also highly discriminating. The two easiest items had negative discrimination; they show an increase in information as student ability decreases toward zero measure. The generality applies best near the average test raw score of 50%, or zero measure, which is not on the chart (no student got a score of 50% on this test).

This test had an average test score of 80%. This has spread the item information function curves out (Chart 92). They are not centered on the raw score of 50% or the zero location in measures. However, each peaks near the point where item difficulty in measures is close to student ability in measures. This observation is critical in establishing the value of IRT item analysis and how it is used. This makes sense in measures (a scale based on the natural log of the ratio of right to wrong marks) but not in raw scores (normal linear scale) as I first posted in Chart 75 with only count and percent scales.
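The peak location can be illustrated with the textbook dichotomous Rasch form, where the probability of a right mark depends only on ability minus difficulty, both in measures. This is a minimal sketch with a hypothetical item difficulty; it is not the calculation behind Table 47 or Chart 92.

```python
import math


def rasch_p(ability, difficulty):
    """Probability of a right mark under the dichotomous Rasch model."""
    return 1 / (1 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Item information I = pq at a given ability."""
    p = rasch_p(ability, difficulty)
    return p * (1 - p)


difficulty = 1.5  # hypothetical item difficulty, in measures

# Scan abilities from -3.0 to 5.0 measures in steps of 0.1.
abilities = [x / 10 for x in range(-30, 51)]
peak_ability = max(abilities, key=lambda a: item_information(a, difficulty))

print(f"information peaks at ability {peak_ability} measures")
print(f"peak information = {item_information(peak_ability, difficulty):.2f}")
```

The curve peaks where ability equals difficulty, at p = q = 0.5, so the maximum information for one item in this form is 0.25.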

“Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range.”

This is very evident in Table 47 and Chart 92.

“Because of local independence, item information functions are additive.”

See Test SEM (in Measures), Winsteps Table 17.1 MODEL S.E. MEAN (identical) = 0.64 (Table 45c).

“Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.”

“Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification.”

At this point my audit tool, Table 45, falls silent. These two mathematical models are a means for only estimating theoretical values; they are not the theoretical values nor are they the reasoning behind them. CTT starts from observed values and projects into the general environment. IRT can start with the perfect Rasch model and select observations that fit the model. The two models are looking in opposite directions. CTT uses a linear scale with the origin at zero counts. IRT sets its log ratio point-of-origin (zero) at the 50% CTT point. I must accept the concept that CTT is a simplification of IRT on the basis of authority at this point.

“In the place of reliability, IRT offers the test information function which shows the degree of precision at different values of theta, [student ability].”

I would word this, “In ADDITION to reliability,” (Table 45a, CTT = 0.29 and 45c, IRT = 0.27). Also the “IRT offers the ITEM information function which shows the degree of precision at different values . . .”

“These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chose items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single “cutscore,” and where the actually passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cutscore. These items generally correspond to items whose difficulty is about the same as that of the cutscore.”

The eleven items in Table 47 and Chart 92 each peak near the point where item difficulty in measures is close to student ability in measures. The discovery or invention of this relationship is the key advantage of IRT over CTT.

These data show that a test item need not have the commonly recommended average score near 50% for usable results. Any cutscore from 50% to 80% would produce usable results on this test, which had an average score of 80% and a cutscore (passing) of 70%.
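The item selection described in the quote can be sketched as picking the items with the most information at the cutscore. The item bank here is hypothetical, and converting the 70% cutscore to a measure with the log ratio of right to wrong is a simplifying assumption; Winsteps would estimate the measure from the data.

```python
import math


def item_information(ability, difficulty):
    """Rasch item information I = pq at a given ability (in measures)."""
    p = 1 / (1 + math.exp(-(ability - difficulty)))
    return p * (1 - p)


def select_items(bank, cut_measure, n):
    """Return the n item difficulties with the most information at the cutscore."""
    ranked = sorted(bank, key=lambda b: item_information(cut_measure, b),
                    reverse=True)
    return ranked[:n]


# Hypothetical item bank: difficulties in measures.
bank = [-2.0, -1.2, -0.5, 0.0, 0.4, 0.9, 1.3, 2.1, 3.0]

# Assumption: a 70% cutscore expressed as the log ratio of right to wrong.
cut_measure = math.log(0.70 / 0.30)

chosen = select_items(bank, cut_measure, 3)
print(f"cutscore = {cut_measure:.2f} measures, chosen items = {chosen}")
```

The chosen items are simply those whose difficulty lies closest to the cutscore, which is the point of the quoted passage.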

"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."

My understanding is that with CTT an item may be 50% difficult for the class without revealing how difficult it is for each student (no location). With IRT every item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).
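In the textbook dichotomous Rasch form this can be written out directly. The symbols are my notation, not taken from the audit tool: theta is student ability and b is item difficulty, both in measures.

```latex
P(\text{right} \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}},
\qquad
\theta = b \;\Rightarrow\; P = \frac{e^{0}}{1 + e^{0}} = \frac{1}{2}
```

The same form places the zero of the measure scale at the 50% point, since ln(0.5/0.5) = 0.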

I do not know what part of IRT is invention and what part is discovery on the part of some ingenious people. Two basic parts had to be fit together: information and measures, by way of an inversion. Then a story had to be created to market the finished product; the Rasch model and Winsteps (full and partial credit) are the limit of my experience. The unfortunate choice of the name “partial credit,” rather than knowledge or skill and judgment, may have been a factor in the Rasch partial credit model not becoming popular. The name, partial credit, falls into the realm of psychometrician tools. The name, Knowledge and Judgment, falls into the realm of classroom tools needed to guide the development of scholars as well as to obtain maximum information from paper standardized tests. With these tools students individually customize their tests (accurately, honestly, and fairly); with CAT the test is tailored to fit the student using best-guess, dated, and questionable second-hand information.

IRT makes CAT possible. Please see "Adaptive Testing Evolves to Assess Common-Core Skills" for current marketing, use, and a list of comments, including two of mine. The exaggerated claims of test makers to assess and promote developing students by the continued use of forced-choice, lower-level-of-thinking tests continue to be ignored in the marketing of these tests to assess Common Core skills. Increased precision of nonsense still takes precedence over an assessment that is compatible with and supports the classroom and scholarship.

Serious mastery: Knowledge Factor.
Student development: Knowledge and Judgment Scoring (Free Power Up Plus) and IRT Rasch partial credit (Free Ministep).
Ranking: Forced-choice on paper or CAT.