Wednesday, April 8, 2015

CTT and Rasch IRT Item Analysis Paradox

[The solution is in Chart 89, Item Analysis flow sheet.]

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision” appears in “Reliability and separation of measures.” A more complete discussion is given under the title, “Standard Errors and Reliabilities: Rasch and Raw Score”.

Chart 82
The apparent paradox is graphed in Chart 82. Precision on one scale is the inverse or reciprocal of the other: 1/0.44 = 2.27 and 1/2.27 = 0.44.

Table 45
I edited Table 32 to disclose a full development of a comparison between CTT and IRT using real classroom data (Table 45). This first view is too complicated.
Chart 83
Chart 83 (CTT) and Chart 84 (IRT) summarize the statistics behind Table 45.

Chart 84
Table 45 includes the process of combining student scores and item difficulties onto one logit scale.

Table 46
I then isolated the item analysis from the complete development above by skipping the formation of a single scale from real classroom data. Instead, I feed the IRT item analysis a percent (dummy) data set (Table 46) with the same number of items as in the classroom test (21 items). I then graphed the data strings in Table 46 as a second, simpler, view of IRT item analysis.

Chart 85
Turning right counts (Chart 85, blue) into a right/wrong ratio string (red) yields a very different shape than the straight-line count of right marks. We now have the rate at which each mark completes a perfect score of 21 or 100%. It starts slow (1/20), with the last mark racing at 20 times (20/1) the average rate (10/11 or 11/10, near 1, in Table 46, col 2).

Taking the natural log of the ratio (a logit, Table 46, col 3) creates the Rasch model IRT characteristic curve (Chart 85, purple) with the zero logit point of origin positioned at the 50% normal value. [Ratios and log ratios have no dimensions.]
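A minimal Python sketch of the two steps above (count to ratio, ratio to logit), assuming a 21-item test and right counts from 1 to 20. The function names are illustrative only and are not taken from Table 46 or Winsteps. Scores of 0 and 21 are left out because their ratios are zero or undefined.

```python
import math

N_ITEMS = 21  # items on the test, as in Table 46


def right_wrong_ratio(right, n_items=N_ITEMS):
    """Right/wrong ratio for a raw count of right marks."""
    wrong = n_items - right
    return right / wrong


def logit(right, n_items=N_ITEMS):
    """Natural log of the right/wrong ratio."""
    return math.log(right_wrong_ratio(right, n_items))


if __name__ == "__main__":
    for right in (1, 10, 11, 20):
        print(right, round(right_wrong_ratio(right), 3), round(logit(right), 3))
```

Counts of 10 and 11 sit on either side of the 50% point, so their logits are equal in size and opposite in sign.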

Chart 86
Winsteps, at this point, has reduced student raw scores and item difficulties (in counts) into one logit scale of student ability and item difficulty with the dimension of a measure. These are then combined into the probability of a right answer to start the item analysis. The percent (dummy) input (Table 46, col 6) replaces this operation (Chart 86). This simplifies the current discussion to just item analysis and precision.
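For readers who want to see what "combined into the probability of a right answer" means, here is a sketch of the standard dichotomous Rasch model formula. It is not Winsteps code; the function name and the example ability value are assumptions made for illustration.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a right answer when ability and difficulty are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative value only: the logit of a 17-right, 4-wrong score.
ABILITY_17 = math.log(17 / 4)

if __name__ == "__main__":
    # Matching ability and difficulty gives 0.5.
    print(rasch_probability(0.0, 0.0))
    # A student at ln(17/4) on an item at zero logits.
    print(round(rasch_probability(ABILITY_17, 0.0), 2))
```

The percent (dummy) input skips this calculation by supplying the probabilities directly.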

Chart 87
Percent input and Information for one central cell are plotted in Chart 87. Cell information is limited to a maximum of 0.25 at a student raw score of 50% (Table 46, col 7), when combining p*q (0.50 * 0.50 = 0.25 ). The next step is to adjust the cell information for 21 items on the test (Column 8).

Chart 88
Chart 88 completes the comparison of CTT and IRT calculations on Table 46. The inversion of Information (col 9) yields the error variance that aligns with student score measures such that the greatest precision (smallest error variance) is at the point of origin of the logit scale. The square root of the error variance (col 10) yields the CSEM equivalent for IRT measures. And then, by a second inversion these measure values are transformed into the identical normal CSEM values (col 11 - 12) for a CTT item analysis. The total view in Table 45 was too complicated. Charts 85 – 88 are also.

Chart 89
My third, simple, and last view is a flowchart (Chart 89) constructed from the above charts and tables.

The percent (dummy) data produce identical (1.80) standard error of measurement (CSEM) results with CTT and IRT item analysis (Table 46, col 11 - 12 and Chart 89) even though CTT starts with a raw score count (17), and skips the score mean (0.81), and the IRT item analysis starts with a score mean (0.81).
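The two routes to 1.80 can be traced for a score of 17 out of 21. This is a sketch under the assumption that the score variance and the cell information are both p * q (the 0.15 figure); the function names are mine, not the chart's.

```python
import math


def ctt_csem(right, n_items):
    """CTT route: raw count -> score variance -> CSEM in counts."""
    p = right / n_items
    variance = p * (1.0 - p)              # about 0.15
    return math.sqrt(n_items * variance)  # about 1.80


def irt_chain(p, n_items):
    """IRT route: score mean -> information -> error variance -> SEM -> CSEM."""
    information = n_items * p * (1.0 - p)   # about 3.24
    error_variance = 1.0 / information      # about 0.31
    sem_logits = math.sqrt(error_variance)  # about 0.56, on the logit scale
    csem_counts = 1.0 / sem_logits          # second inversion, about 1.80
    return information, error_variance, sem_logits, csem_counts


if __name__ == "__main__":
    print(round(ctt_csem(17, 21), 2))
    print([round(value, 2) for value in irt_chain(17 / 21, 21)])
```

Both routes reduce to the square root of n * p * q, which is why the results agree.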

CTT captures the variation (in marks) within a student score in the variance (0.15); IRT captures the variation (in probabilities) as information (0.15). In all cases the score variance and score information are treated with the square root (SQRT, pink) to yield standard errors (estimates of precision): CTT CSEM, on a normal scale in counts, and IRT (CSEM), on a logit scale in measures.

In summary, as CTT score variance and IRT score information (red) increase, CSEM increases on a normal scale (Chart 89). Precision decreases. At the same time IRT error variance (green) and IRT (CSEM) decrease on a logit scale. Precision increases with respect to the Rasch model point of origin zero (50% on a normal scale). This inversion aligns the IRT (CSEM) to student scores in measures on a logit scale.
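The opposite movement of the two standard errors can be seen by sweeping across raw scores. This sketch assumes 21 items and the same n * p * q information used above; `precision_pair` is a hypothetical helper name.

```python
import math

N_ITEMS = 21  # items on the test


def precision_pair(right, n_items=N_ITEMS):
    """Return (CSEM in counts, SEM in logits) for a raw score."""
    p = right / n_items
    information = n_items * p * (1.0 - p)
    csem_counts = math.sqrt(information)       # normal scale
    sem_logits = 1.0 / math.sqrt(information)  # logit scale
    return csem_counts, sem_logits


if __name__ == "__main__":
    for right in (1, 5, 10, 17, 20):
        csem, sem = precision_pair(right)
        print(right, round(csem, 2), round(sem, 2))
```

Near the middle of the test the count-scale error is largest and the logit-scale error is smallest; toward the extremes the pattern reverses, and the two values remain reciprocals of each other at every score.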

It appears that the meaning of this depends upon what is being measured and how well it is being measured. CTT measures in counts and sets error (based on the score variance, Chart 89, red) about the student score count on a normal scale (CSEM). IRT converts counts to “measures”. IRT then measures in “measures” and sets error (based on the error variance, Chart 89, green) about the point of origin (zero) on a logit scale that corresponds to 50% on a normal scale.

Chart 90
The two methods of feeding an item analysis are using two different reference points. This was easier to see when I took the core out of Chart 88 and plotted it in a more common form in Chart 90. Precision on both scales is shown in solid black. This line intersects the Rasch model IRT characteristic curve where normal is 50% and IRT is zero. At a count of 17 right, the normal scale shows higher precision; the logit scale shows lower precision with respect to the perfect Rasch model.

The characteristic curve is a collection of points where student ability and item difficulty match, so that students with a given ability get 50% right answers on items of matching difficulty. This situation exists for CTT only at the average test score (mean).

[The slope of the test characteristic curve is given as the inverse of the raw score error variance (3.24, red, Chart 88 - 89, and Table 46).]

Chart 91
Chart 91 applies the above thinking to real classroom data (Table 45c). This time the average score was not at 50% but at 81%. The lowest student score on Table 45c was 12 (57%).

In a lost reference, I have read that at the 50% point students do not know anything; it is all chance. I can see that for true-false. That could put CTT and IRT in conflict. A student must know something to earn a score of 50% when there are four options to each item. There is a free 25%. The student must supply the remaining 25%. Also, few CTT tests are filled with items that have maximum discrimination and precision. A high quality CTT test can look very much like a high quality IRT test. The difference is that the IRT test item analysis takes more into the calculations than the CTT test when offered as forced-choice (a cheap way to rank students) or as with knowledge and judgment scoring (where students report what they actually know and find meaningful and useful; the basis for effective teaching).

Historically, test reliability was the chief marketing point of standardized tests. In the past decade the precision of individual student scores has replaced test reliability. IRT (CSEM) provides a more marketable product along with promoting the sale of equipment and related CAT services. Again, psychometricians on the back end continue to support and lend credibility to the claims from the sales office on the front end.