[The solution is in Chart 89, Item Analysis flow sheet.]
"An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision" in "Reliability and separation of measures." A more complete discussion is given under the title "Standard Errors and Reliabilities: Rasch and Raw Score."
Chart 82 |
Table 45 |
Chart 83 |
Chart 83 (CTT) and Chart 84 (IRT) summarize the statistics behind Table 45.
Chart 84 |
Table 45 includes the process of combining student
scores and item difficulties onto one logit scale.
Table 46 |
I then isolated the item analysis from the complete development above by skipping the formation of a single scale from real classroom data. Instead, I fed the IRT item analysis a percent (dummy) data set (Table 46) with the same number of items as the classroom test (21 items). I then graphed the data strings in Table 46 as a second, simpler view of IRT item analysis.
Chart 85 |
Turning right counts (Chart 85, blue) into a right/wrong ratio string (red) yields a very different shape than a straight-line count of right marks. We now have the rate at which each mark completes a perfect score of 21 (100%). It starts slowly (1/20), and the last mark races at 20 times (20/1) the average rate (10/11 or 11/10, near 1; Table 46, col 2). Taking the natural log of the ratio (a logit, Table 46, col 3) creates the Rasch model IRT characteristic curve (Chart 85, purple), with the zero-logit point of origin positioned at the 50% normal value. [Ratios and log ratios have no dimensions.]
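For readers who want to reproduce the ratio and logit strings, a minimal sketch (assuming only the right/wrong ratio and natural log described above; this is not the Winsteps calculation itself) looks like this:

```python
import math

N_ITEMS = 21  # items on the test

# Right counts 1 through 20; the extreme scores 0 and 21 have no finite logit.
for right in range(1, N_ITEMS):
    wrong = N_ITEMS - right
    ratio = right / wrong       # right/wrong ratio (Table 46, col 2)
    logit = math.log(ratio)     # natural log of the ratio (Table 46, col 3)
    print(f"{right:2d} right: ratio {ratio:5.2f}, logit {logit:5.2f}")
```

The first count gives a ratio of 1/20 and the last 20/1, with the curve passing through zero logits at the 50% point.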
Chart 86 |
Winsteps, at this point, has reduced student raw scores and
item difficulties (in counts) into
one logit scale of student ability and item difficulty with the dimension of a measure. These are then combined into
the probability of a right answer to start the item analysis. The percent
(dummy) input (Table 46, col 6) replaces this operation (Chart 86). This
simplifies the current discussion to just item analysis and precision.
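The operation the dummy input stands in for can be sketched directly from the Rasch model: the probability of a right answer depends only on the difference between student ability and item difficulty on the logit scale. The ability and difficulty values below are made-up illustrations, not values from Table 45.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch model: probability of a right answer from student ability
    and item difficulty, both expressed in logits (measures)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability matches difficulty the probability is 0.50 -- the zero-logit
# point of origin used throughout these charts.
print(rasch_probability(0.0, 0.0))    # 0.5
print(rasch_probability(1.45, 0.0))   # about 0.81
```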
Chart 87 |
Percent input and Information for one central cell are plotted in Chart 87. Cell information is limited to a maximum of 0.25 at a student raw score of 50% (Table 46, col 7), where p*q = 0.50 * 0.50 = 0.25. The next step is to adjust the cell information for the 21 items on the test (col 8).
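A minimal sketch of columns 7 and 8, assuming cell information is the product p*q and the test value simply scales one cell by the 21 items:

```python
N_ITEMS = 21

def cell_information(p: float) -> float:
    """Information for one cell: p * q, at most 0.25 when p = 0.50."""
    return p * (1.0 - p)

def test_information(p: float, n_items: int = N_ITEMS) -> float:
    """Cell information adjusted for the number of items on the test."""
    return n_items * cell_information(p)

print(cell_information(0.50))   # 0.25, the maximum (Table 46, col 7)
print(test_information(0.50))   # 5.25 for 21 items (col 8)
```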
Chart 88 |
Chart 88 completes the comparison of CTT and IRT calculations on Table 46. Inverting the information (col 9) yields an error variance that aligns with student score measures, so that the greatest precision (smallest error variance) falls at the point of origin of the logit scale. The square root of the error variance (col 10) yields the CSEM equivalent for IRT measures. A second inversion then transforms these measure values into the identical normal CSEM values (cols 11 - 12) of a CTT item analysis. The total view in Table 45 was too complicated, and so are Charts 85 - 88.
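A minimal sketch of columns 9 through 12, assuming only the three steps just named (invert the test information to get the error variance, take its square root for the CSEM in measures, and invert once more for the normal-scale CSEM):

```python
import math

N_ITEMS = 21

def irt_precision(p: float, n_items: int = N_ITEMS):
    """Follow Table 46, cols 8 - 12, for one score mean p."""
    info = n_items * p * (1.0 - p)           # col 8: test information
    error_variance = 1.0 / info              # col 9: inversion of information
    logit_csem = math.sqrt(error_variance)   # col 10: CSEM in measures
    normal_csem = 1.0 / logit_csem           # cols 11-12: second inversion
    return info, error_variance, logit_csem, normal_csem

# Greatest precision in measures sits at the 50% point (zero logits);
# a score mean of 0.81 returns the 1.80 normal CSEM discussed below.
print(irt_precision(0.50))
print(irt_precision(0.81))
```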
Chart 89 |
My third, simple, and last view is a flowchart (Chart 89)
constructed from the above charts and tables.
The percent (dummy) data produce identical (1.80) conditional standard error of measurement (CSEM) results from the CTT and IRT item analyses (Table 46, cols 11 - 12, and Chart 89), even though CTT starts with a raw score count (17) and skips the score mean (0.81), while the IRT item analysis starts with the score mean (0.81).
CTT captures the variation (in marks) within a student score as the variance (0.15); IRT captures the variation (in probabilities) as information (0.15). In both cases the score variance and the score information are treated with the square root (SQRT, pink) to yield standard errors, the estimates of precision: CTT CSEM on a normal scale in counts, and IRT (CSEM) on a logit scale in measures.
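The 1.80 agreement can be checked with the numbers quoted above (17 right out of 21 items, score mean 0.81), assuming the CTT CSEM takes the binomial form sqrt(n * p * q):

```python
import math

N_ITEMS = 21
RIGHT = 17                       # raw score count: the CTT starting point
p = RIGHT / N_ITEMS              # score mean, about 0.81: the IRT starting point
q = 1.0 - p

# CTT path: score variance (about 0.15) -> CSEM in counts on the normal scale.
ctt_variance = p * q
ctt_csem = math.sqrt(N_ITEMS * ctt_variance)

# IRT path: information -> error variance -> CSEM in measures -> second inversion.
information = N_ITEMS * p * q                    # about 3.24 for the whole test
irt_error_variance = 1.0 / information
irt_logit_csem = math.sqrt(irt_error_variance)   # about 0.56 in measures
irt_normal_csem = 1.0 / irt_logit_csem           # back to the normal scale

print(round(ctt_csem, 2), round(irt_normal_csem, 2))  # 1.8 1.8
```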
In summary, as CTT score variance and IRT score information (red) increase, the CSEM increases on a normal scale (Chart 89): precision decreases. At the same time the IRT error variance (green) and the IRT (CSEM) decrease on a logit scale: precision increases with respect to the Rasch model point of origin of zero (50% on a normal scale). This inversion aligns the IRT (CSEM) with student scores in measures on a logit scale.
It appears that the meaning of this depends upon what is
being measured and how well it is being measured. CTT measures in counts and
sets error (based on the score variance, Chart 89, red) about the student score
count on a normal scale (CSEM). IRT converts counts to “measures”. IRT then
measures in “measures” and sets error (based on the error variance, Chart 89,
green) about the point of origin (zero) on a logit scale that corresponds to
50% on a normal scale.
Chart 90 |
The two methods of feeding an item analysis use two different reference points. This was easier to see when I took the core out of Chart 88 and plotted it in a more common form in Chart 90. Precision on both scales is shown in solid black. This line intersects the Rasch model IRT characteristic curve where the normal scale is at 50% and the IRT scale is at zero. At a count of 17 right, the normal scale shows higher precision; the logit scale shows lower precision with respect to the perfect Rasch model.
The characteristic curve is a collection of points where student ability and item difficulty match; students at a given ability get 50% right answers on items of matching difficulty. For CTT this situation exists only at the average test score (the mean).
[The slope of the
test characteristic curve is given as the inverse of the raw score error
variance (3.24, red, Chart 88 - 89, and Table 46).]
Chart 91 |
Chart 91 applies the above thinking to real classroom data (Table 45c). This time the average score was not at 50% but at 81%. The lowest student score in Table 45c was 12 (57%).
In a reference I have since lost, I read that at the 50% point students do not know anything; it is all chance. I can see that for true-false. That could put CTT and IRT in conflict. A student must know something to earn a score of 50% when each item has four options: 25% comes free, and the student must supply the remaining 25%. Also, few CTT tests are filled with items that have maximum discrimination and precision. A high quality CTT test can look very much like a high quality IRT test. The difference is that the IRT item analysis takes more into the calculations than the CTT test, whether offered as forced-choice (a cheap way to rank students) or with knowledge and judgment scoring (where students report what they actually know and find meaningful and useful; the basis for effective teaching).
Historically, test reliability was the chief marketing point
of standardized tests. In the past decade the precision of individual student
scores has replaced test reliability. IRT (CSEM) provides a more marketable
product along with promoting the sale of equipment and related CAT services.
Again, psychometricians on the backside continue to support and lend credibility to the claims from the sales office on the front end.