Wednesday, August 13, 2014

Test Score Reliability - TMC and IRT

The main purpose of this post is to investigate the similarities between traditional multiple-choice (TMC), or classical test theory (CTT), and item response theory (IRT). The discussion is based on TMC and IRT because the math is simpler than with knowledge and judgment scoring (KJS) and the IRT partial credit model (PCM). The difference is that TMC and IRT input marks at the lowest levels of thinking, resulting in a traditional ranking. KJS and PCM input the same marks at all levels of thinking, resulting in a ranking plus a quality indication of what a student actually knows and understands, which is of value to that student (and teacher) in further instruction and learning.

I applied the instructions in the Winsteps Manual (page 576) for checking the Winsteps reliability estimate computation to the Nursing124 data used in the past several posts (22 students and 21 items). Table 32 is a busy table that will be discussed over the next several posts. The two estimates of test reliability (0.29 and 0.28, orange) are essentially identical for TMC and IRT, allowing for rounding error.

Table 32a shows the TMC test reliability estimated from the ratio of true variance to total variance. The total variance between scores, 4.08, minus the error variance within items, 2.95, yields the true variance, 1.13. The KR20 then completes the reliability calculation, applying the adjustment k/(k − 1) for the 21 items, to yield 0.29 using normal (raw score) values.
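For concreteness, here is a minimal Python sketch of this calculation: the KR20 from a tiny made-up mark matrix (not the Nursing124 data), with statistics.pvariance playing the role of Excel's VAR.P.

```python
from statistics import pvariance

def kr20(marks):
    """KR20 reliability from a student-by-item matrix of 1/0 marks."""
    k = len(marks[0])                                  # number of items
    scores = [sum(row) for row in marks]               # student raw scores
    total_var = pvariance(scores)                      # total variance between scores
    error_var = sum(                                   # error variance: sum of within-item p*q
        pvariance([row[i] for row in marks]) for i in range(k)
    )
    true_var = total_var - error_var                   # true variance
    return (k / (k - 1)) * true_var / total_var        # KR20

# Tiny made-up example: 4 students x 3 items.
marks = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
print(kr20(marks))                                     # 0.75 for this example
```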

For an IRT estimate of test reliability, the values on the normal scale are converted to the logit scale, ln(wrong/right). In this case, the mean of the item difficulty logits was -1.62 (Table 32b). This value is subtracted from each item difficulty logit to shift the mean of the item distribution to the zero logit point (Rasch Adjust, Table 32b). Winsteps then optimizes the fit of the data (blue) to the perfect Rasch model. Comparable student ability and item difficulty values are now in register at the same locations on a single logit scale. The 50% point on the normal scale now sits at the zero location for both student ability and item difficulty.
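A minimal sketch of the conversion and the Rasch Adjust, using made-up item p-values rather than the Nursing124 data:

```python
from math import log

def difficulty_logit(p_right):
    """Item difficulty as ln(wrong/right) from the proportion marked right."""
    return log((1 - p_right) / p_right)

p_values = [0.9, 0.75, 0.6, 0.5]                 # made-up item p-values
logits = [difficulty_logit(p) for p in p_values]
mean_logit = sum(logits) / len(logits)
adjusted = [d - mean_logit for d in logits]      # Rasch Adjust: item mean shifts to 0
print(adjusted)
```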

The probability for each right mark (expected score) in the central cells is the product of the respective marginal cells (blue) for item difficulty (Winsteps Table 13.1) and student ability (Winsteps Table 17.1). The sum of these probabilities (Table 32b, pink) is identical to the normal Score Mean (Table 32a, pink).
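Assuming the marginal cells hold the Rasch odds terms, exp(b) for student ability and exp(-d) for item difficulty, their product converts to the standard dichotomous Rasch probability for each central cell. A minimal sketch with made-up measures:

```python
from math import exp

def p_right(ability, difficulty):
    """Rasch probability of a right mark for one student-item cell (logits)."""
    odds = exp(ability) * exp(-difficulty)       # product of the marginal odds terms
    return odds / (1 + odds)                     # convert odds to a probability

abilities = [-0.5, 0.2, 1.1]                     # made-up student measures
difficulties = [-0.8, 0.0, 0.9]                  # made-up item measures
cells = [[p_right(b, d) for d in difficulties] for b in abilities]
expected_scores = [sum(row) for row in cells]    # row sums = expected scores
print(expected_scores)
```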

The “information” in each central cell in Table 32c was obtained as p * q, or p * (1 - p), from Table 32b. Adding up the central cells in each row yields the sum of information for that student’s score.

The next column shows the square root of the sum of information. Inverting this value yields the conditional standard error of measurement (CSEM). The conditional variance (CVar) within each student ability measure is then obtained by reversing the relationship used for normal values in Table 32a: the CVar is the square of the CSEM, rather than the CSEM being the square root of the CVar. The average of these values is the test model error variance (EV) in measures: 0.43.
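A minimal sketch tracing these two steps, from cell probabilities to information, CSEM, conditional variance, and the model error variance, with made-up probabilities standing in for Table 32b:

```python
from math import sqrt

# Made-up rows of Rasch cell probabilities, one row per student.
prob_rows = [[0.8, 0.6, 0.4], [0.9, 0.7, 0.5], [0.6, 0.5, 0.3]]

cvars = []
for row in prob_rows:
    info = sum(p * (1 - p) for p in row)   # information: sum of p*q across the row
    csem = 1 / sqrt(info)                  # CSEM: inverted square root of information
    cvars.append(csem ** 2)                # CVar: square of the CSEM
ev = sum(cvars) / len(cvars)               # model error variance: average of the CVars
print(ev)
```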

The observed variance (OV) between measures is estimated in exactly the same way as for normal scores: as the variance between measures from Excel’s =VAR.P (0.61), or as the square of the SD: 0.78 squared = 0.61.

The test reliability in measures, (OV - EV)/OV = (0.61 - 0.45)/0.61 ≈ 0.28 (allowing for rounding in the displayed values), is obtained from the same equation used for normal values: (total variance - error variance)/total variance = (4.08 - 2.95)/4.08 ≈ 0.28, which the KR20 adjustment raises to 0.29 in Table 32a. Normal and measure dimensions for the same value differ, but ratios do not, as a ratio has no dimension. TMC and IRT produced the same value for test reliability. As will KJS and the PCM.
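A minimal sketch of this final step on the measures scale, with made-up student measures and a made-up error variance standing in for the Winsteps output:

```python
from statistics import pvariance

def reliability(total_var, error_var):
    """(total - error) / total: the dimensionless ratio used on both scales."""
    return (total_var - error_var) / total_var

measures = [-0.9, -0.2, 0.1, 0.4, 1.2]     # made-up student measures
ov = pvariance(measures)                   # observed variance, as Excel =VAR.P
ev = 0.2                                   # made-up model error variance
print(reliability(ov, ev))
```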

(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several points of view. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.


Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring). It is a quick way to update your multiple-choice testing to meet the Common Core State Standards (promoting understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.