
The main purpose of this post is to investigate the
similarities between traditional multiple-choice (TMC), or classical test
theory (CTT), and item response theory (IRT). The discussion is based on TMC
and IRT because the math is simpler than with knowledge and judgment scoring (KJS) and
the IRT partial
credit model (PCM). The difference is that TMC and IRT input marks at the
lowest levels of thinking, resulting in a traditional ranking. KJS and the PCM
input the same marks at all levels of thinking, resulting in a ranking plus a
quality indication of what a student actually knows and understands that is of
value to that student (and teacher) in further instruction and learning.

I applied the instructions in the Winsteps Manual (page 576)
for checking the Winsteps reliability estimate computation to the
Nursing124 data used in the past several posts (22 students and 21 items). Table
32 is a busy table that is discussed over the next several posts. The two
estimates for test reliability (0.29 and 0.28, orange) are identical for
TMC and IRT, allowing for rounding errors.

Table 32a shows the TMC test reliability estimated from the
**ratio** of true variance to total variance. The total variance **between scores**, 4.08, minus the error variance **within items**, 2.95, yields the true variance, 1.13. The KR20 then completes the reliability calculation to yield 0.29 using normal values.
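The KR20 step above can be sketched in a few lines; the variable names are mine, and the totals are the ones quoted from Table 32a:

```python
# A minimal sketch of the KR20 calculation described above, using the
# totals quoted from Table 32a. Variable names are illustrative.
k = 21                    # number of items
total_var = 4.08          # total variance between student scores
error_var = 2.95          # error variance within items (summed p*q)
true_var = total_var - error_var               # 1.13
kr20 = (k / (k - 1)) * (true_var / total_var)  # KR20 small-sample factor
print(round(kr20, 2))     # → 0.29
```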
For an IRT estimate of test reliability, the values on a
normal scale are converted to the logit scale (ln ratio w/r). In this case, the
sum of the item difficulty logits, ln ratio w/r, was -1.62 (Table 32b). This value
is subtracted from each item difficulty logit to shift the mean of the
item distribution to the zero logit point (Rasch Adjust, Table 32b). Winsteps
then optimizes the fit of the data (blue) to the perfect Rasch model. Now comparable
**student ability** and **item difficulty** values are in register at the same locations on a single logit scale. The 50% point on the normal scale is now at the zero location for both student ability and item difficulty.
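The conversion and centering steps can be sketched as follows. The right/wrong counts here are made up for illustration (not the Nursing124 data), and the Rasch Adjust is shown as subtracting the mean item logit, which shifts the mean of the item distribution to zero:

```python
import math

# Hypothetical right counts for a 22-student class; illustrative only.
right = [18, 15, 12, 8, 5]
wrong = [22 - r for r in right]
# Item difficulty on the logit scale: ln(wrong / right).
difficulty = [math.log(w / r) for w, r in zip(wrong, right)]
# Rasch Adjust: subtract the mean so the item distribution centers on zero.
mean_d = sum(difficulty) / len(difficulty)
adjusted = [d - mean_d for d in difficulty]
```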
The probability for each right mark (expected score) in the
central cells is the product of the respective marginal cells (blue) for item
difficulty (Winsteps Table 13.1) and student ability (Winsteps Table 17.1). The sum of these
probabilities (Table 32b, pink) is identical to the normal Score Mean (Table 32a,
pink).
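For reference, the dichotomous Rasch model that Winsteps fits defines each central-cell probability as a logistic function of the student ability minus the item difficulty; the measures below are illustrative, not the Nursing124 values:

```python
import math

# Dichotomous Rasch model: the probability of a right mark in a cell is
# exp(b - d) / (1 + exp(b - d)), with b = ability and d = difficulty (logits).
def rasch_p(ability, difficulty):
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

abilities = [-1.0, 0.0, 1.0]          # illustrative student measures
difficulties = [-0.5, 0.5]            # illustrative item measures
cells = [[rasch_p(b, d) for d in difficulties] for b in abilities]
expected_scores = [sum(row) for row in cells]  # row sums = expected raw scores
```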

The "information" in each central cell, in Table 32c, was obtained as p * q, or p * (1 - p), from Table 32b. Adding up the internal cells for each score yields the sum
of information for that score.
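As a quick sketch (the cell probabilities here are made up, not taken from Table 32b):

```python
# Information in each central cell is p * (1 - p); summing a student's row
# gives the sum of information for that score. Probabilities are illustrative.
p_row = [0.9, 0.7, 0.5, 0.3]
info_cells = [p * (1 - p) for p in p_row]
row_information = sum(info_cells)      # sum of information for this score
```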

The next column shows the square root of the sum of information. Inverting this value yields the conditional standard error of measurement (CSEM). The conditional
variance (CVar)
**within** each student ability measure is then obtained by reversing the equation for normal values in Table 32a: the CVar is obtained as the square of the CSEM instead of the CSEM being obtained as the square root of the CVar. The average of these values is the test model error variance (EV) in measures: 0.43.
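That chain (information → CSEM → CVar → EV) can be sketched as follows, with made-up per-student information sums:

```python
import math

# CSEM is the inverse square root of the summed information, so the
# conditional variance is simply 1 / information. Sums are illustrative.
info_sums = [2.1, 2.4, 1.8]                    # per-student information sums
csems = [1 / math.sqrt(s) for s in info_sums]  # conditional SEMs
cvars = [c ** 2 for c in csems]                # CVar = CSEM squared
ev = sum(cvars) / len(cvars)                   # model error variance (EV)
```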
The observed variance (OV)
**between** measures is estimated in exactly the same way as for normal scores: the variance between measures from Excel =VAR.P (0.61), or the square of the SD: 0.78 squared = 0.61.
The test reliability in measures, (OV - EV)/OV = (0.61 - 0.45)/0.61 = 0.28, is then obtained
from the same equation as for normal values: (total variance - error
variance)/total variance = (4.08 - 2.96)/4.08 = 0.29, in Table 32a. Normal and
measure dimensions for the same value differ, but ratios do not, as a ratio has
no dimension.
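Both reliability ratios follow the same one-line formula. A sketch with the quoted values (the published figures are rounded, so the last digit may wobble slightly when recomputed):

```python
# Reliability = (observed variance - error variance) / observed variance,
# identical in form on the normal and logit (measure) scales.
def reliability(observed_var, error_var):
    return (observed_var - error_var) / observed_var

ov = 0.78 ** 2            # observed variance between measures (≈ 0.61)
ev = 0.45                 # model error variance used in the text
r_measures = reliability(ov, ev)
r_normal = reliability(4.08, 2.95)   # normal-scale values from Table 32a
```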

**TMC and IRT produced the same values for test reliability, as will KJS and the PCM.**

**(Continued)**

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.

**The Best of the Blog - FREE**

The Visual Education Statistics Engine (VESEngine) presents
the common education statistics on one traditional two-dimensional Excel
spreadsheet. The post includes definitions. Download
as .xlsm or .xls.

This blog started five years ago. It has meandered through
several views. The current project is visualizing
the VESEngine in three dimensions. The observed student mark patterns (on their
answer sheets) are on one level. The variation in the mark patterns (variance)
is on the second level.

Power Up Plus (PUP) is classroom-friendly software used to
score and analyze what students guess (traditional multiple-choice) and what
they report as the basis for further learning and instruction (knowledge and
judgment scoring multiple-choice). This is a quick way to update your
multiple-choice testing to meet Common Core State Standards (promoting understanding as
well as rote memory). Knowledge and judgment scoring originated as a classroom
project, started in 1980, that converted passive pupils into self-correcting,
highly successful achievers in two to nine months. Download as .xlsm or .xls.