#7

The ability of an item to place students into two distinct
groups is not a part of the mathematical model developed in the past few posts.
Discrimination ability, however, provides insight into how the model works. A
practical standardized test must have student scores spread out enough to
assign desired rankings. Discriminating items produce this spread of student
scores.

Current CCSS multiple-choice standardized test scoring only ranks; it does
not tell us what a student actually knows that is useful and meaningful to the
student as the basis for further learning and effective instruction. That can
be done with Knowledge and Judgment Scoring
and the partial credit Rasch IRT model, using the very same tests. This post
uses traditional scoring because it simplifies the analysis (and the model) to
just right and wrong; no judgment or higher levels of thinking are required of
students.

I created a simple data set of 12 students and 11 items
(Table 26) with an average score of 5. I then modified this set to produce
average scores of 6, 7, and 8 (Table 27). [This can also be considered as the
same test given to students in grades 5, 6, 7, and 8.]

The item error mean sum of squares (MSS, the variance) for the
test with an average score of 8 was 1.83. I then adjusted the MSS for the other
three grades to match this value. Each adjustment exchanged a right and a wrong
mark within a student mark pattern (row) (Table 27). I stopped with 1.85,
1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This held the KR20 = 0.495
and SEM = 1.36 the same for all four sets.)
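
For readers who want to see how KR20 and SEM relate to the mark patterns, here
is a minimal Python sketch. The data set is a made-up 4 by 3 example, not
Table 26 or Table 27, and the function name is hypothetical. It assumes the
population variance (divide by N) for the student scores; a spreadsheet that
divides by N - 1 will give slightly different values.

```python
from math import sqrt

def kr20_and_sem(marks):
    """marks: list of student rows, each a list of 0/1 item marks."""
    n_students = len(marks)
    n_items = len(marks[0])
    scores = [sum(row) for row in marks]
    mean_score = sum(scores) / n_students
    # Assumption: population variance (divide by N) of the student scores.
    score_var = sum((s - mean_score) ** 2 for s in scores) / n_students
    # Sum of item variances p*q, where p is the portion of right marks.
    item_var_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in marks) / n_students
        item_var_sum += p * (1 - p)
    kr20 = (n_items / (n_items - 1)) * (1 - item_var_sum / score_var)
    # SEM = SD * sqrt(1 - reliability)
    sem = sqrt(score_var) * sqrt(1 - kr20)
    return kr20, sem

# Hypothetical marks: 4 students (rows) by 3 items (columns).
demo_marks = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
kr20, sem = kr20_and_sem(demo_marks)
print(f"KR20 = {kr20:.3f}, SEM = {sem:.3f}")
```

Exchanging a right and a wrong mark within one row leaves that student's score
(and so the score SD) unchanged but shifts the item variances, which is why
that kind of exchange can be used to tune the item error MSS.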

The average item difficulty (Table 27) varied, as expected,
with the average test score. The average item discrimination (Pearson r and PBR)
(Table 28) was stable. In general, with a few outliers in this small data set,
the most discriminating items had the same difficulty as the average test
score. [This tendency for item discrimination to be maximized at the
average test score is a basic component of the Rasch IRT model, which by design
must use the 50% point.]

The scatter chart (Chart 71) has sufficient detail to show that
items tend to be most discriminating when they have a difficulty near the
average test score (not just near 50%).

The question is often asked, “Do tests have to be designed
for an average score of 50%?” If
the SD remains the same, I found no difference in the KR20 or SEM. [The
observed SD is ignored by the Rasch IRT model used by many states for test
analysis.]

The maximum item discrimination value of 0.64 was always
associated with an item mark pattern in which all right marks and all wrong
marks were in two groups with no mixing of right and wrong marks. I loaded a
perfect Guttman mark pattern and found that 0.64 was the maximum corrected
value for this size of data set. (The corrected values are better estimates than
the uncorrected values in a small data set.)
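
The effect of mixing right and wrong marks can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not the VESEngine
calculation: "corrected" is taken here to mean the correlation between an
item's marks and each student's score with that item removed, and the Guttman
pattern below (12 students with scores 0 to 11 on 11 items) is hypothetical
rather than the set I loaded.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_discrimination(marks, item, corrected=True):
    """Correlate one item's marks with the student scores.

    Assumption: 'corrected' removes the item from each score first.
    """
    item_marks = [row[item] for row in marks]
    scores = [sum(row) - (row[item] if corrected else 0) for row in marks]
    return pearson_r(item_marks, scores)

# Hypothetical perfect Guttman pattern: a student with score s has the
# s easiest items right and all harder items wrong.
guttman = [[1 if j < score else 0 for j in range(11)] for score in range(12)]

# Mix one item: exchange the marks of the top and bottom students on item 5.
mixed = [row[:] for row in guttman]
mixed[0][5], mixed[11][5] = mixed[11][5], mixed[0][5]

pure_r = item_discrimination(guttman, 5)
mixed_r = item_discrimination(mixed, 5)
print(f"unmixed: {pure_r:.2f}  mixed: {mixed_r:.2f}")
```

The item keeps the same difficulty after the exchange (six right marks), yet
its discrimination drops because the right and wrong marks no longer fall into
two clean groups.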

Items of equal difficulty can have very different
discrimination values. In Table 26, three items have a difficulty of 7 right
marks. Their corrected discrimination values were 0.34 and 0.58.

Psychometricians have solved the problem this creates in
estimating test reliability by deleting an item and recalculating the test
reliability to find the effect of any item in a test. The VESEngine (free download
below) includes this feature as the Test Reliability (TR) toggle button. Test
reliability (KR20) and item discrimination (PBR) both depend on student
and item performance, and on each other. A change in one usually results in a
change in one or more of the other factors. [Student ability and item difficulty
are considered independent in Rasch model IRT analysis.] {I have yet to determine
if comparing CTT to IRT is a case of comparing apples to apples, apples to
oranges, or apples to cider.}
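
The delete-and-recalculate idea can be sketched in Python as follows. This is
not the VESEngine's TR routine; the function names and the 6 by 4 data set are
hypothetical, and the score variance is assumed to be the population variance
(divide by N).

```python
def kr20(marks):
    """KR20 for a list of student rows of 0/1 marks."""
    n = len(marks)
    k = len(marks[0])
    scores = [sum(row) for row in marks]
    mean = sum(scores) / n
    # Assumption: population variance of the student scores.
    var = sum((s - mean) ** 2 for s in scores) / n
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in marks) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var)

def kr20_if_item_deleted(marks):
    """Recalculate KR20 once for each item, with that item removed."""
    k = len(marks[0])
    return [kr20([row[:j] + row[j + 1:] for row in marks]) for j in range(k)]

# Hypothetical marks: 6 students by 4 items. The last item runs against
# the pattern of the other three.
demo = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
full = kr20(demo)
deleted = kr20_if_item_deleted(demo)
print(f"all items: {full:.3f}")
for j, value in enumerate(deleted):
    print(f"without item {j + 1}: {value:.3f}")
```

In this made-up set, removing the last item raises the test reliability, which
is the kind of effect the recalculation is meant to expose.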

Two additions to the model (Chart 72) are the two distributions of
the error MSS (black curve) and the portion of right and wrong marks (red
curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at
each end. Both are insensitive to the position of right marks in an item mark
pattern. The average scores for right and for wrong marks are sensitive to the
mark pattern, and the difference between these two values determines part of the
item discrimination value: PBR = (Proportion * Difference in Average Scores)/SD.
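
As a check on the PBR relation above, here is a minimal Python sketch using
made-up marks and scores (not Table 26). It assumes the standard point-biserial
form, in which the proportion term is the square root of (portion right *
portion wrong) and the SD is the population SD of the student scores; the
product without the square root is the curve that peaks at 1/4. Under those
assumptions the result matches the Pearson r between item marks and student
scores.

```python
from math import sqrt

def point_biserial(item_marks, scores):
    n = len(scores)
    right = [s for m, s in zip(item_marks, scores) if m == 1]
    wrong = [s for m, s in zip(item_marks, scores) if m == 0]
    p = len(right) / n          # portion of right marks
    q = 1 - p                   # portion of wrong marks
    mean_all = sum(scores) / n
    # Assumption: population SD (divide by N) of the student scores.
    sd = sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    diff = sum(right) / len(right) - sum(wrong) / len(wrong)
    return sqrt(p * q) * diff / sd

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical item marks and student scores for 8 students.
item_marks = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [9, 8, 7, 7, 6, 4, 3, 2]

pbr = point_biserial(item_marks, scores)
r = pearson_r(item_marks, scores)
print(f"PBR = {pbr:.4f}, Pearson r = {r:.4f}")
```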

Traditional classical test theory (CTT) test analysis can
use a range of average test scores. In this example there was no difference in
the analysis with average test scores of 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis
transforms normal counts into logits that have only one reference point, 50%
(zero logit), when student ability and item difficulty are positioned on one
common scale. This point is then extended in either direction by values that
represent equal student ability and item difficulty (50% right), from zero
to 100% (-50% to +50%). This scale ignores the
observed item discrimination.
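
The logit transformation itself is short enough to show. The Python sketch
below uses the standard log-odds formula and the standard dichotomous Rasch
probability; the function names are hypothetical, and it does not show how any
particular state or software package estimates abilities and difficulties.

```python
from math import exp, log

def logit(p):
    """Log-odds of a proportion p (0 < p < 1)."""
    return log(p / (1 - p))

def rasch_p_right(ability, difficulty):
    """Rasch probability of a right mark; both values are in logits."""
    return 1 / (1 + exp(-(ability - difficulty)))

# The average scores used in this post, as proportions of 11 items.
for right in (5, 6, 7, 8):
    print(f"{right} right of 11 -> {logit(right / 11):+.2f} logits")

# 50% right maps to zero logits, and equal ability and difficulty
# always give a 50% chance of a right mark.
zero_point = logit(0.5)
even_odds = rasch_p_right(1.2, 1.2)
print(zero_point, even_odds)
```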

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents
the common education statistics on one traditional two-dimensional Excel
spreadsheet. The post includes definitions. Download
as .xlsm or .xls.

This blog started five years ago. It has meandered through
several views. The current project is visualizing
the VESEngine in three dimensions. The observed student mark patterns (on their
answer sheets) are on one level. The variation in the mark patterns (variance)
is on the second level.

Power Up Plus (PUP) is classroom-friendly software used to
score and analyze what students guess (traditional multiple-choice) and what
they report as the basis for further learning and instruction (knowledge and
judgment scoring multiple-choice). This is a quick way to update your
multiple-choice to meet Common Core State Standards (promote understanding as
well as rote memory). Knowledge and judgment scoring originated as a classroom
project, starting in 1980, that converted passive pupils into self-correcting
highly successful achievers in two to nine months. Download as .xlsm or .xls.