Wednesday, June 18, 2014

Small Sample Math Model - Item Discrimination

#7
The ability of an item to place students into two distinct groups is not a part of the mathematical model developed in the past few posts. Discrimination ability, however, provides insight into how the model works. A practical standardized test must have student scores spread out enough to assign desired rankings. Discriminating items produce this spread of student scores.

Current CCSS multiple-choice standardized test scoring only ranks students; it does not tell us what a student actually knows that is useful and meaningful to the student as the basis for further learning and effective instruction. That can be done with Knowledge and Judgment Scoring and the partial-credit Rasch IRT model using the very same tests. This post uses traditional scoring because it simplifies the analysis (and the model) to just right and wrong; no judgment or higher levels of thinking are required of students.

I created a simple data set of 12 students and 11 items (Table 26) with an average score of 5. I then modified this set to produce average scores of 6, 7, and 8 (Table 27). [This can also be considered as the same test given to students in grades 5, 6, 7, and 8.]
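
Table 26 is not reproduced here, so the sketch below (in Python, with numpy) builds a hypothetical 12-student by 11-item matrix of right (1) and wrong (0) marks with a chosen average score. The offsets, the random seed, and the function name make_marks are my illustration only; the actual marks in Table 26 were set by hand, so this data will not reproduce the table values.

import numpy as np

rng = np.random.default_rng(0)

def make_marks(n_students=12, n_items=11, avg_score=5):
    """Hypothetical 0/1 mark matrix (rows = students, columns = items)
    whose average student score equals avg_score."""
    offsets = [-2, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 2]   # sum to zero, so the mean holds
    marks = np.zeros((n_students, n_items), dtype=int)
    for row, off in zip(marks, offsets):
        right = rng.choice(n_items, size=avg_score + off, replace=False)
        row[right] = 1                                   # this student's right marks
    return marks

marks = make_marks()                                     # average score of 5 right
print(marks.sum(axis=1).mean())                          # 5.0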

The item error mean sum of squares (MSS), the variance, for the test with an average score of 8 was 1.83. I then adjusted the MSS for the other three grades to match this value. Each adjustment exchanged a right and a wrong mark within a student mark pattern (row, Table 27). I stopped at 1.85, 1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This forced the KR20 = 0.495 and the SEM = 1.36 to remain the same for all four sets.)
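
Continuing the sketch above, KR20 and the SEM fall out of the same counts: the sum of the item variances (p times q, the item error MSS term), the variance of the student scores, and the number of items. The function name kr20_and_sem is mine, and the snippet uses population (n) variance where a spreadsheet may use n - 1, so the 0.495 and 1.36 above depend on the actual Table 27 data.

def kr20_and_sem(marks):
    """KR20 reliability and standard error of measurement for a 0/1 mark matrix."""
    k = marks.shape[1]                     # number of items
    p = marks.mean(axis=0)                 # item difficulty as proportion right
    item_var_sum = (p * (1 - p)).sum()     # sum of item variances (p * q)
    scores = marks.sum(axis=1)             # student total scores
    kr20 = (k / (k - 1)) * (1 - item_var_sum / scores.var())
    sem = scores.std() * np.sqrt(1 - kr20)
    return kr20, sem

kr20, sem = kr20_and_sem(marks)
print(round(kr20, 3), round(sem, 2))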

The average item difficulty (Table 27) varied, as expected, with the average test score. The average item discrimination (Pearson r and PBR) (Table 28) was stable. In general, with a few outliers in this small data set, the most discriminating items had the same difficulty as the average test score. [This tendency for item discrimination to be maximized at the average test score is a basic component of the Rasch IRT model, which, by design, must use the 50% point.]
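
Item difficulty (right-mark count) and the corrected discrimination (the point-biserial between an item and the total score with that item removed) can be tabulated the same way. This sketch reuses the marks and numpy import from the first snippet; item_stats is my own name, not a VESEngine label.

def item_stats(marks):
    """Item difficulty (right-mark count) and corrected point-biserial:
    the correlation between an item and the total score with that item removed."""
    totals = marks.sum(axis=1)
    stats = []
    for j in range(marks.shape[1]):
        item = marks[:, j]
        rest = totals - item                        # total score minus this item
        if item.std() == 0:                         # constant item: discrimination undefined
            pbr = float('nan')
        else:
            pbr = np.corrcoef(item, rest)[0, 1]
        stats.append((int(item.sum()), round(pbr, 2)))
    return stats

for difficulty, pbr in item_stats(marks):
    print(difficulty, pbr)                          # difficulty in right marks, corrected PBR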

The scatter chart (Chart 71) has sufficient detail to show that items tend to be most discriminating when their difficulty is near the average test score (not just near 50%).

The question is often asked, “Do tests have to be designed for an average score of 50%?”  If the SD remains the same, I found no difference in the KR20 or SEM. [The observed SD is ignored by the Rasch IRT model used by many states for test analysis.]

The maximum item discrimination value of 0.64 was always associated with an item mark pattern in which all right marks and all wrong marks fell into two separate groups, with no mixing. I loaded a perfect Guttman mark pattern and found that 0.64 was the maximum corrected value for a data set of this size. (The corrected values are better estimates than the uncorrected values in a small data set.)
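
A sketch of that Guttman check, reusing item_stats from the earlier snippet: a perfect Guttman pattern orders students and items so that right and wrong marks never mix within an item column. The maximum corrected value depends on the exact score distribution, so the printed number stands in for, rather than reproduces, the 0.64.

def guttman_marks(n_students=12, n_items=11):
    """Perfect Guttman pattern: student i gets only the i easiest items right,
    so right and wrong marks never mix within any item column."""
    marks = np.zeros((n_students, n_items), dtype=int)
    for i in range(n_students):
        marks[i, :min(i, n_items)] = 1              # scores run 0, 1, ..., 11
    return marks

g = guttman_marks()
best = max(pbr for _, pbr in item_stats(g) if not np.isnan(pbr))
print(round(best, 2))                               # ceiling on corrected discrimination here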

Items of equal difficulty can have very different discrimination values. In Table 26, three items have a difficulty of 7 right marks. Their corrected discrimination values ranged from 0.34 to 0.58.

Psychometricians have solved the problem this variation creates in estimating test reliability by deleting an item and recalculating the test reliability to find the effect of any single item on a test. The VESEngine (free download below) includes this feature: the Test Reliability (TR) toggle button. Test reliability (KR20) and item discrimination (PBR) are interdependent on student and item performance; a change in one usually results in a change in one or more of the other factors. [Student ability and item difficulty are considered independent in a Rasch model IRT analysis.] {I have yet to determine whether comparing CTT to IRT is a case of comparing apples to apples, apples to oranges, or apples to cider.}
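
A minimal sketch of the delete-and-recalculate idea, reusing kr20_and_sem from above; I am assuming the VESEngine's TR toggle performs an equivalent recalculation in the spreadsheet.

def kr20_if_deleted(marks):
    """Recompute KR20 with each item removed in turn; the change in reliability
    shows the effect of that one item on the whole test."""
    full, _ = kr20_and_sem(marks)
    for j in range(marks.shape[1]):
        reduced = np.delete(marks, j, axis=1)       # drop item j
        kr20_j, _ = kr20_and_sem(reduced)
        print(f"item {j}: KR20 without it = {kr20_j:.3f} (full test {full:.3f})")

kr20_if_deleted(marks)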

Two additions to the model (Chart 72) are the two distributions of the item error MSS (black curve) and of the portion of right and wrong marks (red curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at each end. Both are insensitive to the position of right marks within an item mark pattern. The average score for right marks and the average score for wrong marks, however, are sensitive to the mark pattern, as the difference between these two values determines part of the item discrimination value: PBR = (proportion term * difference in average scores) / SD.
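
Written out in the standard point-biserial form, which appears to be what the shorthand above abbreviates, the proportion term is the square root of p times q, so an uncorrected PBR can be computed directly. The name uncorrected_pbr is mine, and the snippet reuses the hypothetical marks from the first sketch.

def uncorrected_pbr(marks, j):
    """Uncorrected point-biserial for item j:
    (mean score of right-markers - mean score of wrong-markers) * sqrt(p * q) / SD."""
    totals = marks.sum(axis=1)
    item = marks[:, j]
    p = item.mean()                                 # proportion of right marks
    diff = totals[item == 1].mean() - totals[item == 0].mean()
    return diff * np.sqrt(p * (1 - p)) / totals.std()

print(round(uncorrected_pbr(marks, 0), 2))
print(0.5 * (1 - 0.5))                              # p * q peaks at 1/4 at the 50% point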

Traditional classical test theory (CTT) test analysis can use a range of average test scores. In this example there was no difference in the analysis with average test scores from 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis transforms normal counts into logits that have only one reference point, 50% (zero logits), where student ability and item difficulty are positioned on one common scale. This point is then extended in either direction by values at which student ability equals item difficulty (50% right), from zero to 100% (-50% to +50%). This scale ignores the observed item discrimination.
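
The logit itself is just the log-odds of a right mark, zero at the 50% reference point; a minimal sketch:

import math

def logit(p):
    """Log-odds of a right mark: zero logits at p = 0.5, the single reference
    point where student ability equals item difficulty."""
    return math.log(p / (1 - p))

for p in (0.27, 0.50, 0.73):
    print(f"{p:.0%} right -> {logit(p):+.2f} logits")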

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.


Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (Knowledge and Judgment Scoring multiple-choice). It is a quick way to update your multiple-choice tests to meet Common Core State Standards (promoting understanding as well as rote memory). Knowledge and Judgment Scoring originated as a classroom project, started in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.