Statistic Six: Item discrimination, the last statistic
in this series of posts, captures the ability of an item to group students by
what they know (and by what they have yet to learn, with Knowledge and Judgment Scoring or partial credit Rasch model
scoring). Previous posts have indicated that this ability may be the primary
consideration in selecting items for standardized tests. It is also important in the classroom:
discriminating items produce the spread of scores needed for setting grades in
schools designed for failure.
I left this statistic for last because it is a bit different from
the others: it is more complex and more difficult to calculate. However, the
standard error of measurement (SEM) engine from post 8 needed only one more step
to have the numbers in hand for calculating the Pearson r estimate of item
discrimination.
Pearson worked out his item discrimination in a manner that
follows from the previous posts. He did this by 1895, long before we had personal
computers. As a consequence we now have two versions: the original
uncorrected estimate (the Excel PEARSON function) and the corrected estimate. There
is also a shortcut for traditional multiple-choice (TMC) tests, the point
biserial r (PBR), which I consider at the end of this post.
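For readers who want to compute both versions, here is a minimal Python sketch. It assumes the corrected estimate is the usual part-whole correction, in which the item being examined is removed from each student's total score before correlating; the marks matrix and function name are illustrative placeholders, not part of PUP or the engine below.

    import numpy as np

    def item_discrimination(marks, item, corrected=False):
        # marks: students-by-items matrix of 1 (right) and 0 (wrong) marks.
        totals = marks.sum(axis=1)               # each student's score
        if corrected:
            totals = totals - marks[:, item]     # drop the item from its own total
        # Pearson r between the item column and the student scores.
        return np.corrcoef(marks[:, item], totals)[0, 1]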
A visual presentation of the Pearson item discrimination
calculation follows (see Table 11 for the calculations).
First, the right marks in
the Item 4 column of the Guttman table (Table 12) are counted (10 of the 22
students), the item mean obtained (10/22 = 0.45), and the deviations from that mean obtained (Chart
20).
The same process is carried
out on the student score column (a right-mark total, RT, of 369 and a SCORE MEAN of 369/22 = 16.77;
see Chart 21).
Each of these two charts of deviations sums to zero.
This time the individual values are not squared to make them all positive, as was done in
Charts 22 (scores) and 23 (items). Instead, each item deviation is multiplied by the
related score deviation to produce positive and negative cross-products (Chart 24 and Table 11)
that sum to 13.27.
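As a Python sketch of Charts 20 through 24 (assuming item_marks is the 0/1 column for one item and scores is the matching column of student scores, both NumPy arrays):

    import numpy as np

    def cross_product_sum(item_marks, scores):
        item_dev = item_marks - item_marks.mean()    # Chart 20: deviations sum to zero
        score_dev = scores - scores.mean()           # Chart 21: deviations sum to zero
        # Chart 24: related deviations multiplied, not squared; 13.27 for Item 4.
        return (item_dev * score_dev).sum()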
The item discrimination is then the ratio of this sum of cross-products to a
denominator built from two sums of squares (SS). The operation is carried out for each item on the test.
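In the post's own Sqrt() notation, the calculation for one item can be written as:

    Pearson r = sum of cross-products / (Sqrt(item SS) x Sqrt(score SS))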
Taking the square root of each SS in the denominator and multiplying them
yields a grand SS (2.34 x 9.49 = 22.21); because the deviations are squared, this
denominator is always positive. The resulting ratio (13.27/22.21 = 0.60) is the discrimination
ability of the item. It can range from minus one to plus one. Values
above 0.9 are characteristic of standardized tests. Values for classroom tests
will be discussed later.
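The same ratio in a short Python sketch, plugging in the sums reported above for Item 4 (the three numbers come from Table 11; they are not computed here):

    sum_cross = 13.27       # sum of the item x score deviation cross-products
    root_item_ss = 2.34     # square root of the item sum of squares
    root_score_ss = 9.49    # square root of the score sum of squares

    r = sum_cross / (root_item_ss * root_score_ss)
    print(round(r, 2))      # 0.6, the discrimination ability of Item 4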
Table 12 contains an Item Discrimination Engine you can
use to explore the discrimination ability of individual items. [Download free
from http://www.nine-patch.com/download/IDEngine.xlsm
or .xls]
The point
biserial r (PBR) provides an additional glimpse into what is taking place
(Table 13). The difference between
the average score of students marking right and of students marking wrong (18.1 – 15.67 = 2.43) is standardized
by dividing by the standard deviation of student scores (2.43/2.07 = 1.176). Multiplying this standardized
difference (1.176) by the square root of the product of the proportions
(p and q) of right and wrong marks, Sqrt(0.45 x 0.55) = 0.50, yields the PBR item discrimination of
0.59.
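A quick check of the PBR arithmetic in Python, using the rounded values from Table 13 (small rounding differences from the 0.59 above are expected):

    from math import sqrt

    mean_right, mean_wrong = 18.1, 15.67   # average score of right and wrong markers
    sd = 2.07                              # standard deviation of student scores
    p, q = 0.45, 0.55                      # proportions of right and wrong marks

    pbr = (mean_right - mean_wrong) / sd * sqrt(p * q)
    print(round(pbr, 2))                   # 0.58 with these rounded inputs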
The real value or meaning of an item discrimination rank seems
to be a matter of tradition and advances in computing power. PUP 5.20 prints
out corrected item discrimination values, to which I gave the following rankings for
my classroom tests:
[The PBR only works for traditional multiple-choice, which
only ranks students. PUP contains the Pearson r, which is required for Knowledge
and Judgment Scoring, an actual assessment of what students know and can do
that is meaningful and useful in future assignments.]
Item discrimination weights each right and wrong mark with
the related student score, so different column mark patterns produce different
results. Unlike with test reliability, the order, or pattern, of marks matters
when calculating item discrimination. Items of the same difficulty can have
very different discrimination ability; for example, items 11, 14, 15, 16, and 18,
all with a difficulty of 91%, span item discriminations from -0.02 to 0.58
(Chart 25).
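A small made-up example (not the Nursing124 data) shows how two items of identical difficulty can discriminate very differently, depending only on which students mark them right:

    import numpy as np

    # Eight hypothetical students, listed from highest to lowest score.
    scores = np.array([21, 20, 19, 18, 16, 15, 13, 12])

    # Both items are marked right by 4 of the 8 students (difficulty 50%).
    item_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # right marks cluster at the top
    item_b = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # right marks are scattered

    for name, item in (("Item A", item_a), ("Item B", item_b)):
        r = np.corrcoef(item, scores)[0, 1]
        print(name, round(r, 2))   # Item A about 0.9, Item B 0.0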
Selecting difficult items is not sufficient to maximize test
reliability. The primary need is to write discriminating items. The Nursing124
data delivered discriminating items at all levels of difficulty from 45% to 91%
(Chart 25).
The item discrimination results seemed to me to be as
unpredictable as test reliability results. IMHO only a visual education
statistics engine that combines all six statistics can readily display the
interactions.
The standard error of student score measurement (SEM),
the test reliability (KR20 and alpha), and the item discrimination (Pearson
and PBR) have unpredictable interactions. The Test Performance Profile from PUP
5.20 brings these together in one table for easy use in the classroom by
students and teachers (and other interested persons) but lacks the flexibility
of a single-sheet spreadsheet engine.
[PUP 5.20 only
prints the PBR ranks, as an efficient aid for teachers. An additional aid is
provided by sorting the discriminating items on PUP 5.20, sheet 3a, Student
Counseling Mark Matrix with Mastery/Easy, Unfinished, and Discriminating (MUD)
Analysis.]
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):