The Common Core State Standards go beyond just knowing,
believing, and guessing. They demand an assessment that includes the judgment of
psychometricians, teachers, and students. For the past decade, psychometricians
have dominated, making judgments from statistical information. Teacher
judgment was given equal weight in 2009 in Nebraska (see prior post).
The power of student judgment needs to be discussed, along with a
way of adding students as the third primary stakeholder in standardized testing.
Currently the old alternative and authentic assessment movements are being resurrected
into elaborate, time-consuming exercises. Their purpose is to let students
display their judgment in obtaining information, in processing it, and in making
an acceptable (creative and innovative) report.
Traditional multiple-choice scoring, which only counts right
marks, is correctly not included. Students have no option other than to mark. A
good example is a test administered to a class of 20 students marking four-option
questions (A, B, C, and D). Five
students mark each option on one question. That question has 5 right marks out of 20
students, or a difficulty of 25%. There is no way to know what these students
know. A marking pattern with an equal number of marks on each answer option
indicates they were marking because they were forced to guess. They could not
use the question to report what they actually trusted they knew. Student judgment
is given no value in traditional right-count-scored multiple-choice testing.
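A minimal sketch of that right-count arithmetic, assuming option A is the keyed answer:

    # Right-count scoring of one question: 20 students, 5 marks on each of the
    # four options. Only right marks earn credit; option "A" is assumed to be the key.
    marks = {"A": 5, "B": 5, "C": 5, "D": 5}
    right = marks["A"]
    total = sum(marks.values())                # 20 students, every one forced to mark
    difficulty = right / total                 # 5 / 20 = 0.25
    print(f"Right-count difficulty: {difficulty:.0%}")   # 25%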
The opposite situation exists when multiple-choice is scored
for quantity and quality. Student judgment has a powerful effect on an item
analysis, producing more meaningful information from the same test questions.
Student judgment is given equal weight to knowing by Winsteps (partial-credit Rasch
model IRT, the software many states use in their standardized testing programs)
and by Power Up Plus (Knowledge and
Judgment Scoring, a classroom-oriented program). Scoring now includes A, B, C, D, and omit.
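A sketch of one plausible point scheme: the post states that an omit earns one point and a wrong mark zero; valuing a right mark at two points is an assumption made here, chosen so that knowledge and judgment carry equal value.

    # Point values for quantity-and-quality scoring of one response.
    # Omit = 1 and wrong = 0 come from the post; right = 2 is an assumption.
    POINTS = {"right": 2, "omit": 1, "wrong": 0}

    def score_response(response: str, key: str) -> int:
        """Score a single response: an option letter ('A'-'D') or 'omit'."""
        if response == "omit":
            return POINTS["omit"]
        return POINTS["right"] if response == key else POINTS["wrong"]

    # With key "A": marking "A" earns 2, omitting earns 1, marking "B" earns 0.
    print(score_response("A", "A"), score_response("omit", "A"), score_response("B", "A"))

Under any scheme like this, omitting is never worth less than a wrong guess, which is why experienced students prefer to omit when they cannot trust an answer.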
Continuing with the above example, eight different mark patterns
related to student judgment are obtained, rather than the two obtained from
traditional multiple-choice scoring. The first pattern again has an equal number
of responses in every category (4 right, 4 wrong, 4 wrong, 4 wrong
marks, and 4 omits). This again looks like a record of student luck on test
day. I have rarely seen such a pattern in over 100 tests and 3000 students.
Experienced students know to omit for one point rather than to guess and get
zero points when they cannot trust using a question to report what they
actually know or can do.
The next set of three patterns omits one of the wrong
options (4 right, 4 wrong, 4 wrong marks, and 8 omits). Students know that one option
is not right, but they cannot distinguish between the other two wrong options (B
& C, B & D, and C & D). By omitting, they have uncovered this
information, which is hidden in traditional test scoring where only right marks
are counted.
In the second set of three patterns, students know that two
options are not right and mark whichever remaining option they trust is right
(4 right, 4 wrong marks, and 12 omits). Instead of a meaningless distribution of marks across the four
options, we now know which wrong option students believe to be a right answer
(B or C or D). [Both student judgment and item difficulty are at 50%, as they
have equal value.]
The last answer pattern occurs when students either mark a
right answer or omit (4 right marks and 16 omits). There is no question that they know the right answer when
using the test to report what they trust they know or can do.
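Assuming option A is the keyed right answer, the eight patterns can be listed explicitly; this is a sketch for illustration, not output from either scoring program:

    from itertools import combinations

    # The right answer is assumed to be option A. Each pattern keeps 4 right
    # marks and 4 marks on every wrong option that is still "active"; the
    # remaining students omit. 1 + 3 + 3 + 1 = 8 patterns in all.
    wrong_options = ["B", "C", "D"]
    patterns = []
    for active in (tuple(wrong_options),             # 1: all wrong options marked
                   *combinations(wrong_options, 2),  # 3: one wrong option omitted
                   *combinations(wrong_options, 1),  # 3: two wrong options omitted
                   ()):                              # 1: right mark or omit only
        omits = 20 - 4 * (1 + len(active))           # 20 students, 4 marks per active option
        patterns.append({"right": 4,
                         "wrong": 4 * len(active),
                         "omit": omits,
                         "active wrong options": list(active)})

    for p in patterns:
        print(p)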
In summary, quantity and quality scoring allows students of
all abilities to report and receive credit for what they know and can do, and also
for their judgment in using their knowledge and skill. The resulting item
analysis then specifically shows which wrong options are active. Inactive wrong
options are not buried under a random distribution of marks produced by
forced-choice scoring.
All four sets of mark patterns contain the same count of
four right marks (any one of the options could be the right answer). Both scoring
methods produce the same quality score (student judgment) when every student
marks and no one omits (25%). When student judgment comes into play, however, the four sets of
mark patterns reflect different levels of judgment (25%, 33%, 50%, and
100%).
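A short sketch of the quality calculation for the four pattern sets, using right marks as a share of all marks made (the counts are those given above):

    # Quality (student judgment) on one question: right marks as a share of
    # all marks made; omits are not counted as marks.
    patterns = {
        "all options marked":        {"right": 4, "wrong": 12, "omit": 4},
        "one wrong option omitted":  {"right": 4, "wrong": 8,  "omit": 8},
        "two wrong options omitted": {"right": 4, "wrong": 4,  "omit": 12},
        "right mark or omit only":   {"right": 4, "wrong": 0,  "omit": 16},
    }
    for name, counts in patterns.items():
        quality = counts["right"] / (counts["right"] + counts["wrong"])
        print(f"{name}: quality = {quality:.0%}")
    # prints 25%, 33%, 50%, and 100%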
Right-count scoring obtains item difficulty by adding up
the right (or wrong) marks (5 right out of 20, or 25%). Quantity and quality scoring obtains
item difficulty by combining student knowledge (right counts,
quantity) and student judgment (quality). Both Winsteps and Power Up Plus (PUP) give knowledge and judgment equal
value. The four sets of mark patterns then indicate item difficulties of 30%,
40%, 50%, and 60%.
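One way to reproduce those figures is to give each right mark full credit and each omit half credit (wrong marks none); this is a sketch of the arithmetic only, not the internal computation used by Winsteps or PUP:

    # Quantity-and-quality item difficulty: right marks count fully, omits count half.
    STUDENTS = 20
    patterns = {
        "all options marked":        {"right": 4, "omit": 4},
        "one wrong option omitted":  {"right": 4, "omit": 8},
        "two wrong options omitted": {"right": 4, "omit": 12},
        "right mark or omit only":   {"right": 4, "omit": 16},
    }
    for name, counts in patterns.items():
        difficulty = (counts["right"] + 0.5 * counts["omit"]) / STUDENTS
        print(f"{name}: difficulty = {difficulty:.0%}")
    # prints 30%, 40%, 50%, and 60%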
[Abler students always make questions look easier. Measuring
student quality also makes questions look easier than just counting right marks
and ignoring student judgment. With Rasch model IRT in Winsteps, knowledge and
judgment are combined into one term: a location on a logit scale. Person
ability is the natural log of the ratio of right to wrong marks; item
difficulty is the natural log of the ratio of wrong to right marks. The normal
scale of 0% to 50% to 100% is replaced with a logit scale running from about
-5 through zero to +5.]
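A sketch of that conversion in its simplest raw-ratio form (Winsteps estimates these locations iteratively, so its reported values will differ):

    import math

    def ability_logit(right: int, wrong: int) -> float:
        # Person ability: natural log of the ratio of right to wrong marks.
        return math.log(right / wrong)

    def difficulty_logit(right: int, wrong: int) -> float:
        # Item difficulty: natural log of the ratio of wrong to right marks.
        return math.log(wrong / right)

    print(round(ability_logit(10, 10), 1))    #  0.0  (50% right sits at zero logits)
    print(round(ability_logit(5, 15), 1))     # -1.1  (25% right)
    print(round(ability_logit(99, 1), 1))     #  4.6  (99% right approaches +5)
    print(round(difficulty_logit(5, 15), 1))  #  1.1  (a hard item sits above zero)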
Quantity and quality scoring provides specific information
about which answer options are active, the level of thinking students are
using, and the relative difficulty of questions that have the same number of
right marks. IMHO this qualifies it as the method of choice for scoring Common
Core State Standards multiple-choice items (and for preparation for such tests).
Forced guessing is no longer required to obtain results that
look right. Experienced students prefer quantity and quality scoring. It is far
more meaningful than playing the traditional role of an academic casino
gambler.