Wednesday, October 31, 2012

An Assessment Worthy of the Common Core State Standards

The Common Core State Standards go beyond just knowing, believing and guessing. They demand an assessment that includes the judgment of psychometricians, teachers, and students. For the past decade, psychometricians have dominated, making judgments from statistical information. The judgment of teachers was given equal weight in 2009 in Nebraska (see prior post).

The power of student judgment needs to be discussed, along with a way of adding students as the third primary stakeholder in standardized testing. Currently the old alternative and authentic assessment movements are being resurrected as elaborate, time-consuming exercises. The purpose is to allow students to display their judgment in obtaining information, in processing it, and in making an acceptable (creative and innovative) report.

Traditional multiple-choice scoring, which only counts right marks, is correctly not included. Students have no option other than to mark. A good example is a test administered to a class of 20 students marking four-option questions (A, B, C, and D). On one question, five students mark each option. That question has 5 right out of 20 students, or a difficulty of 25%. There is no way to know what these students know. A marking pattern with an equal number of marks on each answer option indicates they were marking because they were forced to guess. They could not use the question to report what they actually trusted they knew. Student judgment is given no value in traditional right-count scored multiple-choice testing.

The opposite situation exists when multiple-choice is scored for quantity and quality. Student judgment has a powerful effect on an item analysis by producing more meaningful information from the same test questions. Student judgment is given equal weight to knowing by Winsteps (partial credit Rasch model IRT, the software many states use in their standardized testing programs) and by Power Up Plus (Knowledge and Judgment Scoring, a classroom oriented program). Scoring now includes A, B, C, D, and omit.

Continuing with the above example, eight different mark patterns related to student judgment are obtained, rather than the two obtained from traditional multiple-choice scoring. The first again has the same number of marks on each option and of omits (4 right, 4 wrong, 4 wrong, 4 wrong marks, and 4 omits). This again looks like a record of student luck on test day. I have rarely seen such a pattern in over 100 tests and 3000 students. Experienced students know to omit for one point rather than to guess and get zero points when they cannot trust using a question to report what they actually know or can do.

The next set of three patterns omits one of the wrong options (4 right, 4 wrong, 4 wrong, and 8 omits). Students know that one option is not right. They cannot distinguish between the other two wrong options (B & C, B & D, or C & D). By omitting they have uncovered this information, which is hidden in traditional test scoring where only right marks are counted.

In the second set of three patterns students know that two options are not right and they can distinguish between the remaining right and wrong options. Instead of a meaningless distribution of marks across the four options, we now know which wrong option students believe to be a right answer (B or C or D). [Both student judgment and item difficulty are at 50% as they have equal value.]

The last answer pattern occurs when students either mark a right answer or omit. There is no question that they know the right answer when using the test to report what they trust they know or can do.
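The count of eight patterns can be checked with a short sketch. It assumes, purely for illustration, that A is the right answer and that a pattern is identified by which wrong options still attract marks; the names WRONG_OPTIONS and mark_patterns are made up for this example.

```python
from itertools import combinations

# Assumption for illustration: A is the right answer, B, C and D are wrong.
WRONG_OPTIONS = ("B", "C", "D")


def mark_patterns():
    """List each pattern as the tuple of wrong options that still attract marks."""
    patterns = []
    # Start with all three wrong options active and end with none active.
    for active_count in range(len(WRONG_OPTIONS), -1, -1):
        patterns.extend(combinations(WRONG_OPTIONS, active_count))
    return patterns


patterns = mark_patterns()
for active in patterns:
    # Every pattern also contains right marks on A and some omits.
    print("A (right)", *active, "omit")
```

The loop yields one pattern with three active wrong options, three with two, three with one, and one with none, which matches the four sets described above.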

In summary, quantity and quality scoring allows students of all abilities to report and receive credit for what they know and can do, and also for their judgment in using their knowledge and skill. The resulting item analysis then specifically shows which wrong options are active. Inactive wrong options are not buried under a random distribution of marks produced by forced-choice scoring.

All four sets of mark patterns contain the same count of four right marks (any one of the options could be the right answer). Both scoring methods produce the same quality score (student judgment) when all items are marked (25%). When student judgment comes into play, however, the four sets of mark patterns require different levels of student judgment (25%, 33%, 50% and 100%).
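As a rough check on those four judgment levels, the sketch below treats the quality score as right marks divided by all marks made (omits excluded). That reading is an assumption, chosen because it reproduces the percentages quoted above; the counts come from the 20-student example, and the names are hypothetical.

```python
# (right, wrong, omit) counts for one question answered by 20 students.
MARK_SETS = {
    "three wrong options active": (4, 12, 4),
    "two wrong options active": (4, 8, 8),
    "one wrong option active": (4, 4, 12),
    "right or omit only": (4, 0, 16),
}


def judgment_quality(right, wrong):
    """Share of the marks actually made that are right (assumed definition)."""
    marked = right + wrong
    return right / marked if marked else None


quality = {
    name: judgment_quality(right, wrong)
    for name, (right, wrong, _omit) in MARK_SETS.items()
}
for name, value in quality.items():
    print(f"{name}: {value:.0%}")
```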

Right count scoring item difficulty is obtained by adding up the right (or wrong) marks (5 out of 20 or 25%). Quantity and quality scoring item difficulty is obtained by combining student knowledge (right counts, quantity) and student judgment (quality). Both Winsteps and Power Up Plus (PUP) give knowledge and judgment equal value. The four sets of mark patterns then indicate item difficulties of 30%, 40%, 50% and 60%.
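The four item difficulties can be reproduced with a small sketch. It assumes a weighting of two points for a right mark, one point for an omit, and zero for a wrong mark. The one-point omit and zero-point wrong mark come from the text above; the two-point right mark is an inference that happens to reproduce the 30%, 40%, 50% and 60% figures, and is not taken from the Winsteps or Power Up Plus documentation.

```python
# (right, wrong, omit) counts for the four sets of mark patterns, 20 students each.
MARK_SETS = [(4, 12, 4), (4, 8, 8), (4, 4, 12), (4, 0, 16)]


def kjs_difficulty(right, wrong, omit, right_pts=2, omit_pts=1):
    """Points earned over points possible; the 2/1/0 weights are an assumption."""
    students = right + wrong + omit
    return (right_pts * right + omit_pts * omit) / (right_pts * students)


def right_count_difficulty(right, students):
    """Traditional scoring: only right marks count."""
    return right / students


difficulties = [kjs_difficulty(*counts) for counts in MARK_SETS]
for counts, difficulty in zip(MARK_SETS, difficulties):
    print(counts, f"{difficulty:.0%}")

# Forced-choice example from earlier in the post: 5 right out of 20.
print(f"{right_count_difficulty(5, 20):.0%}")
```

Under these assumed weights, each student who moves from a wrong mark to an omit adds one point to the question's total, which is why the same four right marks can yield four different difficulties.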

[Abler students always make questions look easier. Measuring student quality makes questions look easier than when just counting right marks and ignoring student judgment. The concept of knowledge and judgment is combined into one term, the location on a logit scale (natural log of the ratio of right to wrong marks), for person ability (and the natural log of the ratio of wrong to right marks for item difficulty) with Rasch model IRT using Winsteps. The normal scale of 0 to 50% to 100% is replaced with a logit scale of about -5 to zero to +5.]
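The change of scale in the bracketed note can be illustrated with the raw log-odds transform it describes. This is only the starting arithmetic, not the estimation procedure Winsteps actually runs, and the function names are hypothetical. Counts of zero right or zero wrong marks have no finite logit and are not handled here.

```python
import math


def person_ability(right, wrong):
    """Natural log of the ratio of right to wrong marks."""
    return math.log(right / wrong)


def item_difficulty(right, wrong):
    """Natural log of the ratio of wrong to right marks."""
    return math.log(wrong / right)


def percent_to_logit(proportion):
    """Map a proportion strictly between 0 and 1 onto the logit scale."""
    return math.log(proportion / (1 - proportion))


print(percent_to_logit(0.5))     # the 50% point maps to zero logits
print(percent_to_logit(0.993))   # scores near 100% land close to +5
print(item_difficulty(5, 15))    # 5 right, 15 wrong: a positive (hard) item
```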

Quantity and quality scoring provides specific information about which answer options are active, the level of thinking students are using, and the relative difficulty of questions that have the same number of right marks. IMHO this qualifies it as the method of choice for scoring Common Core State Standards multiple-choice items (and for preparation for such tests).

Forced guessing is no longer required to obtain results that look right. Experienced students prefer quantity and quality scoring. It is far more meaningful than playing the traditional role of an academic casino gambler.
