Wednesday, July 24, 2013

Visual Education Statistics - Equipercentile Equating

Equipercentile equating frequently appears in NCLB testing articles. I took a normal distribution of 40 student scores (average of 50%) with a standard deviation (SD) of 10% (new test) and equated it to one with a SD of 20% (reference test) to see how equipercentile equating works (Chart 54).
First I grouped the scores into 5%-ranges. I then matched the new test groups to the reference test groups (Chart 55). The result was a bit messy. 
A re-plot of the twenty 5%-groups shows that the new test has been sliced into groups containing twice the count of the reference test's groups, and that these groups match the reference test, in general, at every other group (Chart 56).
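The matching step described above can be sketched in code. This is a minimal illustration with simulated scores (not the actual Chart 54 and 55 data): each new-test score is mapped to the reference-test score at the same percentile rank.

```python
# A minimal sketch of equipercentile equating with simulated data
# (hypothetical values, not the actual chart data from this post).
import numpy as np

rng = np.random.default_rng(0)
new_test = rng.normal(50, 10, 40)   # new test: mean 50%, SD 10%
ref_test = rng.normal(50, 20, 40)   # reference test: mean 50%, SD 20%

def equate(scores, new_dist, ref_dist):
    """Map each score to the reference-test score at the same percentile rank."""
    # Percentile rank of each score within the new-test distribution.
    ranks = np.array([np.mean(new_dist <= s) * 100 for s in scores])
    # Reference-test score at that same percentile rank.
    return np.percentile(ref_dist, ranks)

equated = equate(new_test, new_test, ref_test)
```

After equating, the transformed scores take on roughly the spread of the reference test, which is the point of the procedure: the two tests are placed on a common scale through their percentile ranks.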
Smoothing by inspection resulted in Chart 57. It produced a perfect fit to the reference test, with the exception of rounding errors.
Smoothing on “small samples of test-takers” does make a difference in the accuracy of equipercentile equating. “The improvement that resulted from smoothing the distributions before equating was about the same as the improvement that resulted from doubling the number of test-takers in the samples” (Livingston, 2004, page 21). [See Post 13, Chart 34, in this series for the effect of doubling the number of test-takers on the SD and SEM.]
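As an illustration of presmoothing, here is a simple 3-point moving average applied to grouped frequencies. This is a stand-in for the smoothing by inspection described above, not Livingston's actual method:

```python
# A minimal smoothing sketch: 3-point moving average over grouped
# frequency counts (an illustrative stand-in, not Livingston's method).
def smooth(freqs):
    out = []
    for i in range(len(freqs)):
        # Average each count with its immediate neighbors (edges use
        # whatever neighbors exist).
        window = freqs[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out
```

Smoothing a jagged small-sample distribution in this way evens out the gaps and spikes (like the every-other-group mismatch in Chart 56) before the percentile ranks are computed.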

I then entered the values from Charts 54, 55, and 57 into my visual education statistics engine (VESE). Equipercentile equating the student scores transformed the new test into the reference test including the related group statistics (Chart 58).
The three 5%-groupings show almost identical values. Grouping reduced the item discrimination ability (PBR) of the reference test a small amount, because grouping narrowed the range of the student score distribution. This works very nicely in a perfect world; however, real test scores do not align perfectly with the normal curve.

A much more detailed description of equipercentile equating and smoothing is found in Livingston (2004, pages 17-24). The easy-to-follow, illustrated examples include real test results and related problems, with a troubling resolution: “Often the choice of an equating method comes down to a question of what is believable, given what we know about the test and the population of test-takers.”

This highly subjective statement was acceptable in 2004. NCLB put pressure on psychometricians to do better, and the CCSS movement has raised the bar again. The subjectivity expressed here is, IMHO, similar to that found in Rasch model IRT analysis, which has been popular with state departments of education. Both methods, with and without IRT, base results on a relationship to an unknowable “population of test-takers”. Both pursue manipulations that end up with the results “looking right”.

[The classroom equivalent of this, practiced in Missouri prior to NCLB, was to divide the normal curve into parts for letter grades. One version was to assign grades to ranked student scores with uniform slices. True believers assigned a double portion to “C”. Every class was then a “normal” class with no way to know what the raw scores were or what students actually knew or could do.]  
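The curve-grading practice described in the bracketed note above can be sketched as follows. The band fractions here are illustrative assumptions (one share per letter, a double share for “C”), not documented Missouri policy:

```python
# A minimal sketch of "normal curve" grading: rank the scores and slice
# the ranking into letter-grade bands, with a double-width "C" band.
# Band fractions are illustrative assumptions, not an actual policy.
def curve_grades(scores):
    # Indices of students, ranked from highest score to lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Bands as fractions of the class: A, B, C (double portion), D, F.
    bands = [("A", 1 / 6), ("B", 1 / 6), ("C", 2 / 6), ("D", 1 / 6), ("F", 1 / 6)]
    grades = [None] * len(scores)
    start = 0
    for letter, frac in bands:
        end = start + round(frac * len(scores))
        for i in order[start:end]:
            grades[i] = letter
        start = end
    for i in order[start:]:  # any rounding remainder gets the last letter
        grades[i] = "F"
    return grades
```

Note that the letter grades depend only on rank, which is exactly the problem the note raises: the raw scores, and what students actually knew or could do, disappear from the result.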

It does not have to be that way. Let students report what they actually know and can do. Let them report what they trust will be of value for further learning and for application in situations other than the one in which they learned. Do multiple-choice right. Get results comparable to essay, project, report, and research. Promote student development. Knowledge and Judgment Scoring and partial credit Rasch model analysis do this. Guessing is no longer needed. Forced guessing should not be tolerated, IMHO.

The move to performance-based learning may, this time, not only compete with the CCSS movement assessments but replace them. The system that is the leanest, the most versatile in meeting student needs, and the most immune to erratic federal funding, and thus the most effective, will survive.
- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):
