Equipercentile equating frequently appears in NCLB testing
articles. I took a normal distribution of 40 student scores (average of 50%) with
a standard deviation (SD) of 10% (new test) and equated it to one with an SD of
20% (reference test) to see how equipercentile equating works (Chart 54).
First I grouped the scores into 5% ranges. I then matched
the new test groups to the reference test groups (Chart 55). The result was a
bit messy.
A replot of the twenty 5% groups shows that the new test has been sliced
into groups containing twice as many scores as the reference test groups, but
which match the reference test, in general, at every other group (Chart 56).
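For equal-size samples, the rank-matching just described can be sketched in a few lines of Python. This is an illustrative sketch, not the code behind the charts; the data are regenerated from the normal curve (40 scores, mean 50%, SD 10% vs. 20%) rather than taken from Chart 54:

```python
import statistics

def equipercentile_equate(new_scores, ref_scores):
    """Equate equal-size samples by percentile rank: the score at
    rank i on the new test maps to the score at rank i on the
    reference test (bare-bones; no grouping or smoothing)."""
    assert len(new_scores) == len(ref_scores)
    ranked_ref = sorted(ref_scores)
    # Positions of the new-test scores, lowest to highest.
    order = sorted(range(len(new_scores)), key=lambda i: new_scores[i])
    equated = [0.0] * len(new_scores)
    for rank, i in enumerate(order):
        equated[i] = ranked_ref[rank]
    return equated

# Illustrative data mirroring the post: 40 scores at evenly spaced
# percentiles of a normal curve, mean 50%, SD 10% (new test) and
# SD 20% (reference test).
ps = [(k + 0.5) / 40 for k in range(40)]
new = [statistics.NormalDist(50, 10).inv_cdf(p) for p in ps]
ref = [statistics.NormalDist(50, 20).inv_cdf(p) for p in ps]
eq = equipercentile_equate(new, ref)
# Equating leaves the mean near 50% but gives the new-test
# scores the reference test's spread (SD near 20%).
```

After equating, each student keeps their rank, but the score scale becomes the reference test's scale, which is the whole point of the method.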
Smoothing by inspection resulted in Chart 57: a perfect fit
to the reference test, with the exception of rounding errors.
Smoothing on “small samples of test takers” does make a difference in the
accuracy of equipercentile equating. “The improvement that resulted from
smoothing the distributions before equating was about the same as the
improvement that resulted from doubling the number of test takers in the
samples” (Livingston,
2004, page 21). [See Post 13, Chart 34, in this series for the effect of
doubling the number of test takers on the SD and SEM.]
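Livingston's smoothing methods are more sophisticated than anything shown here, but the basic idea, replacing a jagged small-sample frequency distribution with a smoother one before equating, can be illustrated with a simple three-point moving average (my stand-in for illustration, not Livingston's method or my smoothing by inspection):

```python
def smooth(freqs):
    """Replace each frequency count with the average of itself and
    its immediate neighbors (edge groups average the two values
    available)."""
    out = []
    for i in range(len(freqs)):
        window = freqs[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

# A jagged small-sample distribution: counts per score group.
raw = [1, 0, 4, 2, 7, 3, 8, 2, 5, 0, 1]
smoothed = smooth(raw)
# The sawtooth pattern is damped; adjacent groups now differ less,
# so the percentile ranks computed from them are more stable.
```

The payoff is in the percentile ranks: with fewer spurious spikes, the equated scores bounce around less from one small sample to the next.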
I then entered the values from Charts 54, 55, and 57 into my
visual education statistics engine (VESE). Equipercentile equating the student
scores transformed the new test into the reference test including the related
group statistics (Chart 58).
The three 5% groupings show almost identical values. Grouping
reduced the item discrimination ability (PBR) of the reference test a small
amount, as grouping reduced the range of the student score distribution. This
works very nicely in a perfect world; however, real test scores do not align
perfectly with the normal curve.
A much more detailed description of equipercentile equating
and smoothing is found in Livingston
(2004, pages 17–24). The easy-to-follow illustrated examples include real test results and
related problems, with a troubling resolution: “Often the choice of an equating
method comes down to a question of what is believable, given what we know about
the test and the population of test takers.”
This highly subjective statement was acceptable in 2004.
NCLB put pressure on psychometricians to do better. The CCSS movement has raised
the bar again. The subjectivity expressed here is, IMHO, similar to that in
Rasch model IRT analysis, which has been popular with state departments
of education. Both methods, with and without IRT, base results on a
relationship to an unknowable “population of test takers”. Both methods pursue
manipulations that end up with the results “looking right”.
[The classroom equivalent of this, practiced in Missouri
prior to NCLB, was to divide the normal curve into parts for letter grades. One
version assigned grades to ranked student scores in uniform slices. True
believers assigned a double portion to “C”. Every class was then a “normal”
class, with no way to know what the raw scores were or what students actually
knew or could do.]
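That pre-NCLB practice can be made concrete with a short sketch. The slice boundaries here are hypothetical; the 1:1:2:1:1 split gives the “double portion” to “C”:

```python
def curve_grades(scores):
    """Assign letter grades purely by rank: slice the ranked class
    into A:B:C:D:F = 1:1:2:1:1, so every class comes out "normal"
    no matter what the raw scores were."""
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    cuts = [n * 1 // 6, n * 2 // 6, n * 4 // 6, n * 5 // 6]
    grades = [""] * n
    for pos, i in enumerate(ranked):
        if pos < cuts[0]:
            grades[i] = "A"
        elif pos < cuts[1]:
            grades[i] = "B"
        elif pos < cuts[2]:
            grades[i] = "C"
        elif pos < cuts[3]:
            grades[i] = "D"
        else:
            grades[i] = "F"
    return grades

# Any 12 raw scores, high or low, yield the same grade counts:
# 2 A's, 2 B's, 4 C's, 2 D's, 2 F's.
grades = curve_grades([95, 93, 91, 90, 88, 87, 85, 84, 82, 80, 79, 77])
```

Note that a class of raw scores in the 90s and a class in the 20s produce identical grade distributions, which is exactly the information loss the bracketed note complains about.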
It does not have to be that way. Let students report what
they actually know and can do. Let them report what they trust will be of value
for further learning and for application in situations other than the one in which they
learned. Do multiple-choice right. Get results comparable to essay, project,
report, and research. Promote student development. Knowledge and Judgment Scoring and partial credit Rasch model
analysis do this. Guessing is no longer needed. Forced guessing should not be
tolerated, IMHO.
The move to performance-based learning may, this time, not
only compete with the CCSS movement assessments but replace them. The system
that is leanest, most versatile in meeting student needs, and immune
to erratic federal funding, and thus most effective, will survive.

Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):