Frequency estimation equating involves conditioning on the
anchor, a set of common items. This post reports my adventures in figuring
out how this is done, as I needed to know it to complete the next
post on the conditional standard error of measurement (CSEM).
Two 24-student by 15-item tests, A and B, were drawn from the
Nursing124 data (Table 20). Each included a set of 6 common items that were
marked the same in both tests. Student scores varied between Test
A and Test B based on their marks on the other, non-common, items. The common
items were sorted by their difficulty.
I then followed the instructions in Livingston
(2004, pp. 49-51). The values in Table 20 were tabulated to produce “a row for
each possible [student] score” and “a column for each possible score on the
anchor [common items]” (Table 21). The tally is turned into frequencies
conditioned on the common-item scores by dividing each cell in a column by
that column's total (the number of students at that anchor score). The
frequencies in each column then sum to 1.00.
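The conditioning step can be sketched in a few lines of Python. The tally below is hypothetical (not the Nursing124 data); the point is only that dividing each cell by its column total makes every anchor-score column sum to 1.00.

```python
# Sketch of conditioning a score tally on anchor (common-item) scores.
# Hypothetical data: rows are total test scores, columns are anchor
# scores, cells are counts of students with that combination.
tally = {
    # total_score: {anchor_score: count}
    12: {4: 1, 5: 2},
    13: {5: 3, 6: 1},
    14: {6: 2},
}

# Column totals: the number of students at each anchor score.
col_totals = {}
for row in tally.values():
    for anchor, n in row.items():
        col_totals[anchor] = col_totals.get(anchor, 0) + n

# Divide each cell by its column total, giving frequencies
# conditioned on the anchor score.
cond_freq = {
    score: {a: n / col_totals[a] for a, n in row.items()}
    for score, row in tally.items()
}

# Each anchor-score column now sums to 1.00.
for a in col_totals:
    col_sum = sum(row.get(a, 0) for row in cond_freq.values())
    assert abs(col_sum - 1.0) < 1e-9
```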
Next, the unknown population proportions are obtained by
combining (multiplying) the common-item conditional frequencies with the equal
portion each common item contributed (1/6) to the test (Table 21). These values
now represent the on-average expectations for each cell based on the observed
data. Summing by rows produces the estimated (best-guess) unknown population
student score distribution that could also have produced the on-average
expectations. This was done for both Test A and Test B.
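The weighting and row-summing steps can be sketched the same way. The conditional frequencies below are hypothetical, with three anchor-score columns weighted equally (the post's own tables use 1/6 per common item); summing each weighted row yields the estimated population score distribution, which sums to 1.00.

```python
# Sketch of the weighting step: each conditional-frequency column is
# multiplied by an equal weight, then rows are summed to give the
# estimated population distribution of student scores.
# Hypothetical conditional frequencies (each column sums to 1.00).
cond_freq = {
    12: {4: 1.0, 5: 0.4},
    13: {5: 0.6, 6: 1 / 3},
    14: {6: 2 / 3},
}
weight = 1 / 3  # equal weight per anchor-score column (3 columns here)

# Weight every cell, then sum across each row (student score).
est_dist = {
    score: sum(f * weight for f in row.values())
    for score, row in cond_freq.items()
}

# The estimated score distribution sums to 1.00 over all scores.
assert abs(sum(est_dist.values()) - 1.0) < 1e-9
```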
[This operation can be worked backward (in part) to yield
the right mark tally. Dividing the population proportions by the number of
items in the sample yields the right mark frequencies. Multiplying the right
mark frequencies by the difficulty yields the right mark tally. But there is no
way to back up from the estimated population distribution to this set of
population proportions, let alone to individual student marks. The right mark
tally is a property of the observed sample and of individual student marks. The
estimated population distribution is a property of the unknowable population
distribution, related to the normal curve. The unknowable population
distribution can spawn endless sets of population proportions. Monte Carlo
psychometric experiments can be kept clean of the many factors that affect
classroom and standardized test results.]
Charts 59 and 60 show the effect produced by conditioning on
the common items. This transformation from observed scores to on-average
expectations appears to rotate the distribution about the average test score of
84% and 80% for Test A and Test B, respectively. It produced a detectable
increase in the frequency of high scores and a similar decrease in the
frequency of low scores, which raised the average scores to 86% and 84%,
respectively. Is this an improvement or a distortion?
Livingston (2004) continues: “And when we have estimated the score
distributions on both the new form and the reference form, we can use those
estimated distributions to do an equipercentile equating, as if we had actually
observed the score distributions in the target population.” I carried this out,
as in the previous post, with nothing of importance to report.
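The equipercentile step can be sketched on two hypothetical estimated distributions. This is a coarse, no-interpolation version: each Test A score is mapped to the smallest Test B score whose cumulative proportion reaches Test A's.

```python
# Minimal sketch of equipercentile equating on two hypothetical
# estimated score distributions. A Test A score is mapped to the
# Test B score with (at least) the same cumulative proportion.
dist_a = {10: 0.2, 11: 0.3, 12: 0.5}
dist_b = {10: 0.1, 11: 0.4, 12: 0.5}

def cumulative(dist):
    """Cumulative proportion at each score, in ascending order."""
    total, cum = 0.0, {}
    for score in sorted(dist):
        total += dist[score]
        cum[score] = total
    return cum

cum_a, cum_b = cumulative(dist_a), cumulative(dist_b)

def equate(score_a):
    """Smallest Test B score whose cumulative proportion reaches
    that of the given Test A score (no interpolation)."""
    target = cum_a[score_a]
    for score_b in sorted(cum_b):
        if cum_b[score_b] >= target - 1e-9:
            return score_b
    return max(cum_b)
```

A full treatment would interpolate between discrete score points; this sketch only shows the direction of the mapping.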
So far in this series I have found that data reduction from
student marks to a finished product is independent of the content actually on
the test. The practice of using several methods and then picking the one that
“looks right” has been promoted. Here an unknown population
distribution is created from observed sample results. Here we are also given
the choice of selecting Test A or Test B or combining the results. As the years
pass, it appears that more subjectivity is tolerated in getting test results
that “look right” when using traditional, non-IRT, multiple-choice scoring.
This charge was formerly directed at Rasch model IRT analysis.
It does not have to be that way. Knowledge and Judgment Scoring and the
partial-credit Rasch model IRT allow a student to report what is actually
meaningful, useful, and empowering to learn and to apply what has been learned.
This property of multiple-choice is little appreciated.
What traditional multiple-choice is delivering is also
little understood: psychometricians guessing to what extent sample (actual
test) results match an unknowable standard population distribution, based on
student marks that include forced student guessing on test items the test
creators guess students will find equally difficult, items calibrated on a
field test population they guess will represent the current test takers, on
average.
We still see people writing, “I thought this test was to
tell us what [individual] students know.” Yet traditional, forced-choice,
multiple-choice can only rank students by their performance on the test. It
does not ask them, or permit them, to individually report what they actually
know or can do based on their own self-judgment: just mark every item (a
missing mark is still considered more degrading to an assessment than failing
to assess student judgment).
Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):