Frequency Estimation Equating involves conditioning on the anchor, a set of common items. This post reports my adventures in figuring out how this is done; I needed it to complete the next post on the conditional standard error of measurement (CSEM).

I then followed the instructions in Livingston (2004, pp. 49-51). The values in Table 20 were tabulated to produce “a row for each possible [student] score” and “a column for each possible score on the anchor [common item]” (Table 21). The tally is turned into frequencies conditioned on the common-item scores by dividing each cell in a column by that column's total (the number of students at that common-item score). The conditional frequencies in each column then sum to 1.00.
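
To make the conditioning step concrete, here is a minimal Python sketch using a small, made-up joint tally (the counts are illustrative, not the Table 20/21 values): each column of the tally is divided by its column total, so the conditional frequencies at each anchor score sum to 1.00.

```python
import numpy as np

# Made-up joint tally: rows are possible student (total) scores,
# columns are possible scores on the anchor (common items).
joint_tally = np.array([
    [4, 1, 0],   # total score 0
    [3, 5, 1],   # total score 1
    [1, 6, 4],   # total score 2
    [0, 2, 3],   # total score 3
])

# Divide each cell by its column total so each column becomes a
# relative-frequency distribution conditioned on that anchor score.
column_totals = joint_tally.sum(axis=0)
conditional_freq = joint_tally / column_totals

print(conditional_freq)
print(conditional_freq.sum(axis=0))   # each column sums to 1.00
```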

[This operation can be worked backward (in part) to yield the right mark tally. Dividing the population proportions by the number of items in the sample yields the right mark frequencies. Multiplying the right mark frequencies by the difficulty yields the right mark tally. But there is no way to back up from the estimated population distribution to this set of population proportions, let alone to individual student marks. The right mark tally is a property of the observed sample and of individual student marks. The estimated population distribution is a property of the unknowable population distribution, which is related to the normal curve. That unknowable population distribution can spawn endless sets of population proportions. Monte Carlo psychometric experiments can be kept free of the many factors that affect classroom and standardized test results.]
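
Read literally, the two arithmetic sentences in the bracketed note can be transcribed as below. The proportions, item count, and difficulties are invented for illustration, and the variable names follow the post's “right mark” terminology; this is a sketch of the stated arithmetic, not of any standard equating formula.

```python
import numpy as np

# Invented values, for illustration only.
population_proportions = np.array([0.10, 0.25, 0.40, 0.25])
n_items = 20                                      # assumed number of items in the sample (test)
difficulty = np.array([0.45, 0.60, 0.72, 0.85])   # assumed difficulties

# Dividing the population proportions by the number of items
# yields the right mark frequencies (per the post's wording).
right_mark_frequencies = population_proportions / n_items

# Multiplying the right mark frequencies by the difficulty
# yields the right mark tally (again per the post's wording).
right_mark_tally = right_mark_frequencies * difficulty
print(right_mark_tally)
```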

“And when we have estimated the score distributions on both
the new form and the reference form, we can use those estimated distributions
to do an equipercentile equating, as if we had actually observed the score
distributions in the target population.” I carried this out, as in the previous
post, with nothing of importance to report.
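
As a rough illustration of that equipercentile step, here is a Python sketch with two invented estimated score distributions (not the values from this post or from Livingston). A production equating would continuize the distributions and work with percentile ranks at score midpoints, but the idea is the same: match cumulative proportions across the two forms.

```python
import numpy as np

scores = np.arange(6)  # possible scores 0..5

# Invented estimated relative-frequency distributions for the two forms.
new_form_freq = np.array([0.05, 0.15, 0.30, 0.25, 0.15, 0.10])
ref_form_freq = np.array([0.10, 0.20, 0.25, 0.20, 0.15, 0.10])

# Cumulative proportions give each score's standing on its own form.
new_cdf = np.cumsum(new_form_freq)
ref_cdf = np.cumsum(ref_form_freq)

# For each new-form score, find the reference-form score at the same
# cumulative proportion (linear interpolation between score points).
equated = np.interp(new_cdf, ref_cdf, scores)
print(dict(zip(scores, np.round(equated, 2))))
```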

So far in this series I have found that data reduction from student marks to a finished product is independent of the content actually on the test. The practice of using several methods and then picking the one that “looks right” has been promoted. Here an unknown population distribution is created from observed sample results. Here we are also given the choice of selecting Test A or Test B, or of combining the results. As the years pass, it appears that more subjectivity is tolerated in getting test results that “look right” when using traditional, non-IRT, multiple-choice scoring. This charge was formerly directed at Rasch model IRT analysis.

It does not have to be that way. Knowledge and Judgment Scoring and partial credit Rasch model IRT allow a student to report what is actually meaningful, useful, and empowering in learning and applying what has been learned. This property of multiple-choice is little appreciated.

What traditional multiple-choice is delivering is also little understood (psychometricians guessing to what extent sample [actual test] results match an unknowable standard population distribution, based on student marks that include forced guessing on test items the test creators are guessing students will find equally difficult, items that, based on a field test, they guess will represent the current test takers, on average).

We still see people writing, “I thought this test was to tell us what [individual] students know.” Yet traditional, forced-choice, multiple-choice can only rank students by their performance on the test. It does not ask them, or permit them, to individually report what they actually know or can do based on their own self-judgment: they must mark every item (a missing mark is still considered more degrading to an assessment than failing to assess student judgment).
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):