The past few posts have shown that if two tests have the
same student score standard deviation (SD) they are easy to combine or link.
Both tests will have the same student score distribution on the same scale.
Equating is then a process of finding the difference between
the average test scores and applying this value to one of the two sets of test
scores. Add the difference in average test score to the lower set of scores, or
subtract it from the higher set to combine the two sets of test scores.
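A minimal sketch of this shift, assuming two made-up score sets with the same spread (the function name `mean_equate` and the scores are hypothetical, for illustration only):

```python
from statistics import mean

def mean_equate(new_scores, reference_scores):
    """Shift new_scores so their average matches the reference average."""
    shift = mean(reference_scores) - mean(new_scores)
    return [score + shift for score in new_scores]

# Made-up scores: same spread, averages of 30 and 20
reference = [22, 26, 30, 34, 38]
new = [12, 16, 20, 24, 28]

equated = mean_equate(new, reference)
print(equated)
```

The shift only moves the distribution; it leaves the SD untouched, which is why this shortcut is limited to tests whose SDs are already close.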
This can be done whenever the SDs are within acceptable
limits (considering all factors that may affect the test results, the expected
results, and the intended use of the results). This is, IMHO, a very subjective
judgment call to be made by the most experienced person available.
There are two other situations: the average test scores are the same but
the SDs differ beyond acceptable limits, and both the average test scores and
the SDs differ beyond acceptable limits. In both cases we
need to equate the two different SDs, that is, the two different distributions of
student scores.
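One common way to equate both the average and the SD is linear equating: express each new score as a distance from its own average in SD units, then place it the same number of SD units from the reference average. The sketch below assumes that technique (it is not necessarily the method behind the charts in this post), and the scores are made up:

```python
from statistics import mean, pstdev

def linear_equate(new_scores, reference_scores):
    """Map new_scores onto the reference scale by matching mean and SD."""
    m_new, s_new = mean(new_scores), pstdev(new_scores)
    m_ref, s_ref = mean(reference_scores), pstdev(reference_scores)
    if s_new == 0:
        raise ValueError("new_scores have no spread; the slope is undefined")
    slope = s_ref / s_new          # ratio of the two SDs
    return [m_ref + slope * (x - m_new) for x in new_scores]

# Made-up scores: the reference test has the higher average and the larger SD
new = [16, 18, 20, 22, 24]
reference = [24, 27, 30, 33, 36]

equated = linear_equate(new, reference)
print([round(x, 2) for x in equated])
```

When the two SDs are equal the slope is 1 and this reduces to the simple shift in averages described earlier.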

There is then a bias to raise the low SD test to the high SD
test. “The test this year was more difficult than the test last year,” was the NCLB
explanation from Texas, Arkansas, and New York. [It was not that the students
this year were less prepared.]

Mapping a new difficult test to an easier reference test
with the same SD increases the values on the equating line as well as
truncating it. Any new test scores over 30 on Chart 52 have no place to be
plotted on the reference test scale.
The equating with an increase in both SD and average test
score expands the distribution and truncates the equating line even more (Chart
52). A comparison of the two above situations as parallel lines (Chart 53)
helps to clarify the differences.
Both map the new difficult test's average score
of 20 counts to 30 counts on the reference scale. In this simple example
based on a normal distribution, the remaining values increase in a uniform
manner in equal units of 10 with the same SD and 15 when mapping to the larger SD.
The significance of this is that in the real world, test
scores are not distributed in nice ideal normal distributions. The equating
line can assume many shapes and slopes.
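The two straight-line cases above can be worked through in a few lines. This is a sketch under stated assumptions: the averages of 20 and 30 come from the text, but the slopes (1.0 for the same SD, 1.5 for the larger SD, read from "units of 10" versus "units of 15") and the reference-scale ceiling of 40 are my inferences, not values given in the post.

```python
NEW_MEAN = 20      # average score on the new, more difficult test (from the text)
REF_MEAN = 30      # where that average lands on the reference scale (from the text)
REF_CEILING = 40   # ASSUMPTION: top of the reference scale, inferred, not stated

def equating_line(new_score, slope):
    """Reference-scale value for a new-test score on a straight equating line."""
    return REF_MEAN + slope * (new_score - NEW_MEAN)

def highest_plottable(slope):
    """Largest new-test score that still fits under the reference ceiling."""
    return NEW_MEAN + (REF_CEILING - REF_MEAN) / slope

# ASSUMPTION: slope 1.0 for the same SD, 1.5 for the larger SD
for label, slope in [("same SD", 1.0), ("larger SD", 1.5)]:
    mapped = [equating_line(x, slope) for x in (10, 20, 30)]
    print(label, mapped, "fits up to", round(highest_plottable(slope), 1))
```

Under these assumptions, new scores of 10, 20, and 30 land at 20, 30, and 40 with the same SD and at 15, 30, and 45 with the larger SD. The line runs out of reference scale at a new score of 30 in the first case and at about 26.7 in the second, which is the extra truncation described above.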
The unit of measure needed to plot an equating chart must
include equivalent portions of the two distributions. Percentage is a
convenient unit: equipercentile equating. [More on this in the next post.]
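As a rough preview, here is a bare-bones sketch of the idea: find what share of the new-test group scored at or below a given score, then report the reference-test score that the same share of the reference group scored at or below. The function name and the scores are hypothetical, and real equipercentile equating normally adds interpolation and smoothing that this sketch leaves out.

```python
from bisect import bisect_right

def equipercentile_equate(score, new_scores, reference_scores):
    """Return the reference score at the same percentile rank as `score`."""
    new_sorted = sorted(new_scores)
    ref_sorted = sorted(reference_scores)
    n_new, n_ref = len(new_sorted), len(ref_sorted)

    # How many new-test scores are at or below this score
    at_or_below = bisect_right(new_sorted, score)

    # The same proportion of the reference group, rounded up (integer math)
    ref_count = (at_or_below * n_ref + n_new - 1) // n_new

    # Clamp so that scores below or above every observed score still map
    index = min(max(ref_count, 1), n_ref) - 1
    return ref_sorted[index]

# Made-up score sets of different sizes and shapes
new = [10, 12, 15, 20, 28]
reference = [18, 25, 30, 33, 35, 38, 40, 41, 44, 50]

print(equipercentile_equate(15, new, reference))
```

Because the mapping follows the observed distributions rather than a straight line, the resulting equating line can bend wherever the two distributions differ in shape.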
Whether Test A is the reference test, or Test B is the
reference test, or both are combined as one analysis is the difficult subjective
call of the psychometrician. So much depends on the luck on test day related to
the test blueprint, the item writers, the reviewers, the field test results,
the test maker, the test takers, and many minor effects on each of these.
This is little different from predicting the weather or the stock
market, IMHO. [The highest final test scores at the Annapolis Naval Academy
were during a storm with very high negative air ion concentrations.] The above factors
also need to include the long list of excuses built into institutionalized
education at all levels.
On a four-option item, chance alone injects an average 25%
value (that can easily range from 15 to 35%) when students are forced to mark
every item on a traditional multiple-choice (TMC) test. Quality is suppressed
into quantity by only counting right marks: quality and quantity are therefore
linked into the same value. High TMC test scores have higher quality than lower
test scores, but this is generally ignored.
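The size of that chance component can be estimated with the binomial distribution. The sketch below assumes pure blind guessing on every item and a hypothetical 75-item test (the test length is my choice, not a figure from this post); it reports the expected chance score and a band of two standard deviations around it.

```python
from math import sqrt

def chance_score_range(n_items, n_options=4, n_sd=2):
    """Expected blind-guessing score and a +/- n_sd band, as proportions."""
    p = 1 / n_options                      # chance of a right mark per item
    sd = sqrt(p * (1 - p) / n_items)       # SD of the proportion right
    return p - n_sd * sd, p, p + n_sd * sd

# ASSUMPTION: a 75-item, four-option test, chosen only for illustration
low, expected, high = chance_score_range(75)
print(f"{low:.0%} to {high:.0%}, centered on {expected:.0%}")
```

Shorter tests widen the band and longer tests narrow it, so the spread of luck depends heavily on test length.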
It does not have to be that way. Both the partial credit Rasch model IRT
and Knowledge and Judgment Scoring
permit students to report what they trust they know and can do and what they
have yet to learn accurately, honestly and fairly. No guessing is required. Both
paper tests and CAT tests can accept, “I trust I know or can do this,” “I have
yet to learn this,” and if good judgment does not prevail, “Sorry, I goofed.” Just score 2, 1, and 0 rather than 1
for each right mark (for whatever reason or accident).
A test should encourage learning. The TMC at the lower
scores is punitive. By scoring for both quantity and quality (knowledge and
judgment) students receive separate scores, just as is done on most other
assessments. “You did very well on what you reported (90% right), but you need
to do more to keep up with the class,” rather than “You failed again with a TMC
score of 50%.”
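A minimal sketch of how the two separate scores could be computed, assuming 2 points for a right mark, 1 point for an item left as "I have yet to learn this," and 0 points for a wrong mark. The function name, the response labels, and the 20-item example are hypothetical and are not taken from the KJS software.

```python
def kjs_scores(responses):
    """Return (test score %, quality %) for a list of item responses.

    Each response is 'right', 'omit' (yet to learn), or 'wrong'.
    """
    points = {"right": 2, "omit": 1, "wrong": 0}
    total = sum(points[r] for r in responses)
    test_percent = 100 * total / (2 * len(responses))

    # Quality: percent right among only the items the student chose to mark
    marked = [r for r in responses if r != "omit"]
    quality_percent = 100 * marked.count("right") / len(marked) if marked else None
    return test_percent, quality_percent

# Made-up 20-item test: 9 right marks, 1 wrong mark, 10 items left to learn
responses = ["right"] * 9 + ["wrong"] + ["omit"] * 10
result = kjs_scores(responses)
print(result)
```

In this made-up case the student is 90% right on what was reported, and the test score separately shows how much is still left to learn.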
Classroom practice during the NCLB era tragically followed
the style of the TMC standardized tests conducted at the lowest levels of
thinking. The CCSS tests need to model rewarding students for their judgment as
well as right marks. [We can expect the schools to again doggedly try to imitate.]
It is student judgment that forms the basis for further learning at higher
levels of thinking: one of the main goals of the CCSS movement. The CCSS
movement needs to update its use of multiple-choice to be consistent with its
Equating TMC meaninglessness does not improve the results.
This crippled form of multiple-choice does not permit students to tell us what
they really know and can do that is of value for further learning and
- - - - - - - - - - - - - - - - - - - -
Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):