The tools psychometricians favor are most sensitive when a
question divides the class into two equal groups of right and wrong. This
situation only exists when scoring traditional multiple-choice (TMC) at one point
in a normal score distribution: at an item difficulty of 50%.
The invention of item response theory (IRT) made
it possible to extend this situation (half right and half wrong) to the full
range of item difficulties. IRT also allows expressing item difficulty and
student ability on the same logit (log odds) scale.
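For readers who want the arithmetic, here is a minimal sketch of the one-parameter (Rasch) IRT model; the function name and the numbers are mine, for illustration only. When ability and difficulty are equal on the logit scale, the predicted chance of a right answer is exactly 1/2.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter IRT) model: chance of a right answer when
    ability and difficulty sit on the same logit (log odds) scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(p_correct(1.0, 1.0))  # ability equals difficulty -> 0.5
print(p_correct(1.0, 0.0))  # item one logit easier -> about 0.73
```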
IRT-calibrated items make computer adaptive testing (CAT)
possible. Items are grouped by estimated difficulty so that each item matches
the estimated ability of a student who would make a right response 1/2 of the time.
Typically, students must select one of the given options.
Omit, or “I have yet to learn this”, is not offered as an option. The failure to
include student judgment is a legacy from TMC (see previous posts).
Traditional CAT is therefore limited to ranking examinees.
It is a very efficient way to determine if a student meets expectations based
on a group of similar students. It is the solitary academic version of Family
Feud.
The game is simple. Answer the first question. If right, you
will be given a slightly more difficult question. If wrong, you will be given a
slightly less difficult question.
If you are consistently right, you finish the test with a
minimum of questions. The same can be said for being consistently wrong.
In between, the computer seeks a level of question that you
get right half of the time. If an adequate number of selections fall within an
acceptable range, you pass, and the test ends. Otherwise the test continues
until a time limit or item count is reached and you fail.
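A toy sketch of this up-down game follows, with made-up step sizes and a made-up stopping rule; real CAT engines pick items by maximum information and stop on a standard-error criterion, but the homing behavior is the same.

```python
import math
import random

def p_correct(ability, difficulty):
    # Same Rasch probability as in the sketch above.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def toy_cat(true_ability, n_items=30, step=0.5):
    """Raise item difficulty after a right answer, lower it after a
    wrong one, so the test homes in on the 1/2-right level."""
    difficulty = 0.0  # start with an average-difficulty item
    for _ in range(n_items):
        right = random.random() < p_correct(true_ability, difficulty)
        difficulty += step if right else -step
    return difficulty  # hovers near the examinee's ability

random.seed(1)
print(toy_cat(1.2))  # a rough estimate of the 1.2 logit ability
```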
If doing paper tests for NCLB was considered the biggest
bully in the school, CAT increases the pressure. You must answer each question as
it is presented.
You are not permitted to report what you know. You are only
given items that you can mark right about 1/2 of the time. You are in a world
far different from a normal classroom. It is more like competing in the
Olympics.
You are now CAT food. Originality, innovation, and
creativity are not to be found here. Your goal is to feed the CAT the answer
your peer group selected for you as the right answer 1/2 of the time (that is
right, they did not know the right answer 1/2 of the time either).
Playing the game at the 1/2 right level is not reporting
what you trust you know or can do. It is playing with rules set up to maximize
the desired results of psychometricians. Your evaluation of what you know does
not count.
Your performance on the test is not an indication of what
you trust you know and can do, but it is generally marketed as such. This is
not a unique regulatory situation.
Sheila Bair, Chairman of the Federal Deposit Insurance
Corporation, 2006-2011, described the situation in NCLB in terms of bank
regulators, “They confuse their public policy obligations with whether the bank
is healthy and making money or not.” (Charlie Rose, Wed 10/31/2012 11:00pm,
Public Broadcasting System)
Psychometricians confuse their public obligation to measure
what students know and can do with their concern for item discrimination and
test reliability. This has perpetuated TMC, OMC, and CAT using forced-choice
tests. The emphasis has been on test performance rather than on student
performance.
[Local and state school administrators further modify the
test scores to produce an even more favorable end result, expressed as percent
improvement and percent increase by level of performance, and at the same time
they suppress the actual test scores. Just like big bankers gambling with
derivatives!]
IRT bases item calibration on a set of student raw scores.
Items are then selected to produce an operational test of expected performance
from which expected student scores can be mapped. These expectations generally
fail. Corrections are then needed to equate the average difficulty of tests
from one year to the next.
The Nebraska and Alaska data show that the exact location of
individual student ability is also quite blurred. An attempt to extract individual
growth (2008)
therefore understandably failed on a paper test, but showed promise using CAT.
CAT is now (2010)
being promoted as a better way than paper tests to assess individual
growth far from the passing cut score. [Psychometricians have traditionally
worked with group averages, not with individuals.]
Forced-choice CAT, at the 1/2-right difficulty level, is the
most vicious form of naked multiple-choice. Knowledge Factor uses an even higher standard,
but clothes its items in an effective instructional system. All of its items also
assess student judgment.
The claims that CAT can pinpoint exactly what a student
knows and does not know are clearly false. CAT can only rank a student with respect
to a presumably comparable group.
To actually know what a student knows or can do requires that
you devise a way for the student to tell you. There is a proliferation of ways
to do this, most of which require subjective scoring. Most are
compatible with the classroom.
My favorite method was to visit with (listen to) a student answering
questions on a terminal. It is only when fully engaged students share their
thinking that you can observe and understand their complete performance. This
practice may soon be computerized and even made interactive given the current
development of voice recognition.
Judgment multiple-choice (JMC) allows an entire class of students to tell you
what they trust they know and can do without subjective scoring. JMC can be
added to CAT. This would produce a standardized, accurate, honest, and fair
test compatible with the classroom.
Please encourage Nebraska to allow students to report what they trust they know
and what they trust they have yet to learn. Blog. Petition. We need to foster
innovation wherever it may take hold.