The history of accountability, from the school up to the
state department of education, has been quite varied. Only after running
this preposterous natural experiment for ten years is it being challenged in
ways that may be effective in either bringing it to an end or in correcting its
excesses. Congress created this absurd monstrosity by setting an impossible
goal for all students to meet. It then reneged on its oversight responsibility
to act in good faith to avoid many of the unintended consequences that occurred
(it is now five years behind the time it should have acted on needed changes).
Self-regulation is a lofty idea. It has failed miserably for
mortgages, for Wall Street derivatives, and in the futures market (all of which
were presumably being regulated). The same can be expected in state departments
of education that must come up with acceptable numbers to obtain federal
funding. The two consortia (PARCC and SBAC) promoting the Common Core State Standards
promise to serve as checks on one another. This is an expensive and
ambitious political solution that may have its own downside, depending on
implementation. If all states release actual student test scores, there
will be a way to determine how creative states are in setting passing
rates.
The passing rate has been politically exploited in several states.
New York is the prime example. Diane Ravitch posted on 21 February 2012, “Whence
came this belief in the unerring, scientific objectivity of the tests? Only 18
months ago, New York tossed out its state test scores because the scores were
unreliable. Someone in the state education department decided to lower the cut
scores to artificially increase the number of students who reach proficient. No
one was ever held responsible.”
Michael Winerip posted on 10 June 2012, “Though this may be
the worst breakdown in 15 years of state testing, it does not appear that
Florida politicians have any interest in figuring out who was responsible. The
commissioner? Department officials? Someone at Pearson, the company that scored
the writing tests?” Winerip reports further that “The audit referred to
lowering the passing score to 3 as ‘equipercentile equating’.” That is, the
score was lowered until the same portion of students passed this year as passed
last year. [As I am writing this, the commissioner resigned.]
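The mechanics the audit describes can be sketched in a few lines. This is a minimal illustration, not Pearson's actual procedure, and the score distributions below are entirely hypothetical:

```python
# Sketch of "equipercentile equating" as described in the audit:
# lower this year's cut score until the same proportion of students
# passes as passed last year. All data below are hypothetical.

def equate_cut_score(last_year_scores, this_year_scores, last_year_cut):
    """Find the cut score that reproduces last year's passing rate."""
    target_rate = (sum(s >= last_year_cut for s in last_year_scores)
                   / len(last_year_scores))
    # Walk cut scores from high to low until the pass rate catches up.
    for cut in sorted(set(this_year_scores), reverse=True):
        rate = sum(s >= cut for s in this_year_scores) / len(this_year_scores)
        if rate >= target_rate:
            return cut, rate
    return min(this_year_scores), 1.0

last_year = [2, 3, 3, 4, 4, 5, 5, 6]   # cut of 4: 5 of 8 students passed
this_year = [1, 2, 2, 3, 3, 4, 4, 5]   # harder test: cut of 4 passes only 3 of 8
cut, rate = equate_cut_score(last_year, this_year, last_year_cut=4)
print(cut, rate)  # the cut drops to 3 to restore last year's pass rate
```

Notice that nothing in the procedure asks what a score of 3 means; it only asks what proportion of students it lets through.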
As is the case with mortgages, derivatives, and futures, it
is difficult, in most cases, to say whether a crime was committed or merely
very poor judgment was exercised until the chain of events is carefully studied,
including the false “belief in unerring” test scores. My own explanation here
is that research results and application results are not the same thing. In
research you predict acceptable results. In application you examine the results
for meaningful, useful relationships (equipercentile equating to obtain the
desired pass rate; any relationship between a ranking on the test and what
students actually know or can do is mostly coincidental near the cut score).
Has a crime been committed? In the case of New York, and
other states that must now “explain” why student test scores are dropping on
new tests, I would say, “Yes.” For states that choreographed an almost
perfect, ever-slowing increase in the pass rate over the past 8 to 10 years,
the answer is problematic. It can range from outright cheating to self-deception;
from equipercentile equating to selecting test items that produce the desired
results, the standard practice for classroom test score management.
On 18 May 2012, Valerie Strauss posted the white paper
released by the Central Florida School Board Coalition. This lengthy paper
details the unintended consequences and the downright sloppy test items used on
their standardized tests. My own software, Power Up Plus (PUP), can pick out such
items when run on a notebook computer in a matter of minutes. I am amazed
that such items are used, considering the millions of dollars spent on
developing and administering these tests. I strongly suspect their
development process is flawed.
Cory Doctorow posts, “The Test Item Specifications are the
guidelines that are used to write the test questions. If the Science FCAT test
is reviewed by the same Content Advisory Committee that reviewed the Test Item
Specifications, then it probably has similar errors.” From my experience, a valid
test item must assess exactly what it says (concrete level of thinking: what
you see is what you get) or be an indicator of knowledge or skill of things in
the same class (1 + 3 = 4 to assess addition of integers). Questions whose right
answers vary with level of thinking, socio-economic status, state, religion, politics,
ethnicity, or current political correctness are not to be used. That is, stick to the
topic or skill, not to what the topic (agenda) or skill may be used for. Where a
question is on topic but has answers that differ for the above reasons, it should
stand; this is part of the broadening effect of education. A recent example is the
Missouri constitutional amendment, voted on yesterday, to protect the religious
rights of school children.
In the case of Florida, once again, faulty predictions were
made based on some type of research. The entire system (instruction, learning,
and assessment) was not fully understood or coordinated, with disastrous
results. And again, another state education official has resigned. Was this a
crime or just a waste of millions of dollars and millions of instructional and
learning hours?
On 24 April 2012, Valerie Strauss posted the National Resolution Protesting
High-Stakes Standardized Testing, which is based on the Texas Resolution Concerning
High Stakes, Standardized Testing of Texas Public School Students. These two resolutions,
combined with the Florida white paper, make a strong political protest that may
take years to obtain results. The desired results are not specifically stated.
We are back to the days of “alternative assessment”: do something different. But
doing something different must be at least as good as what we have, or we lose
again, as happened with the authentic assessment and portfolio movements.
As well intentioned as all the people working on this
assessment problem are, most are still riding their safe, steady tricycle: the
traditional forced-choice multiple-choice test, scored at the lowest levels of
thinking, that they were exposed to long before it was used for NCLB testing.
It actually worked fairly well back then, when the average test score was 75% or
higher. It has failed miserably when NCLB cut scores dropped below
50% (a region where luck of the day determines the rank of pass or fail). Only
when these people are willing to get off their old tricycles will they have
any interest in getting on a bicycle (where students can actually report what
they know and can do).
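A short calculation shows why a low cut score hands the decision to luck. The sketch below assumes a hypothetical 40-item, 4-option forced-choice test where every unknown item is guessed at chance (p = 0.25); the specific numbers are illustrative only:

```python
# How much does guessing decide pass/fail on a forced-choice test?
# Hypothetical model: a student answers n_known items from knowledge
# and guesses the rest with probability 0.25 per item (4 options).
from math import comb

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

def pass_probability(n_items, n_known, cut_score, p_guess=0.25):
    """Chance of reaching the cut when the rest is pure guessing."""
    if n_known >= cut_score:
        return 1.0
    needed = cut_score - n_known   # right answers that must come from luck
    return binom_tail(n_items - n_known, p_guess, needed)

# Hypothetical 40-item test with the cut score at 16 (40%).
print(pass_probability(40, 8, 16))   # knows only 20%: close to a coin flip
print(pass_probability(40, 20, 16))  # knows 50%: passes with certainty
```

Near the cut score, two students with identical knowledge can land on opposite sides of pass/fail on different days, which is exactly the "luck of the day" problem above.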
Knowledge and Judgment Scoring allows students to
individualize standardized tests: to select the level of thinking they will use,
to guess at right answers, or to use the questions to accurately report what they
actually know and can do. We only need to change the test instructions from
“Mark an answer on each question, even if you must guess” to “Mark an answer
only if you can use the question to report something you trust you know or can
do.” Change the scoring from “two points for each right mark and zero for each
wrong mark” to “zero for each wrong mark, one point for each omit (good judgment
not to guess and make a wrong mark), and two points for each right mark (good
judgment and a right answer).”
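The scoring rule above is simple enough to write down directly. Here is a minimal sketch (the answer key and responses are hypothetical, not output from PUP):

```python
# Knowledge and Judgment Scoring as described above:
# 0 for a wrong mark, 1 for an omit, 2 for a right mark.

def kjs_score(key, responses):
    """Return (score, max_score). `None` in responses means the
    student chose to omit rather than guess."""
    score = 0
    for correct, marked in zip(key, responses):
        if marked is None:
            score += 1          # good judgment: no guess, no wrong mark
        elif marked == correct:
            score += 2          # good judgment and a right answer
        # a wrong mark earns nothing
    return score, 2 * len(key)

key       = ["A", "C", "B", "D", "A"]
responses = ["A", None, "B", "B", None]   # two right, one wrong, two omits
print(kjs_score(key, responses))  # (6, 10)
```

Under right-mark counting the same student would hold 4 of 10 points; here the two honest omits are worth as much as one more right answer.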
We can now give students the same freedom given with an essay,
project, or report to tell us what they actually know and can do. We also have
the option to commend students, as on other alternative tests: “You did a great
job on the questions you marked. Your quality score of 90% is outstanding.”
This quality score is independent of the quantity score. You can now honestly
encourage traditionally low scoring students for what they can do rather than
berate them for what they cannot do (or for their bad luck on traditional
forced-choice tests).
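The independence of the two scores is easy to demonstrate. In this hedged sketch (hypothetical responses, not the exact report PUP prints), quantity is right marks over all items, and quality is right marks over the items the student chose to mark:

```python
# Quantity vs. quality under Knowledge and Judgment Scoring:
# quantity = right marks / all items
# quality  = right marks / items the student chose to mark

def quantity_and_quality(key, responses):
    marked = [(c, m) for c, m in zip(key, responses) if m is not None]
    right = sum(c == m for c, m in marked)
    quantity = right / len(key)
    quality = right / len(marked) if marked else 0.0
    return quantity, quality

key       = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
responses = ["A", None, "B", None, "A", None, "C", None, "A", None]
qty, qual = quantity_and_quality(key, responses)
print(qty, qual)  # quantity 0.5, quality 1.0
```

This student marked only half the test but was right on every mark: a low quantity score paired with a perfect quality score, which is something a right-count-only test can never show.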
The NCLB monster may be controlled by legal action (see
prior post), by political action, or by simply changing its temper from a
frightening bully to an almost friendly companion. Just select your breed:
Knowledge and Judgment Scoring in PUP, the Partial Credit Model in
Winsteps, or Amplifier by Knowledge Factor.
Why continue just counting right marks, making unprepared
liars out of lucky winners and misclassifying unlucky losers for remediation?
We know better now. It no longer has to be that way. “Whence came this belief
in the unerring, scientific objectivity of the [dumb, forced-response, guess]
tests?” We need to measure what is important. We do not need to make an easy
measurement (forced multiple-choice and essay at the lowest levels of thinking)
and then try to make the results important (at higher levels of thinking).
There is a difference in student performance between riding a tricycle and a
bicycle. We cannot hold students responsible for bicycling if they only practice and test on tricycles.