A standardized state test is created in the same way as a standardized classroom test (see the prior post), with a few exceptions:
1. The initial questions are field-tested and calibrated.
2. Mastery questions are rejected (a standardized state test is not concerned with what students actually know, because of the next exception).
3. Only discriminating items are selected for the operational test, to make it as powerful as possible with the fewest items for ranking schools, teachers, and students (a sketch of this selection follows below).
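To make exception 3 concrete, here is a minimal sketch of one common way to select discriminating items, using the point-biserial correlation between an item and the rest of the test score. The 0.3 cutoff and the toy data are illustrative assumptions, not a published state criterion.

```python
# A minimal sketch of selecting discriminating items for an operational
# test, assuming 0/1 item responses from a field test. The point-biserial
# correlation between an item and the rest of the total score is a common
# discrimination index; the 0.3 cutoff is an illustrative choice.
import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Correlation between one 0/1 item column and examinee scores."""
    rest = total - item            # exclude the item from its own criterion
    return float(np.corrcoef(item, rest)[0, 1])

def select_items(responses: np.ndarray, cutoff: float = 0.3) -> list[int]:
    """Keep items whose corrected point-biserial exceeds the cutoff."""
    totals = responses.sum(axis=1)
    return [j for j in range(responses.shape[1])
            if point_biserial(responses[:, j], totals) >= cutoff]

# Example: 5 examinees x 4 items of made-up field-test data.
data = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 1],
                 [0, 1, 1, 0],
                 [1, 1, 0, 1],
                 [0, 0, 1, 0]])
print(select_items(data))   # indices of the items that survive
```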
State test results can be parsed by inspecting what happened, at all levels of thinking, in the same way as a classroom test, using scores and portions of scores. However, state test results are usually parsed against standardized expected scores and portions of scores. A set of common (anchor) items is embedded in each form of the test. If the common items perform the same on both forms, the two forms are declared equally difficult. Unfortunately, this practice does not work like pixie dust; the common items sometimes fail. In 2006 Florida saw a marked increase (the highest value ever reported on the Grade 3 FCAT Reading SSS), followed by a marked decrease in 2007.
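For readers unfamiliar with the mechanics, here is a minimal sketch of the core idea of equipercentile equating: a score on one form is mapped to the score on the other form that holds the same percentile rank. The toy score distributions are assumptions; operational programs smooth the distributions and use the common items far more carefully.

```python
# A minimal sketch of equipercentile equating, assuming raw total scores
# from two test forms. A form-Y score is mapped to the form-X score that
# sits at the same percentile rank. Real programs smooth the distributions
# and handle ties more carefully; this is only the core idea.
import numpy as np

def equipercentile_equate(scores_x: np.ndarray, scores_y: np.ndarray,
                          y: float) -> float:
    """Return the form-X score with the same percentile rank as y on Y."""
    pct = (scores_y <= y).mean() * 100          # percentile rank of y on Y
    return float(np.percentile(scores_x, pct))  # form-X score at that rank

# Made-up raw scores on an old form (X) and a new, harder form (Y).
old_form = np.array([12, 15, 18, 20, 22, 25, 27, 30, 33, 36])
new_form = np.array([10, 12, 14, 17, 19, 21, 23, 26, 29, 31])
print(equipercentile_equate(old_form, new_form, 21))  # 21 on Y ~ 25.8 on X
```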
How state test results are reported to the public has therefore evolved from risky raw scores, to percent passing, to gains over the previous year, to fairly safe equipercentile equating. (This creativity carries on into how states inflate their educational progress based on their standardized test results.) A method of transition equipercentile equating was initiated with the Grade 3 FCAT 2.0 Reading (2011) test to help solve the problems created by ranking students on traditional forced-choice, right-count scored tests when introducing a new form of a test (a sketch of the transition step follows the table below). It is rather clever marketing.
· 2010: Last FCAT test, reported in old achievement levels.
· 2011: First FCAT 2.0 test, reported in old achievement levels.
· 2012: First FCAT 2.0 test (the 2011 results) re-reported in new achievement levels.
· 2012: Second FCAT 2.0 test reported in new achievement levels.
FCAT Reading – Sunshine State Standards (percent of students at each achievement level)

| Year     | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 3-5 |
|----------|---------|---------|---------|---------|---------|-----------|
| 2010 old | 16      | 12      | 33      | 31      | 8       | 72        |
| 2011 old | 16      | 12      | 33      | 31      | 8       | 72        |
| 2011 new | 18      | 25      | 23      | 24      | 10      | 57        |
| 2012 new | 18      | 26      | 23      | 22      | 11      | 56        |
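As a rough illustration of the transition step, the sketch below sets cut scores on a new form so that each achievement level holds the same percentage of students as the old reporting did (using the 16/12/33/31/8 split from the 2010 row). The simulated scale scores are an assumption; this is not the Department of Education's actual procedure.

```python
# A minimal sketch of the transition step, assuming the goal is to set
# cut scores on the new form so each achievement level holds the same
# percentage of students as the old form did. Scores are illustrative.
import numpy as np

def transition_cuts(new_scores: np.ndarray,
                    old_level_pcts: list[float]) -> list[float]:
    """Cut scores on the new form that reproduce the old percentages."""
    cum = np.cumsum(old_level_pcts)[:-1]        # cumulative % below each cut
    return [float(np.percentile(new_scores, p)) for p in cum]

rng = np.random.default_rng(0)
new_scores = rng.normal(200, 20, size=10_000)   # stand-in scale scores
print(transition_cuts(new_scores, [16, 12, 33, 31, 8]))
```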
“The scores
are being reported in this way to maintain consistent student expectations
during the transition
year.” There is no delay in publishing results in the transition year of
2011; just parse the 2011 scores with the 2010 achievement levels. The
Department of Education then has one year to get all interested parties
together to create the new achievement levels.
“Although the linking process does not change the statewide results for this year [2011], it does provide different results for districts, schools, and students.” The 2012 results confirm the 2011 results, which looks good. Such stability is highly prized as an indicator that the Department of Education is doing a good job in a difficult situation.
However, when equipercentile equating was used on the Grade 4 writing test, it created a furor. Announced in advance, with ample time for all interested parties to maneuver, equipercentile equating was acceptable on the Grade 3 reading test; applied as a stopgap measure on the Grade 4 writing test, it failed. Rankings from test scores are therefore a very political matter: the right proportions must pass and fail, rather than the test serving as a measure of some identified student ability.
The Center on Education Policy (CEP) sent an open letter to the member states of SBAC and PARCC on 3 May 2012, suggesting: “Routinely report mean (average) scores on your assessments for students overall and for each student subgroup at the state and local levels, as well as across the consortium. This should be done in addition to reporting the percentages of students reaching various achievement levels.” We need creative teaching, not creative manipulation of test results.
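A small example shows why the CEP suggestion matters: two groups can pass at exactly the same rate while their mean scores differ, so percent-at-level reporting alone hides real differences. The scores and the cut of 200 are made up for illustration.

```python
# A minimal sketch of the CEP suggestion: report means alongside percent
# at level. The two made-up groups below pass at the same rate, yet their
# mean scores differ, which percent-passing alone would hide.
import numpy as np

def summarize(scores: np.ndarray, cut: float) -> tuple[float, float]:
    """Return (mean score, percent at or above the cut)."""
    return float(scores.mean()), float((scores >= cut).mean() * 100)

group_a = np.array([180, 190, 205, 210, 240])   # 3 of 5 pass
group_b = np.array([150, 160, 205, 215, 260])   # 3 of 5 pass
cut = 200
print(summarize(group_a, cut))  # (205.0, 60.0)
print(summarize(group_b, cut))  # (198.0, 60.0)
```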
In conclusion, standardized state tests are now much closer to standardized classroom tests. Reasonable attempts are made to select questions that will produce a workable distribution for ranking students, teachers, and schools. The classroom teacher is replaced with committees of experts. The test results are then inspected by another set of expert committees to see what happened, just as a teacher would inspect classroom results at all levels of thinking. (The state has one year to do what a classroom teacher does in one hour.)
The largest remaining failure in all of this, IMHO, is that the work is being done with a scoring method that functions at the lowest levels of thinking: the right-count scored multiple-choice test. Although examiners now give themselves the opportunity to use their best judgment, at all levels of thinking, when interpreting test scores (as classroom teachers always have), they have yet to give students the opportunity to use their best judgment, at all levels of thinking, to mark the answers they trust as the basis for further learning and instruction.
To obtain accurate, honest, and fair results, students must be given the opportunity to report what they trust, with no guessing required. It only takes a change in the test instructions. PUP, Winsteps, and Amplifire can all score a multiple-choice test at all levels of thinking. If we want students to be skillful bicycle riders, we must stop testing them only on tricycles.
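As a rough sketch of what such a change in test instructions enables, the code below scores right/wrong/omit responses by averaging a quantity score (percent right of all items) with a quality score (percent right of attempted items). This equal weighting is my illustrative assumption, not the published PUP, Winsteps, or Amplifire method.

```python
# A minimal sketch of scoring that rewards student judgment, assuming
# students may omit items they do not trust. Averaging the quantity score
# (% right of all items) with the quality score (% right of attempted
# items) is an illustrative weighting, not a vendor's published formula.
def knowledge_judgment_score(responses: list[str]) -> float:
    attempted = [r for r in responses if r != 'omit']
    right = sum(r == 'right' for r in responses)
    quantity = right / len(responses)                     # what was known
    quality = right / len(attempted) if attempted else 0  # judgment shown
    return 100 * (quantity + quality) / 2

# A student who answers only the 6 items she trusts, all correctly,
# outscores a guesser who answers all 10 and gets 6 right.
careful = ['right'] * 6 + ['omit'] * 4
guesser = ['right'] * 6 + ['wrong'] * 4
print(knowledge_judgment_score(careful))  # 80.0
print(knowledge_judgment_score(guesser))  # 60.0
```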