17
Standardized test makers use statistics to predict what may
happen; classroom statistics describe what has happened. Classroom tests
include two or three dozen students. Standardized test making requires several
hundred students. Classroom tests are given to find out what a student has yet
to learn and what has been learned. Standardize tests are generally given to
rank students based on a benchmark test sample. Classroom and standardized
tests have other significant differences even though they may use many of the
same items.
I took the two classroom charts (37 and 38 in a previous post) and extended the standard deviations (SD)
from 5-10% to 10-30%, a more realistic range for standardized tests (Chart
44). At a 70% average score and a 20% SD, the normal curve plots of 40 students by
40 items started going off scale. I then reversed the path back to the original
average score of 50% as the SD rose from 20% to 30%.
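A quick way to see why those settings push a normal curve off scale is to ask how much of the distribution would land above 100%. Here is a sketch of that arithmetic (my own check, not part of the VESE tables):

```python
import math

def off_scale(mean, sd):
    """Fraction of a normal(mean, sd) score distribution that falls above 100%."""
    z = (100 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

print(off_scale(70, 20))   # about 0.07: roughly 7% of scores would sit past 100%
print(off_scale(50, 30))   # about 0.05
```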
The test reliability (KR20) continued to rise with the SD for
these normal distributions set for maximum performance. The item discrimination
(PBR) rose slightly. The relative SEM/SD value decreased (improved) from 0.350
to 0.157 as test reliability increased (improved).
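If you want to check these three statistics against your own mark tables, the calculations are simple enough to sketch. The following Python sketch assumes a students-by-items table of 0/1 marks; the function names and the illustrative table are mine, not part of the VESE engine:

```python
import numpy as np

def kr20(marks):
    """KR20 reliability for a students-by-items matrix of 0/1 marks."""
    k = marks.shape[1]                          # number of items
    p = marks.mean(axis=0)                      # item difficulties (proportion right)
    total_var = marks.sum(axis=1).var()         # variance of student total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

def item_pbr(marks):
    """Point-biserial correlation of each item with the total score."""
    totals = marks.sum(axis=1)
    return np.array([np.corrcoef(marks[:, j], totals)[0, 1]
                     for j in range(marks.shape[1])])

def sem_percent(marks):
    """Standard error of measurement, in percent of items."""
    k = marks.shape[1]
    score_sd = (100 * marks.sum(axis=1) / k).std()
    return score_sd * np.sqrt(1 - kr20(marks))

# Illustration only: 40 students by 40 items, with student ability spread out
# so the table has a real score distribution (unlike purely random marks).
rng = np.random.default_rng(0)
ability = np.linspace(0.2, 0.8, 40)[:, None]    # each student's chance of a right mark
marks = (rng.random((40, 40)) < ability).astype(int)
print(kr20(marks), item_pbr(marks).mean(), sem_percent(marks))
```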
The two tests with average test scores of 50% yielded very
different test reliability and item discrimination values for SD values of 10%
and 30% on Chart 44; the greater the distribution spread, the higher the KR20
and PBR values. [I plotted the N – 1 SD to show how close the visual education
statistics engine (VESE) tables were to their expected normal curves.]
The SD is thus a key indicator of test performance; spreading out the
student score distribution is the main goal of standardized test
makers. It is also very sensitive to extreme values. The 30% SD plot was made
by teasing the VESE table that I had set for a 30% SD. The original SD value was near
that for a perfect Guttman table (each student score and each item difficulty
appears only once), about 28%. By moving four pairs of marks, near the extreme
ends of the distribution, one count farther toward the ends, the SD rose to 30%.
That is, moving four pairs of marks out of 400 pairs, one count each, changed the
SD by 2%.
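To get a feel for how sensitive the SD is to a few extreme values, here is a small Python illustration. The score list is hypothetical, not the actual VESE table, but the effect is the same in kind: nudging only the two lowest and two highest of 40 percent scores one item (2.5 percentage points) farther out produces a visible change in the SD:

```python
import numpy as np

# Hypothetical percent scores for 40 students on a 40-item test, centered on 50%.
scores = np.linspace(25, 75, 40)
print(round(scores.std(), 2))          # SD before the nudge

nudged = scores.copy()
nudged[:2] -= 2.5                      # push the two lowest scores one item lower
nudged[-2:] += 2.5                     # push the two highest scores one item higher
print(round(nudged.std(), 2))          # SD after: a visible change from moving 4 scores
```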
The standard error of measurement (SEM) under optimum normal
test conditions remained about 4.4% (Chart 44). So, 4.4 x 3 = 13.2%. A
difference in a student’s performance of more than 13.2% would be needed to
accept the scores as representing a significant improvement with a test
reliability of 0.95. None of the above mark patterns were mixed, which is an
unrealistically optimum performance.
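The 4.4% figure is consistent with the usual relationship SEM = SD * sqrt(1 - reliability). A quick check at one point in the Chart 44 range (my arithmetic, with small rounding differences from the chart):

```python
import math

sd = 20.0                      # score SD in percent, within the Chart 44 range
kr20 = 0.95                    # test reliability
sem = sd * math.sqrt(1 - kr20)
print(round(sem, 1))           # about 4.5%, close to the 4.4% read off Chart 44
print(round(3 * sem, 1))       # about 13.4%: the gain needed before trusting an improvement
```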
I looked again at the effect of mixing right and wrong marks
on an item mark pattern with a higher SD value than found in the classroom
(Chart 45). The change from an SD of 10% to 20% was much smaller than I had anticipated.
The effect of deeper mixing was again linear.
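The linear falloff can be checked with a small simulation: rank 40 students by total score, start with an unmixed item at 50% difficulty, and swap a growing number of right marks into the lower half at random. The swap scheme below is my own sketch, not necessarily the exact mixing used for Chart 45:

```python
import numpy as np

rng = np.random.default_rng(2)
rank = np.arange(1, 41)                          # students ordered by total score

def mixed_pbr(k, trials=2000):
    """Average PBR of a 50%-difficulty item after swapping k right/wrong marks."""
    vals = []
    for _ in range(trials):
        item = np.array([0] * 20 + [1] * 20)     # unmixed: right marks at the top
        low = rng.choice(20, k, replace=False)          # positions in the lower half
        high = 20 + rng.choice(20, k, replace=False)    # positions in the upper half
        item[low] = 1                            # k right marks pushed down
        item[high] = 0                           # k wrong marks pulled up
        vals.append(np.corrcoef(item, rank)[0, 1])
    return float(np.mean(vals))

for k in range(0, 11, 2):
    print(k, round(mixed_pbr(k), 2))             # falls off in roughly a straight line
```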
Average item difficulty sets limits on the maximum PBR that
can be developed (Chart 46). In a perfect world where all items are marked
either all right or all wrong, the maximum PBR is 1.0 for individual items.
Looking back at prior posts, I found lower values on a
perfect Guttman table (0.84) and a normal curve table set at 30% SD (0.85). The
PBR declined further as the SD was set to 20% and then 10% (Chart 46).
These values hold
for tests with average test scores that range from 50% to 70%.
There is now
enough information to construct the playing field upon which psychometricians
play (Chart 47). I chose two
scoring configurations: Perfect World and Normal Curve with an SD of 20%. The
area in which standardized tests exist is a small part of the total area that
describes classroom tests. The average student score and item difficulty were set
at 50%.
An item mark
pattern at 50% difficulty can produce a PBR of 1.0 in a perfect world (blue).
All right marks are together and all wrong marks are together. The PBR drops to
zero with complete mixing (Table 20). It falls to -1.0 when all right marks
are together at the lower end of the mark pattern.
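A short sketch makes these three boundary cases concrete. It assumes the perfect world case in which every student scores either 0% or 100%, so the total-score column takes only two values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Perfect world: half the students miss everything, half get everything right.
totals = np.array([0] * 20 + [40] * 20)          # total scores on a 40-item test

item_unmixed = np.array([0] * 20 + [1] * 20)     # right marks grouped at the top
item_mixed = rng.permutation(item_unmixed)       # right and wrong marks fully mixed
item_reversed = item_unmixed[::-1]               # right marks grouped at the bottom

pbr = lambda item: np.corrcoef(item, totals)[0, 1]
print(pbr(item_unmixed))    # 1.0
print(pbr(item_mixed))      # near 0
print(pbr(item_reversed))   # -1.0
```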
The area for the normal curve distribution (red) with an SD
of 20% fits inside the perfect world boundary. This entire area is available to
describe classroom test items. Items that are easier or more difficult than 50%
reduce the maximum possible PBR. They have shorter mark patterns. And here too,
fully mixed patterns drop the PBR to zero.
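The shrinking ceiling can also be seen numerically. Against a spread-out ranking of 40 students, the best PBR an unmixed item can reach drops as its difficulty moves away from 50%. This sketch uses a simple uniform ranking rather than the Chart 47 normal curve, so the exact values differ, but the trend is the same:

```python
import numpy as np

rank = np.arange(1, 41)                          # 40 students ordered by total score

def ceiling_pbr(p):
    """PBR of an unmixed item of difficulty p against this ranking."""
    right = int(round(p * 40))
    item = np.array([0] * (40 - right) + [1] * right)   # right marks at the top
    return np.corrcoef(item, rank)[0, 1]

for p in (0.5, 0.7, 0.9):
    print(p, round(ceiling_pbr(p), 2))           # 0.87, 0.79, 0.52: the ceiling falls
```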
We can now see
the problem psychometricians face in making standardized tests. The
standardized test area is about 1/8th of the classroom area.
Standardized tests never use negatively discriminating items (which all but excludes
misconceptions, since these cannot be distinguished from merely difficult items using
traditional multiple-choice scoring, as they can be using Knowledge and Judgment Scoring).
Chart 44 indicates an average PBR of over 0.5 is needed for
the desired test reliability of over 0.95 under optimum conditions (no mark
pattern mixing). With just ¼ mixing, the window for usable items becomes very
small. The effect of mixing right and wrong marks on an item mark pattern
varies with item difficulty. A test averaging 75% right with unmixed items
would be the same as a test averaging 50% right with partially mixed items.
A 2008 paper from Pearson, by Tony
D. Thompson, confirms this situation. “This variation, we argue, likely
renders non-informational any vertical scale developed from conventional
(non-adaptive) tests due to lack of score precision” (page 4). “Non-informational”
means not useful, not valid, does not look right, and does not work, IMHO.
“Conventional” means, in general, paper tests and the fixed form tests being
developed by PARCC for online delivery for the Common Core State Standards
(CCSS) movement.
This comment may be valid for “many educational tests” (page 14): “Also, if an individual’s
observed growth is much larger than the associated CSEM, then we may be
confident that the individual did experience growth in learning.” This
indicates that using simulations within the playing field, as Thompson did,
confirms my exploration of the limits of the playing field. [And the CSEM,
which is applied to each score, is more precise than the SEM based on the
average test score.]
“While a poorly constructed vertical scale clearly cannot be
expected to yield useful scores, a well-defined vertical scale in and of itself
does not guarantee that reported individual scores will be precise enough to
support meaningful decision-making” (page 28). This cautionary note was written
in 2008, several years into the NCLB era.
The VESE tables indicate that the “best we can do” is not
good enough to satisfy marketing department hype (claims). Testing companies
are delivering what politicians are willing to pay for: a ranking of students,
teachers, and administrators based only on a test producing scores of
questionable precision. Additional use of these test scores is problematic.
An unbelievable situation is currently being challenged in
court in Florida.
Student test scores were used to “evaluate” a teacher who never had the
students in class! It reveals the mindset of people using standardized test
scores. They clearly do not
understand what is being measured and how it is being measured. [I hope I do by
the end of this series.] Just because something has been captured in a number does not mean that the number controls that something.
Scoring all the data that answer sheets can capture would
provide the information (repeatedly sought, but ignored in traditional
multiple-choice scoring) needed to guide student, teacher, and administrator
development. Schools designed for failure (“Who can guess the answer?”) fail.
Schools designed for success have rapid, effective feedback, with student
development (judgment) held as important as knowledge and skills. Judgment
comes from understanding, a goal of the CCSS movement.
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):