Adding more items with the same difficulties to a perfect-world test on the VES Engine (Table 18) did not change the average student score (50%), the standard deviation (SD of 15.39%), the standard error (SE of 3.44%), or the average item discrimination (PBR of 0.30). The test reliability (KR20) improved, but it was the standard error of measurement (SEM) that made the marked improvement (Chart 35).
This makes sense. The more items on a test, the greater the test reliability; the greater the test reliability, the smaller the range into which repeated scores from the same student can be expected to fall. By doubling the number of items twice, from 20 to 80 items, the SEM fell from 5.39% to 2.64%. By doubling twice again, to 320 items, the value was again cut in half, to 1.32%.
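For readers who want to see the arithmetic behind these numbers, here is a minimal Python sketch (not the VES Engine) that applies the Spearman-Brown prophecy formula and recomputes the SEM as the test is lengthened. The starting values are back-solved from the figures above (SD = 15.39%, SEM = 5.39% at 20 items), so the printed SEMs only approximate Table 18 and Chart 35.

```python
# Sketch: Spearman-Brown prophecy and SEM as a test is lengthened.
# Starting values are back-solved from the post's figures, so results are approximate.

import math

SD = 15.39                        # standard deviation of student scores, in percent
sem_20 = 5.39                     # SEM of the 20-item test, in percent
kr20_20 = 1 - (sem_20 / SD) ** 2  # KR20 implied by that SEM (about 0.88)

def spearman_brown(r, k):
    """Predicted reliability when the test is lengthened k times."""
    return k * r / (1 + (k - 1) * r)

def sem(sd, r):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r)

for items in (20, 40, 80, 160, 320):
    k = items / 20
    r = spearman_brown(kr20_20, k)
    print(f"{items:3d} items: KR20 = {r:.3f}, SEM = {sem(SD, r):.2f}%")
```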
The Common Core State Standards (CCSS) movement is now bringing into practice testing with an average difficulty of 50%. This optimizes test performance, but it bullies students.
A class of 20 students, IMHO, can produce usable results if eight 40-item tests are used during the course. With an SEM of 1.32%, scores from the same student would only need to be 1.32% x 3 ≈ 3.96% apart to show acceptable improvement in performance.
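As a quick check on that rule of thumb, the sketch below treats a score gain as meaningful only when it exceeds three times the SEM; the factor of three is the criterion used in this post, not a universal standard.

```python
# Sketch of the post's rule of thumb: a gain counts only if it exceeds 3 x SEM.
def meaningful_gain(score_1, score_2, sem, multiplier=3):
    """True if the second score exceeds the first by more than multiplier * SEM."""
    return (score_2 - score_1) > multiplier * sem

# With the 320-item SEM of 1.32%, a 4-point gain counts; a 3-point gain does not.
print(meaningful_gain(62.0, 66.0, 1.32))  # True  (4 > 3.96)
print(meaningful_gain(62.0, 65.0, 1.32))  # False (3 < 3.96)
```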
Testing companies can then market a single test, with a total of 80 to 160 items, that will rank students and teachers with acceptable precision based on test scores. Each student will have to read every item on paper. Computer adaptive testing (CAT) will generally require fewer items than that, which means CAT students will not take the same test.
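To make that last point concrete, here is a toy adaptive-selection loop in Python. It is not any vendor's CAT algorithm; the item bank, the Rasch-style response model, and the crude up/down ability update are all illustrative assumptions. Its only purpose is to show why two students taking the "same" adaptive test end up seeing different items.

```python
# Toy illustration (not any vendor's algorithm) of why CAT students do not take the
# same test: each next item is chosen near the student's current ability estimate,
# so the item sequence depends on that student's answers.

import math
import random

random.seed(1)
# A made-up bank of 200 items with difficulties on a logit scale.
item_bank = sorted(random.uniform(-3, 3) for _ in range(200))

def next_item(ability, used):
    """Pick the unused item whose difficulty is closest to the current ability estimate."""
    return min((i for i in range(len(item_bank)) if i not in used),
               key=lambda i: abs(item_bank[i] - ability))

def simulate(true_ability, n_items=20, step=0.5):
    ability, used = 0.0, set()
    for _ in range(n_items):
        i = next_item(ability, used)
        used.add(i)
        # Rasch-style response: more likely correct when ability exceeds difficulty.
        correct = random.random() < 1 / (1 + math.exp(-(true_ability - item_bank[i])))
        ability += step if correct else -step  # crude up/down ability update
    return sorted(used)

print(simulate(1.0))   # items seen by a stronger student
print(simulate(-1.0))  # a weaker student's sequence quickly diverges
```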
Again, testing is optimized for the testing companies, which are only required to rank students. They can calibrate items on a group of representative students. They can then present different items, comparable only in difficulty, as equivalent items. This only makes sense if every student has the same general background and preparation and is an average student with average luck on test day. The practice reduces individuality and eliminates creativity. It does not have to be that way.
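The sketch below shows the bare idea of difficulty-only calibration: each item is summarized by the fraction of a norming group answering it correctly, and items in the same difficulty band are then treated as interchangeable. The response data and band width are made up for illustration; operational programs use item response theory rather than raw p-values, but the substitution logic is the same.

```python
# Difficulty-only calibration: items are reduced to the proportion of a norming
# group answering correctly (the classical p-value), then items in the same band
# are swapped in and out as "equivalent" regardless of what they actually ask.

responses = {                      # item -> 0/1 results from a (made-up) norming group
    "item_A": [1, 1, 0, 1, 1, 0, 1, 1],
    "item_B": [1, 0, 0, 1, 1, 0, 1, 0],
    "item_C": [0, 0, 0, 1, 0, 0, 1, 0],
}

def p_value(results):
    """Classical item difficulty: proportion of students answering correctly."""
    return sum(results) / len(results)

def band(p, width=0.25):
    """Coarse difficulty band used to decide which items count as interchangeable."""
    return round(p / width) * width

for item, results in responses.items():
    p = p_value(results)
    print(f"{item}: p = {p:.2f}, band = {band(p):.2f}")
```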
Armed with the above ability to rank students, testing companies are also marketing more tests: formative, summative, and, in between, “submative” (neither formative nor summative). The same items can be used on all three. The difference is that the formative process takes place in such a timely manner that the student learns (in seconds to minutes at higher levels of thinking, and in minutes to days at lower levels of thinking). The summative test measures what has happened, not what is being learned at the moment.
The “submative” test falls in between as a subtest, but again it measures the past. IMHO, it also hints that buying such a test is better, in the short term, for school administrators than letting a good teacher assess in a normal classroom. Relying on short-term, lower-level-of-thinking tests that only rank students does not promote the development students need to become successful, self-educable, high-quality achievers. (CCSS-movement multiple-choice questions may be highly contrived, requiring considerable problem-solving skills, but they are still easier to score on than a bingo operation: good luck finding the right answer, with 1/4 free instead of 1/25 free.)
It does not have to be that way. The very same items can be scored to promote student development, function as formative experiences, and provide immediate guidance for teaching. Just because testing companies can deliver high-quality rankings does not mean we should limit the return on the time and money invested (by students, teachers, and taxpayers) to just ranking. This cripples schooling. The decade of NCLB experience presents the evidence here.
As suggested in the previous post, we need more than 20 test items and, IMHO, a test scored for what students trust they actually know and can do, such as Power UP Plus by Nine-Patch Multiple-Choice, the partial-credit Rasch model by Winsteps, and Amplifire by Knowledge Factor.
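For readers unfamiliar with this style of scoring, the sketch below contrasts right-count scoring with a knowledge-and-judgment style score in which omitting an untrusted item earns partial credit. The weights (right = 2, omit = 1, wrong = 0) are illustrative assumptions only, not the actual Power UP Plus, Winsteps, or Amplifire rules.

```python
# Contrast of traditional multiple-choice (TMC) right-count scoring with a
# knowledge-and-judgment style score. Weights are illustrative assumptions.

def tmc_score(answers):
    """Right-count scoring: only correct answers earn credit."""
    return 100 * sum(1 for a in answers if a == "right") / len(answers)

def kjs_score(answers):
    """Knowledge-and-judgment style: omitting an item you do not trust earns
    partial credit, so guessing is no longer the best strategy."""
    points = {"right": 2, "omit": 1, "wrong": 0}
    return 100 * sum(points[a] for a in answers) / (2 * len(answers))

# A student who reports only what she trusts (answers 12, omits 8) versus
# one who answers everything and picks up two lucky hits.
careful = ["right"] * 12 + ["omit"] * 8
guesser = ["right"] * 14 + ["wrong"] * 6

print(tmc_score(careful), kjs_score(careful))   # 60.0 80.0
print(tmc_score(guesser), kjs_score(guesser))   # 70.0 70.0
```

Under these illustrative weights, the student who reports only what she trusts comes out ahead of the one who guesses on everything, which is the point: the score now carries information about judgment as well as knowledge.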
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):
[I again checked the test reliability values with the
Spearman-Brown prophecy formula (Table 19). At this high end of the range, they
closely matched the results from the VES Engine. The test with 20 items made
four predictions that were increasingly close to the
observed (x1) test reliability.]