Wednesday, October 10, 2012

Your Standardized State Test


A standardized state test is created in much the same way as a standardized classroom test (see the prior post), with a few exceptions:

1. The initial questions are field-tested and calibrated.
2. Mastery questions are rejected (a standardized state test is not concerned with what students actually know, because of the following exception).
3. Only discriminating items are selected for the operational test, to make it as powerful as possible, with the fewest items, for ranking schools, teachers, and students.

State test results can be parsed by inspecting what happened, at all levels of thinking, in the same way as classroom test results, using scores and portions of scores. However, state test results are usually parsed based on standardized expected scores and portions of scores. A set of common items is sprinkled through each test. If the common items perform the same on both tests, the two tests are declared to be of equal difficulty. Unfortunately, this practice does not work like pixie dust; the common items sometimes fail. In 2006, Florida saw a marked increase (the highest value ever reported on the Grade 3 FCAT Reading SSS), followed by a marked decrease in 2007.
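The common-item check described above can be sketched in a few lines. This is a minimal illustration, not the procedure any state actually uses: the toy response data and the 0.05 drift tolerance are invented for the example, and real programs use item response theory rather than raw p-values.

```python
# Hypothetical sketch: checking whether common (anchor) items behave the
# same on two test forms. Toy data and the tolerance are illustrative.

def p_values(responses):
    """Proportion of examinees answering each item correctly.
    `responses` is a list of 0/1 rows, one row per examinee."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]

def anchor_drift(form_a, form_b, tolerance=0.05):
    """Return indices of anchor items whose difficulty shifts between forms."""
    pa, pb = p_values(form_a), p_values(form_b)
    return [i for i, (a, b) in enumerate(zip(pa, pb)) if abs(a - b) > tolerance]

# Toy data: 4 examinees x 3 anchor items per form.
form_a = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
form_b = [[0, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
print(anchor_drift(form_a, form_b))  # item 0 drifts: p = 1.00 vs 0.25
```

When a drifting item like this is kept in the linking set, the "equal difficulty" declaration rests on a broken assumption, which is one way a common-item link can fail.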

How state test results are reported to the public has, therefore, evolved from risky raw scores, to percent passing, to increase over last year, to fairly safe equipercentile equating. (This creativity carries on into how states inflate their educational progress based on their standardized test results.) A method of transition equipercentile equating was initiated with the Grade 3 FCAT 2.0 Reading (2011) test to help solve the problems created by ranking students on traditional forced-choice, right-count-scored tests when introducing a new form of a test. It is rather clever marketing.
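The core idea of equipercentile equating is simple: a score on the new form is mapped to the old-form score sitting at the same percentile rank. The sketch below assumes a basic percentile-rank definition (percent at or below) and invented score lists; operational equating adds smoothing and interpolation not shown here.

```python
# Illustrative equipercentile equating: map a new-form raw score to the
# old-form score with the nearest percentile rank. Data are invented.

def percentile_rank(scores, x):
    """Percent of examinees scoring at or below x."""
    return 100.0 * sum(1 for s in scores if s <= x) / len(scores)

def equipercentile(new_scores, old_scores, x):
    """Old-form score whose percentile rank is closest to that of x on the new form."""
    target = percentile_rank(new_scores, x)
    return min(sorted(set(old_scores)),
               key=lambda s: abs(percentile_rank(old_scores, s) - target))

old = [10, 12, 14, 15, 16, 18, 20, 22, 24, 25]
new = [8, 9, 11, 12, 13, 14, 16, 17, 19, 20]
# A 14 on the new form sits at the 60th percentile; so does an 18 on the old form.
print(equipercentile(new, old, 14))  # → 18
```

Note what this guarantees: the *distribution* of reported results stays stable across forms, regardless of whether students on the new form actually know more or less. That is exactly why the author calls it clever marketing.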

·      2010 Last FCAT test reported in old achievement levels.
·      2011 First FCAT 2.0 test reported in old achievement levels.
·      2011 First FCAT 2.0 test reported in new achievement levels.
·      2012 Second FCAT 2.0 test reported in new achievement levels.

FCAT Reading – Sunshine State Standards
Percent of students at each achievement level (Level 3–5 is the sum of Levels 3, 4, and 5):

Year       Level 1   Level 2   Level 3   Level 4   Level 5   Level 3–5
2010 old      16        12        33        31         8         72
2011 old      16        12        33        31         8         72
2011 new      18        25        23        24        10         57
2012 new      18        26        23        22        11         56

“The scores are being reported in this way to maintain consistent student expectations during the transition year.” There is no delay in publishing results in the transition year of 2011; the 2011 scores are simply parsed with the 2010 achievement levels. The Department of Education then has a year to bring all interested parties together to create the new achievement levels.

“Although the linking process does not change the statewide results for this year [2011], it does provide different results for districts, schools, and students.” The 2012 results confirm the 2011 results, which looks good. Such stability is highly prized as an indicator that the Department of Education is doing a good job in a difficult situation.

However, when equipercentile equating was used on the Grade 4 writing test, it created a furor. Announced in advance, with ample time for all interested parties to maneuver, equipercentile equating was acceptable on the Grade 3 reading test. Applied as a stopgap measure on the Grade 4 writing test, it failed. Rankings from test scores are therefore a very political matter: the right portion must pass and fail, rather than the test serving as a measure of some identified student ability.

The Center on Education Policy (CEP) sent an open letter to the member states of SBAC and PARCC on 3 May 2012, suggesting: “Routinely report mean (average) scores on your assessments for students overall and for each student subgroup at the state and local levels, as well as across the consortium. This should be done in addition to reporting the percentages of students reaching various achievement levels.” We need creative teaching, not creative manipulation of test results.
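The CEP's point can be made concrete with a toy example (the cohorts and the cut score below are invented): two groups can put the same percentage of students over a cut score while their mean scores tell very different stories.

```python
# Invented illustration of why means add information beyond percent-at-level.

def percent_at_or_above(scores, cut):
    """Percent of students scoring at or above the cut score."""
    return 100.0 * sum(1 for s in scores if s >= cut) / len(scores)

def mean(scores):
    return sum(scores) / len(scores)

cohort_a = [55, 60, 70, 75, 80]
cohort_b = [20, 30, 70, 75, 80]
cut = 70

print(percent_at_or_above(cohort_a, cut), mean(cohort_a))  # 60.0 68.0
print(percent_at_or_above(cohort_b, cut), mean(cohort_b))  # 60.0 55.0
```

Both cohorts report "60% at Level 3 or above," yet cohort B's mean is 13 points lower; reporting only the achievement-level percentages hides that difference.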

In conclusion, standardized state tests are now much closer to standardized classroom tests. Reasonable attempts are made to select questions that will produce a workable distribution for ranking students, teachers, and schools. The classroom teacher is replaced with committees of experts. The test results are then inspected to see what happened by another set of committees of experts just as a teacher would inspect classroom results at all levels of thinking. (The state has one year to do what a classroom teacher does in one hour.)

The largest remaining failure in all of this, IMHO, is that all of this work is being done with a scoring method that functions at the lowest levels of thinking: the right-count-scored multiple-choice test. Although examiners now give themselves the opportunity to use their best judgment, at all levels of thinking, in interpreting test scores (as classroom teachers always have), they have yet to give students the opportunity to use their best judgment, at all levels of thinking, to mark answers they trust as the basis for further learning and instruction.

To obtain accurate, honest and fair results, students must be given the opportunity to report what they trust – no guessing required. It only takes a change in test instructions. PUP, Winsteps, and Amplifire can score a multiple-choice test at all levels of thinking. If we want students to be skillful bicycle riders, we must stop testing them only on tricycles.
