Friday, June 25, 2010

Understanding and Trusting NCLB Test Standards - TAKS




After eight years, people still have trouble understanding and trusting NCLB standardized testing, judging by “Qualms arise over TAKS standards,” Ericka Mellon’s Houston Chronicle article (7 June 2010) on the Texas TAKS grade 8 social studies test.

‘State Rep. Scott Hochberg, vice chairman of the House Public Education Committee, said in the Houston Chronicle, “You can get more than halfway to passing just by guessing.”’

A distribution of the lucky scores expected on a test with 48 four-option questions shows this to be correct, on average.
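For readers who want to check the arithmetic, here is a minimal sketch in Python (the cut score of 21 is the 2010 value discussed below): the expected score from blind guessing on 48 four-option questions is 12 right, which is indeed more than halfway to that cut score.

    # Expected "lucky" score from blind guessing: 48 questions, 4 options each.
    N_ITEMS, P_GUESS = 48, 0.25
    expected_lucky = N_ITEMS * P_GUESS          # 12 right, on average
    print(expected_lucky, expected_lucky / 21)  # 12 of the 21 needed in 2010 -> 57%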

‘TEA Deputy Associate Commissioner Gloria Zyskowski said agency officials set the bar high enough so “students can’t pass the test by chance alone.”’

One very lucky student out of 100 needs to add only 2 right marks to pass. One very unlucky student out of 100 needs to add 17 right marks to pass. Students cannot pass the test by luck alone. The unfairness of students starting the test with lucky scores ranging from 5 to 19 matters little on this test, since 95 percent passed with an average score over 80%. [YouTube]
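The 5-to-19 range comes from the tails of the same guessing distribution. A rough sketch, using only the Python standard library, that locates the scores at roughly the 1st and 99th percentiles of pure guessing:

    from math import comb

    N, P = 48, 0.25
    pmf = [comb(N, k) * P**k * (1 - P)**(N - k) for k in range(N + 1)]

    # Running total gives the chance of guessing k or fewer items right.
    cdf, running = [], 0.0
    for p in pmf:
        running += p
        cdf.append(running)

    unlucky = next(k for k, c in enumerate(cdf) if c >= 0.01)  # about 5 right
    lucky   = next(k for k, c in enumerate(cdf) if c >= 0.99)  # about 19 right
    print(unlucky, lucky)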

‘Sarah Winkler, the president of the Texas Association of School Boards, was shocked to find out Monday that the TEA doesn’t set the passing bar – called the cut score – until after students take the TAKS.’

This practice takes the TEA out of the game. They no longer have to bet on what the cut score should be (and deal with all of the ramifications if they are wrong). They can bring all their expertise to bear on setting the most appropriate cut score. An operational test is not governed by research rules and hypothesis testing of average scores.

Operational testing is concerned with each student when the results determine whether a grade is passed or failed. In this case, and especially when low cut scores are used, it would be nice if students could also get out of the game of right count scoring (guess testing). The TEA can do this by using Knowledge and Judgment Scoring, which lets all students start the test at the same score and gives equal value to what they know and to the judgment needed to make use of what they know. It assesses all levels of thinking, an innovation ready for the next revision of NCLB. The social studies test yielded an average score over 80%. The TEA could also use Confidence Based Learning scoring, which functions at the mastery level.
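As a concrete illustration, here is a minimal sketch of one way such a scheme can be scored; the exact weights are my assumption, not a published method: a right answer earns a knowledge point and a judgment point, an omit earns the judgment point only, and a wrong answer earns nothing, so every student starts a blank test at 50%.

    def kj_score(right, wrong, omitted):
        """Knowledge-and-judgment style score as a fraction of possible points."""
        items = right + wrong + omitted
        return (2 * right + omitted) / (2 * items)

    print(kj_score(0, 0, 48))    # 0.50 - the common starting score on a blank test
    print(kj_score(30, 6, 12))   # 0.75 - credit for knowledge and for judicious omits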

‘“We didn’t do anything differently than previous years,” said TEA spokeswoman Debbie Ratcliffe. “It wouldn’t be fair to kids if this test wasn’t at the same difficulty level from year to year.”’

The test characteristic curves, used by psychometricians, for the eight years bear this out. The curves for six of the eight years fall directly on top of one another with a cut score of 25. This is an outstanding piece of work. The year 2005 shows a slight deviation (cut score of 24) and 2010 a much greater deviation in difficulty (cut score of 21). The minute breaks in the scale scores at 2100 and 2400 are the standards for the met and commended performance levels. (PLEASE NOTE that these curves descend to zero on tests that are designed to generate a lowest lucky score of 12 out of 48 questions, on average. This is no problem for true believers.)
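For readers unfamiliar with test characteristic curves: under a one-parameter (Rasch) model the curve is just the sum of the item probabilities at each ability level, and because the model has no guessing parameter the curve does descend toward zero, as the parenthetical note above observes. A small sketch with made-up item difficulties (the actual TAKS values are not published in this post):

    import math

    def rasch_tcc(theta, difficulties):
        """Expected raw score at ability theta under a one-parameter (Rasch) model."""
        return sum(1 / (1 + math.exp(-(theta - b))) for b in difficulties)

    # Hypothetical difficulties for a 48-item form, spread evenly around zero.
    b = [i * 0.1 - 2.35 for i in range(48)]
    for theta in (-4, -2, 0, 2, 4):
        print(theta, round(rasch_tcc(theta, b), 1))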

‘TEA officials say the questions, for the most part, were harder this year, so they followed standard statistical process and lowered the number of items students needed to get correct.’ But were the questions harder or the students less prepared?

The TEA is faithfully following the operating rules that come with their Rasch one-parameter IRT model analyzer (ROPIRT). For the thoroughly indoctrinated true believer, a ROPIRT works like a charm in a space with arbitrary dimensions. Within the ROPIRT there is a mysterious interaction between the average score on a set of anchor questions embedded in each test, the average right count test score, the cut score, and the percent passing, both on each test and with the preceding test. Only the last two or three are generally posted on the Internet. The rest of us must judge its output by the results it returns to the real world.
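For what it is worth, one common way an anchor set links one year’s form to the next is mean/mean equating. The sketch below is my assumption about the kind of step hidden inside a ROPIRT, not a description of the TEA’s actual procedure.

    def equate_shift(anchor_b_last_year, anchor_b_this_year):
        """Mean/mean equating: add this shift to each new item difficulty to
        express it on last year's scale before the cut score is carried over."""
        mean_last = sum(anchor_b_last_year) / len(anchor_b_last_year)
        mean_this = sum(anchor_b_this_year) / len(anchor_b_this_year)
        return mean_last - mean_this

    # Hypothetical anchor difficulties: the same items calibrate 0.3 logits harder
    # on this year's form, so all of this year's difficulties get shifted by -0.3.
    print(round(equate_shift([-0.5, 0.0, 0.4, 1.1], [-0.2, 0.3, 0.7, 1.4]), 2))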

The eight-year run of the social studies grade 8 test shows some informative behavior. Years 2003 and 2004 were originally assigned cut scores of 19 and 22, which yielded passing rates of 93% and 88%. Later, all years were assigned a cut score of 25 except for 2005 (24) and 2010 (21). Now to weave a story with these facts.

Starting in 2003 with a cut score set at 25, 77% passed the test with an average test score of 65.5%. The average test score increased by 5.2% in 2004 to 70.8%. This was not enough to trigger a change in the cut score. The passing rate increased to 81%.

The average test score remained stationary in 2005. This triggered a 4% change in the cut score, down one count from 25 to 24. The ROPIRT decided that the test was more difficult this year, so the passing rate should be adjusted up from 81 to 85%.

The average test score increased by 4.2% in 2006, to 75%. This triggered a 4% change in the cut score, back up one count from 24 to 25. The ROPIRT decided that the test was too easy this year, so the passing rate should be adjusted down from 85 to 83%.

The average test score increased by lesser amounts in 2007, 2008, and 2009 (3.1, 2.1, and 2.1%). These did not trigger an adjustment in the cut score.

In 2010, the average test score decreased by only 2.2% to 80.2%, the same average score as in 2008. The ROPIRT decided the test was far too difficult, changing the cut score by 4 counts, from 25 to 21. That is a 16% adjustment in the cut score for a 2.2% change in the average test score.

The size of this adjustment is not consistent with the previous adjustments. The resulting 2010 passing rate of 95% is not consistent with the 2008 passing rate (90%) on the same average test score. The ROPIRT (which, to my knowledge, only looks back one test) is drifting away from the decisions it made before. [YouTube]
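To make the comparison easier to see, the figures quoted in this post retabulate as follows (the 2007 and 2009 averages are reconstructed from the stated increments, and passing rates the post does not give are left as None):

    # year: (average score %, cut score, percent passing) as reported above.
    years = {
        2003: (65.5, 25, 77),
        2004: (70.8, 25, 81),
        2005: (70.8, 24, 85),
        2006: (75.0, 25, 83),
        2007: (78.1, 25, None),
        2008: (80.2, 25, 90),
        2009: (82.4, 25, None),
        2010: (80.2, 21, 95),
    }

    prev = None
    for year, (avg, cut, passing) in years.items():
        if prev:
            d_avg = avg - prev[0]                    # change in average score
            d_cut = 100 * (cut - prev[1]) / prev[1]  # change in cut score, %
            print(f"{year}: average {d_avg:+.1f}, cut score {d_cut:+.0f}%")
        prev = (avg, cut)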

The Texas data show four interesting things:

  1. If students do too well on a test, it is declared too easy and the cut score is raised to lower the pass rate even though they may have actually performed better.
  2. If students do too poorly on a test, it is declared too difficult and the cut score is lowered to raise the pass rate even though they may have actually performed poorly.
  3. If the above goes on long enough, the whole process drifts away from the original benchmark values and requires recalibration.
  4. A benchmark cut score can be revised based on the results of following years. This is consistent with the ROPIRT operating instructions to remove imperfect data until you get the right answer. 

Calibrating questions with a ROPIRT for use in time-saving computer-assisted testing (CAT) is valid. Using it to equate tests over a period of seven years is another matter. By design (an act of faith), a ROPIRT cannot err, as it lives in a perfect world of true scores. (These error-free true scores, the raw scores found on the raw-score-to-scale-score conversion tables, are generally considered to be on the same scale as the right count test scores, even though each student’s right count test score is influenced by a number of factors, including item discrimination and lucky scores.) Error occurs when imperfect data fed into a ROPIRT are not manually detected and removed. The blame game then ends with operator inexperience. Since Texas is using a Rasch Partial-Credit Model in a ROPIRT mode, it could use Knowledge and Judgment Scoring to reduce the error from traditional right count scoring.
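For reference, the partial-credit model mentioned above gives each item more than two score levels. A minimal sketch of its category probabilities (the step difficulties here are made up for illustration):

    import math

    def pcm_probs(theta, steps):
        """Category probabilities for one item under the Rasch partial-credit model."""
        # Cumulative sums of (theta - step) give the unnormalized log-odds per category.
        logits = [0.0]
        for delta in steps:
            logits.append(logits[-1] + (theta - delta))
        expd = [math.exp(x) for x in logits]
        total = sum(expd)
        return [e / total for e in expd]

    # Hypothetical three-level item (scores 0, 1, 2) with two step difficulties.
    print([round(p, 2) for p in pcm_probs(0.0, [-0.5, 0.5])])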

For someone outside a State Department of Education to assess the operation of its ROPIRT, the investigator would need a minimum set of information for each year of the test: the mean of the anchor set of questions embedded in each test (the primary determiner of the change in the cut score), the mean of the right count scored student tests, the cut score, and the percent passing. I have yet to find a state that posts or will provide all four of these values. Texas posts the last three. Arkansas posts the last two.

Are the test results politically influenced? From the data in hand, I don’t know enough to say. High scores (now high pass rates that are sensitive to low scores) are needed to meet federal standards. For several states, the shape of the gently, ever more slowly rising passing-rate curve appears more carefully choreographed than a direct result of student performance. I think a better question is: Is this political influence, or a smoothing effect created when using (and learning to use) a ROPIRT? The revised passing rates for 2003 and 2004 on the social studies grade 8 test give us a mixed clue.