Wednesday, December 19, 2012

Nebraska Assessment Four Star Update

The 2012 NeSA Technical Report contains the information needed to complete (and make corrections on) the Grade 3 Reading Performance chart. The reported portion passing was 76% for Grade 3 (Nebraska Accountability/NeSA Reading/Grade 3).

The observed average score reported was 70%. The estimated expected score was about 66% [No calibration values were given for 10 fairly easy items that were at the beginning of the 2012 test].

The students did better in 2012 on a test that may have been more difficult than in 2011. [The lack of the calibration data on 10 easier items is critical to verifying what happened.]

[All students were presented with all 45 questions. Although the test was taken online, it was not a computer adaptive test (CAT). The test design item difficulty was 65%, which is 15 percentage points above the CAT design value (50%).

“Experience suggests that multiple choice items are effective when the student is more likely to succeed than fail and it is important to include a range of difficulties matching the distribution of student abilities (Wright & Stone, 1979).” (2012 NeSA Technical Report, page 31)

The act of measuring should not alter the measurement. The Nebraska test seems to be a good compromise between what psychometricians want in order to optimize their calculations and what students are accustomed to in the classroom. CAT at 50% difficulty is not a good fit.]

Fifteen common (core) items were used in all three years: 2010, 2011, and 2012. Their performance is remarkably stable, which testifies to the skill of the test creators in writing, calibrating, and selecting items that present a uniform challenge over the three years.

It also shows that little has changed in the entire educational system (teach, learn, assess) with respect to these items. [Individual classroom successes are hidden in a massive collection of several thousand test results.]

My challenge to Nebraska to include student judgment on standardized tests drew about as many hits on this blog as there were letters mailed. No other contact occurred.

This means that standardized testing will continue counting right marks that may have very different meanings. At the lowest levels of thinking, good luck on test day will be an important contributing factor for passive pupils to pass a test where passing requires a score of 58% on a scale with a mean at 70%.
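The role of luck on test day for a borderline student can be put into numbers with a small binomial sketch. It assumes the 45-item test and 58% cut score mentioned above; the 50% known fraction and pure 1-in-4 guessing on the remaining items are illustrative assumptions, not reported values:

```python
from math import ceil, comb

def pass_probability(n_items=45, cut=0.58, known_fraction=0.50, guess_p=0.25):
    """Chance that knowledge plus lucky guesses reaches the cut score."""
    known = round(n_items * known_fraction)      # items marked right from knowledge
    guessed = n_items - known                    # items answered by pure guessing
    need = max(0, ceil(cut * n_items) - known)   # lucky guesses still required
    # Binomial tail: probability of at least `need` successes in `guessed` trials
    return sum(comb(guessed, k) * guess_p**k * (1 - guess_p)**(guessed - k)
               for k in range(need, guessed + 1))
```

Under these assumptions, a passive pupil who truly knows only half the material still clears the cut roughly 70% of the time on guessing alone, which is why right-mark counts with very different meanings can produce the same passing score.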

Students able to function at higher levels of thinking but with limited opportunity to prepare for the test will not be able to demonstrate the quality of what they do know or can do. Both groups will be ranked by right marks that have very different meanings.

The improvement in reading seen in the lower Nebraska grades (Nebraska Accountability/NeSA Reading) failed to carry over into the higher grades. Effective teachers can deliver better prepared students functioning at lower levels of thinking at the lower grades. [Student quality becomes essential at higher levels of thinking in the higher grades.]

Typically the rate of increase in test scores decreases with each year (average Nebraska Grade 3 scores of 65%, 68% and 69% on the 15 common items) where classrooms and assessments function at lower levels of thinking. Students and teachers need to break out of this short-term-success trap.

[And state education officials need to avoid the temptation many took in the past decade of NCLB testing to produce results that looked right. It is this troubled past that makes the missing expected item difficulty values for 10 of the easier 2012 test items so critical.]

The Common Core State Standards movement is planning to avoid the short-term-success trap. Students are to be taught to be self-correcting: question, answer, and verify. Students are to be rewarded for what they know and can do and for their judgment in using that knowledge and skill.

Over the long term students are to develop the habits needed to be self-empowering and self-assessing. These habits function over the long term, in school and in the workplace. They provide the quality that is ignored with traditional right count multiple-choice tests. In school, if you do not reward it, it does not count.

The partial credit Rasch model and Knowledge and Judgment Scoring allow students to elect to report what they trust they know and can do as the basis for further instruction and learning. Quantity and quality are both assessed and rewarded.
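As an illustration only (the exact weights used in Knowledge and Judgment Scoring are an assumption here, not a published rubric), a partial credit scheme might score a right mark 2, an omit ("I have yet to learn this") 1, and a wrong mark 0, so that honest reporting is rewarded over blind guessing:

```python
def kjs_score(marks):
    """Knowledge and Judgment Scoring sketch with assumed weights:
    right mark = 2, omit = 1, wrong mark = 0, scaled to a percent.
    Omitting beats guessing wrong, so judgment is part of the score."""
    weights = {"right": 2, "omit": 1, "wrong": 0}
    return sum(weights[m] for m in marks) / (2 * len(marks))

# A student who reports only what they trust, omitting the rest:
print(kjs_score(["right", "right", "omit", "omit"]))   # 0.75
```

Quantity (right marks) and quality (judgment in choosing to mark or omit) both contribute to the final score.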

Nebraska can still create a five star standardized test.

Season's Greetings and a Happy New Year!

Wednesday, December 12, 2012

Pearson Computer Adaptive Testing (CAT)

The tools psychometricians favor are most sensitive when a question divides the class into two equal groups of right and wrong. This situation only exists when scoring traditional multiple-choice (TMC) at one point in a normal score distribution: at an item difficulty of 50%.

The invention of item response theory (IRT) made it possible to extend this situation (half right and half wrong) to the full range of item difficulties. IRT also allows expressing item difficulty and student ability on the same logit (log odds) scale.

IRT calibrated items make computer adaptive testing (CAT) possible. Items are grouped by the estimated difficulty that matches the estimated student ability needed to make a right response 1/2 of the time.
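The half-right matching can be made concrete with the one-parameter (Rasch) IRT model. The logistic form below is the standard one; the sample ability and difficulty values are illustrative:

```python
from math import exp

def rasch_p(ability, difficulty):
    """Rasch model: probability of a right response given student ability
    and item difficulty, both expressed on the same logit (log odds) scale."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

# When ability equals difficulty, the student is right half the time:
print(rasch_p(0.0, 0.0))   # 0.5
print(rasch_p(1.0, 0.0))   # about 0.73: an easier item for this student
```

CAT exploits this: serving an item whose difficulty equals the current ability estimate pins the expected success rate at exactly 1/2.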

Typically, students must select one of the given options. Omit, or “I have yet to learn this”, is not an included option. The failure to include student judgment is a legacy from TMC (see previous posts).

Traditional CAT is therefore limited to ranking examinees. It is a very efficient way to determine if a student meets expectations based on a group of similar students. It is the solitary academic version of Family Feud.

The game is simple. Answer the first question. If right, you will be given a bit more difficult question. If wrong, you will be given a bit less difficult question.

If you are consistently right, you finish the test with a minimum of questions. The same can be said for being consistently wrong.

In between, the computer seeks a level of question that you get right half of the time. If an adequate number of selections fall within an acceptable range, you pass, and the test ends. Otherwise the test continues until a time limit or item count is reached and you fail.
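The step-up/step-down game described above can be sketched as a simple loop. The Rasch response model, fixed 0.5-logit step, and fixed item count are assumptions for illustration; operational CAT engines typically use maximum-likelihood ability estimation and standard-error-based stopping rules:

```python
import random
from math import exp

def simulate_cat(true_ability, n_items=30, step=0.5, seed=7):
    """Minimal CAT sketch: present an item matched to the current ability
    estimate, step the estimate up after a right answer, down after a wrong one."""
    rng = random.Random(seed)
    estimate = 0.0
    for _ in range(n_items):
        difficulty = estimate                # item targeted at ~50% success
        p_right = 1.0 / (1.0 + exp(-(true_ability - difficulty)))
        if rng.random() < p_right:
            estimate += step                 # right: serve a harder item next
        else:
            estimate -= step                 # wrong: serve an easier item next
    return estimate

# A strong and a weak student drive the estimate in opposite directions:
print(simulate_cat(3.0))    # settles well above zero
print(simulate_cat(-3.0))   # settles well below zero
```

Note what the loop converges to: a difficulty level at which the student is wrong half the time, which is the point made throughout this post.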

If paper testing under NCLB was considered the biggest bully in the school, CAT increases the pressure. You must answer each question as it is presented.

You are not permitted to report what you know. You are only given items that you can mark right about 1/2 of the time. You are in a world far different from a normal classroom. It is more like competing in the Olympics.

You are now CAT food. Originality, innovation, and creativity are not to be found here. Your goal is to feed the CAT the answer your peer group selected for you as the right answer 1/2 of the time (that is right, they did not know the right answer 1/2 of the time either).

Playing the game at the 1/2 right level is not reporting what you trust you know or can do. It is playing with rules set up to maximize the desired results of psychometricians. Your evaluation of what you know does not count.

Your performance on the test is not an indication of what you trust you know and can do, but it is generally marketed as such. This is not a unique regulatory situation.

Sheila Bair, Chairman of the Federal Deposit Insurance Corporation, 2006-2011, described the situation in NCLB in terms of bank regulators, “They confuse their public policy obligations with whether the bank is healthy and making money or not.” (Charlie Rose, Wed 10/31/2012 11:00pm, Public Broadcasting System)

Psychometricians confuse their public obligation to measure what students know and can do with their concern for item discrimination and test reliability. This has perpetuated TMC, OMC, and CAT using forced-choice tests. The emphasis has been on test performance rather than on student performance.

[Local and state school administrators further modify the test scores to produce an even more favorable end result, expressed as percent improvement and percent increase by level of performance, and at the same time they suppress the actual test scores. Just like big bankers gambling with derivatives!]

IRT bases item calibration on a set of student raw scores. Items are then selected to produce an operational test of expected performance from which expected student scores can be mapped. These expectations generally fail. Corrections are then needed to equate the average difficulty of tests from one year to the next year.

The Nebraska and Alaska data show that the exact location of individual student ability is also quite blurred. An attempt to extract individual growth (2008) therefore understandably failed on a paper test, but showed promise using CAT.

CAT is now (2010) being promoted as a better way than using paper tests to assess individual growth far from the passing cut score. [Psychometricians have traditionally worked with group averages, not with individuals.]

Forced-choice CAT, at the 1/2 right difficulty level, is the most vicious form of naked multiple-choice. Knowledge Factor uses an even higher standard, but clothes items in an effective instructional system. All of its items also assess student judgment.

The claims that CAT can pinpoint exactly what a student knows and does not know are clearly false. CAT can only rank a student with respect to a presumably comparable group.

To actually know what a student knows or can do requires that you devise a way for the student to tell you. There are many ways to do this; most require subjective scoring, and most are compatible with the classroom.

My favorite method was to visit with (listen to) a student answering questions on a terminal. It is only when fully engaged students share their thinking that you can observe and understand their complete performance. This practice may soon be computerized and even made interactive given the current development of voice recognition.

Judgment multiple-choice (JMC) allows an entire class of students to tell you what they trust they know and can do without subjective scoring. JMC can be added to CAT. This would produce a standardized accurate, honest, and fair test compatible with the classroom.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, December 5, 2012

Pearson Ordered Multiple-Choice (OMC)

Psychometricians are obsessed with item discrimination (producing a desired spread of student scores with the fewest number of items) and test reliability (getting the same average test score from repeated tests). Teachers and students need to know what has been mastered and what has yet to be learned. These two goals are not fully compatible.

In fact, mastery produces a score near 100% and material yet to be learned a score near 0%, but psychometricians want an average test score near 50% to maximize their favorite calculations. Traditional multiple-choice (TMC) generally produces a convenient average classroom test score of 75% (about 25 points from guessing on four-option items, plus roughly 50 points from a mix of mastery and discriminating items).
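The guessing arithmetic behind that convenient 75% can be sketched directly; the known fractions below are illustrative values, not reported data:

```python
def expected_tmc_score(known_fraction, options=4):
    """Expected traditional multiple-choice score: right marks from knowledge
    plus lucky guesses on the rest (a 1-in-`options` chance each)."""
    return known_fraction + (1 - known_fraction) / options

print(expected_tmc_score(2 / 3))   # about 0.75: a 75% score from knowing two-thirds
print(expected_tmc_score(0.0))     # 0.25: the test design value from pure guessing
```

A 75% TMC score can therefore come from a student who actually commands only about two-thirds of the material, with luck supplying the rest.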

The TMC test ranks students by their performance on the test and their luck on test day. It does not ask them what they really trust they know: the knowledge of value that is the basis for further learning and instruction (the information needed for effective formative assessment).

Pearson announced a modification to TMC in 2004 (distractor-rationale taxonomy). In 2010 Pearson reported on a study using ordered multiple-choice (OMC) that still forces students to mark an answer to every item rather than use the test to report what they actually trust they know or can do (the basis for further learning and instruction).

The first report introduced OMC. The second demonstrated that it can actually be done. OMC ranks item distractors by the level of understanding.

Other themes and counts of distractors can also be used. This method of writing distractors makes sense for any multiple-choice test. The big difference is in scoring the distractors.

An OMC test is carried out with the weight for each option determined prior to administering the test. This requires priming (field testing) to discover items that perform as expected by experts. With acceptable items in hand, the test is scored 1, 2, 3, or 4 for the four options, corresponding to four levels of understanding (Minimal, Moderate, Significant, and Correct).
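That scoring rule is easy to sketch. The item, its option letters, and the key below are hypothetical; only the 1-through-4 weights for the four levels come from the description above:

```python
# OMC weights: each option is keyed to a level of understanding
# and scored 1-4 rather than right/wrong.
OMC_WEIGHTS = {"Minimal": 1, "Moderate": 2, "Significant": 3, "Correct": 4}

def score_omc(responses, keys):
    """Sum the pre-assigned option weights over a student's responses.
    `responses` lists the option marked on each item; `keys` maps each
    item's options to their level of understanding."""
    return sum(OMC_WEIGHTS[keys[i][choice]] for i, choice in enumerate(responses))

# One-item illustration: option "B" was written as the Significant distractor.
keys = [{"A": "Minimal", "B": "Significant", "C": "Moderate", "D": "Correct"}]
print(score_omc(["B"], keys))   # 3
print(score_omc(["D"], keys))   # 4
```

Unlike right/wrong scoring, a near-miss distractor earns more than a naive one, which is where the insight into student understanding comes from.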

TMC involves subjective item selection by a teacher or test expert with right/wrong scoring. This ranks students. OMC involves both subjective item and subjective distractor selection with partial credit model scoring. OMC is a refinement of TMC.

OMC student rankings include an insight into student understanding. How practical OMC is and how it can be applied in the classroom is left for further study. I would predict it will be used in standardized tests in a few years after online testing provides the needed data to demonstrate its usefulness.

The OMC answer options are sensitive to how well a test matches student preparation. This fitness, the expected average test score when students do not know the right answer and guess after discarding all the options they know are wrong, is calculated by PUP520 for each test. This value can range from the test design value (25% for a four-option test) to above 80% on a test that closely matches student preparation.
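PUP520's actual computation is not reproduced here, but the idea behind the fitness value can be sketched under one simplifying assumption: on each unknown item the student guesses at random among only the options they could not discard as wrong. The discard counts below are illustrative:

```python
def expected_guess_score(discarded_counts, options=4):
    """Expected score from guessing after discarding recognized-wrong
    distractors: a 1-in-(options - discarded) chance on each item."""
    return sum(1.0 / (options - d) for d in discarded_counts) / len(discarded_counts)

# No options discarded: the test design value of 25% for four-option items.
print(expected_guess_score([0, 0, 0, 0]))   # 0.25
# Well-prepared students discard most wrong options before guessing:
print(expected_guess_score([3, 3, 3, 2]))   # 0.875
```

This shows how the same "guessing" can yield anything from the design value of 25% to well above 80% when the test closely matches student preparation.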

[All tests make a better fit to one small group of students, and a worse fit to another small group of students, than to the entire class. This is just one part of luck on test day. There is no way to know which students are favored or disfavored using forced-choice testing. Judgment Multiple-Choice (JMC) permits each student to control quality independently from quantity.]

Another factor to consider when using OMC is that the number of answer options could be reduced to three (Minimal, Moderate, and Correct) to increase the portion of distractors that work as expected. Knowledge Factor only uses three answer options and omit (JMC) in its patented instruction/assessment system that guarantees mastery.

My suggestion is to add one more option to OMC: omit. Then student judgment would also be measured along with that of the psychometricians and teachers. Judgment ordered multiple-choice (JOMC) would then be a refined, honest, accurate, and fair test.

We would know what students value as the basis for further learning and instruction by letting them tell us. This makes more sense than guessing what a student knows when 1/2 of the right marks may be just luck on test day.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.