The history of accountability, from the school up to the
state department of education, has been quite varied. Only after running
this preposterous natural experiment for ten years is it being challenged in
ways that may be effective in either bringing it to an end or in correcting its
excesses. Congress created this absurd monstrosity by setting an impossible
goal for all students to meet. It then reneged on its oversight responsibility
to act in good faith to avoid many of the unintended consequences that occurred
(it is now five years behind the time it should have acted on needed changes).
Self-regulation is a lofty idea. It has failed miserably for
mortgages, for Wall Street derivatives, and in the futures market (all of which
were presumably being regulated). The same can be expected in state departments
of education that must come up with acceptable numbers to obtain federal
funding. The two consortia (PARCC and SBAC) promoting the Common Core State Standards
promise to serve as checks on one another. This is an expensive and
ambitious political solution that may have its own downside, depending on
implementation. If all states release actual student test scores, there
will be a way to determine how creative states are in setting passing
rates.
The passing rate has been politically exploited in several states.
New York is the prime example. Diane Ravitch posted on 21 February 2012, “Whence
came this belief in the unerring, scientific objectivity of the tests? Only 18
months ago, New York tossed out its state test scores because the scores were
unreliable. Someone in the state education department decided to lower the cut
scores to artificially increase the number of students who reach proficient. No
one was ever held responsible.”
Michael Winerip posted on 10 June 2012, “Though this may be
the worst breakdown in 15 years of state testing, it does not appear that
Florida politicians have any interest in figuring out who was responsible. The
commissioner? Department officials? Someone at Pearson, the company that scored
the writing tests?” Winerip reports further that “The audit referred to
lowering the passing score to 3 as ‘equipercentile equating’.” That is, the
score was lowered until the same portion of students passed this year as passed
last year. [As I am writing this, the commissioner resigned.]
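The mechanics the audit describes can be sketched in a few lines. This is a minimal illustration, not Pearson's actual procedure, and the score distributions below are entirely hypothetical:

```python
# Sketch of "equipercentile equating" as described in the audit:
# lower this year's cut score until the same proportion of students
# passes as passed last year. All data below are hypothetical.

def equate_cut_score(last_year_scores, this_year_scores, last_year_cut):
    """Find the cut score that reproduces last year's passing rate."""
    target_rate = (sum(s >= last_year_cut for s in last_year_scores)
                   / len(last_year_scores))
    # Walk cut scores from high to low until the pass rate catches up.
    for cut in sorted(set(this_year_scores), reverse=True):
        rate = sum(s >= cut for s in this_year_scores) / len(this_year_scores)
        if rate >= target_rate:
            return cut, rate
    return min(this_year_scores), 1.0

last_year = [2, 3, 3, 4, 4, 5, 5, 6]   # cut of 4: 5 of 8 students passed
this_year = [1, 2, 2, 3, 3, 4, 4, 5]   # harder test: cut of 4 passes only 3 of 8
cut, rate = equate_cut_score(last_year, this_year, last_year_cut=4)
print(cut, rate)  # the cut drops to 3 to restore last year's pass rate
```

Notice that nothing in the procedure asks what a score of 3 means; it only asks what proportion of students it lets through.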
As is the case with mortgages, derivatives, and futures, it
is difficult, in most cases, to say whether a crime was committed or merely
very poor judgment was exercised until the chain of events is carefully studied,
including the false “belief in unerring” test scores. My own explanation here
is that research results and application results are not the same thing. In
research you predict acceptable results. In application you examine the results
for meaningful, useful relationships (equipercentile equating to obtain the
desired pass rate; any relationship between a ranking on the test and what
students actually know or can do is mostly coincidental near the cut score).
Has a crime been committed? In the case of New York, and
other states that must now “explain” why student test scores are dropping on
new tests, I would say, “Yes.” For states that choreographed an almost
perfect, ever-slowing increase in the pass rate over the past 8 to 10 years,
the answer is problematic. It can range from outright cheating to self-deception;
from equipercentile equating to selecting test items that produce the desired
results, the standard practice for classroom test score management.
On 18 May 2012, Valerie Strauss posted the white paper
released by the Central Florida School Board Coalition. This lengthy paper
details the unintended consequences and the downright sloppy test items used on
their standardized tests. My own software, Power Up Plus (PUP), can pick out such
items when run on a notebook computer in a matter of minutes. I am amazed
that such items are used, considering the millions of dollars spent on
developing and administering these tests. I strongly suspect their
development process is flawed.
Cory Doctorow posts, “The Test Item Specifications are the
guidelines that are used to write the test questions. If the Science FCAT test
is reviewed by the same Content Advisory Committee that reviewed the Test Item
Specifications, then it probably has similar errors.” From my experience, a valid
test item must assess exactly what it says (concrete level of thinking: what
you see is what you get) or be an indicator of knowledge or skill of things in
the same class (1 + 3 = 4 to assess addition of integers). Questions whose right
answers vary with level of thinking, socio-economic status, state, religion, politics,
ethnicity, or current political correctness are not to be used. That is, stick to the
topic or skill, not to what the topic (agenda) or skill may be used for. Where a
question is on topic but has answers that differ for the above reasons, it should
stand; this is part of the broadening effect of education. A recent example is the
Missouri constitutional amendment, voted on yesterday, to protect the religious
rights of school children.
In the case of Florida, once again, faulty predictions were
made based on some type of research. The entire system (instruction, learning,
and assessment) was not fully understood or coordinated, with disastrous
results. And again, another state education official has resigned. Was this a
crime or just a waste of millions of dollars and millions of instructional and
learning hours?
On 24 April 2012, Valerie Strauss posted the National Resolution Protesting
High-Stakes Standardized Testing, which is based on the Texas Resolution Concerning
High Stakes, Standardized Testing of Texas Public School Students. These two resolutions,
combined with the Florida white paper, make a strong political protest that may
take years to obtain results. The desired results are not specifically stated.
We are back to the days of “alternative assessment”: do something different. But
doing something different must be at least as good as what we have, or we lose
again, as happened with the authentic assessment and portfolio movements.
As well intentioned as all the people working on this
assessment problem are, most are still riding their safe, steady tricycle: the
traditional forced-choice multiple-choice test, scored at the lowest levels of
thinking, that they were exposed to long before it was used for NCLB testing.
It actually worked fairly well back then, when the average test score was 75% or
higher. It has failed miserably when NCLB cut scores dropped below
50% (a region where luck of the day determines the rank of pass or fail). Only
when these people are willing to get off their old tricycles will they have
any interest in getting on a bicycle (where students can actually report what
they know and can do).
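A short calculation shows why a low cut score hands the decision to luck. The sketch below assumes a hypothetical 40-item, 4-option forced-choice test where every unknown item is guessed at chance (p = 0.25); the specific numbers are illustrative only:

```python
# How much does guessing decide pass/fail on a forced-choice test?
# Hypothetical model: a student answers n_known items from knowledge
# and guesses the rest with probability 0.25 per item (4 options).
from math import comb

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

def pass_probability(n_items, n_known, cut_score, p_guess=0.25):
    """Chance of reaching the cut when the rest is pure guessing."""
    if n_known >= cut_score:
        return 1.0
    needed = cut_score - n_known   # right answers that must come from luck
    return binom_tail(n_items - n_known, p_guess, needed)

# Hypothetical 40-item test with the cut score at 16 (40%).
print(pass_probability(40, 8, 16))   # knows only 20%: close to a coin flip
print(pass_probability(40, 20, 16))  # knows 50%: passes with certainty
```

Near the cut score, two students with identical knowledge can land on opposite sides of pass/fail on different days, which is exactly the "luck of the day" problem above.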
Knowledge and Judgment Scoring allows students to
individualize standardized tests: to select the level of thinking they will use,
to guess at right answers, or to use the questions to accurately report what they
actually know and can do. We only need to change the test instructions from
“Mark an answer on each question, even if you must guess” to “Mark an answer
only if you can use the question to report something you trust you know or can
do.” Change the scoring from “two points for each right mark and zero for each
wrong mark” to “zero for each wrong mark, one point for each omit (good judgment
not to guess and make a wrong mark), and two points for each right mark (good
judgment and a right answer).”
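The scoring rule above is simple enough to write down directly. Here is a minimal sketch (the answer key and responses are hypothetical, not output from PUP):

```python
# Knowledge and Judgment Scoring as described above:
# 0 for a wrong mark, 1 for an omit, 2 for a right mark.

def kjs_score(key, responses):
    """Return (score, max_score). `None` in responses means the
    student chose to omit rather than guess."""
    score = 0
    for correct, marked in zip(key, responses):
        if marked is None:
            score += 1          # good judgment: no guess, no wrong mark
        elif marked == correct:
            score += 2          # good judgment and a right answer
        # a wrong mark earns nothing
    return score, 2 * len(key)

key       = ["A", "C", "B", "D", "A"]
responses = ["A", None, "B", "B", None]   # two right, one wrong, two omits
print(kjs_score(key, responses))  # (6, 10)
```

Under right-mark counting the same student would hold 4 of 10 points; here the two honest omits are worth as much as one more right answer.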
We can now give students the same freedom given with an essay,
project, or report to tell us what they actually know and can do. We also have
the option to commend students, as on other alternative tests: “You did a great
job on the questions you marked. Your quality score of 90% is outstanding.”
This quality score is independent of the quantity score. You can now honestly
encourage traditionally low scoring students for what they can do rather than
berate them for what they cannot do (or for their bad luck on traditional
forced-choice tests).
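The independence of the two scores is easy to demonstrate. In this hedged sketch (hypothetical responses, not the exact report PUP prints), quantity is right marks over all items, and quality is right marks over the items the student chose to mark:

```python
# Quantity vs. quality under Knowledge and Judgment Scoring:
# quantity = right marks / all items
# quality  = right marks / items the student chose to mark

def quantity_and_quality(key, responses):
    marked = [(c, m) for c, m in zip(key, responses) if m is not None]
    right = sum(c == m for c, m in marked)
    quantity = right / len(key)
    quality = right / len(marked) if marked else 0.0
    return quantity, quality

key       = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
responses = ["A", None, "B", None, "A", None, "C", None, "A", None]
qty, qual = quantity_and_quality(key, responses)
print(qty, qual)  # quantity 0.5, quality 1.0
```

This student marked only half the test but was right on every mark: a low quantity score paired with a perfect quality score, which is something a right-count-only test can never show.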
The NCLB monster may be controlled by legal action (see
prior post), by political action, or by simply changing its temper from a
frightening bully to an almost friendly companion. Just select your breed:
Knowledge and Judgment Scoring in PUP, the Partial Credit Model in
Winsteps, or Amplifier by Knowledge Factor.
Why continue just counting right marks, making unprepared
liars out of lucky winners and misclassifying unlucky losers for remediation?
We know better now. It no longer has to be that way. “Whence came this belief
in the unerring, scientific objectivity of the [dumb, forced-response, guess]
tests?” We need to measure what is important. We do not need to make an easy
measurement (forced multiple-choice and essay at the lowest levels of thinking)
and then try to make the results important (at higher levels of thinking).
There is a difference in student performance between riding a tricycle and a
bicycle. We cannot hold students responsible for bicycling if they only practice and test on tricycles.