Wednesday, August 24, 2011

Standardized Testing - Structure, Function, and Operation

The structure, function and operation of standardized testing must all be considered when evaluating the usefulness of test results. Standardized test results are not always what they are claimed to be. When mixed with politics, they usually have even less value, as will be discussed near the end of this post.

Standardized testing involves test score distributions (statistical tea leaves). Their two most easily recognized characteristics are the average score, or mean, and the spread of the distribution, or standard deviation (SD).

Two methods of obtaining score distributions are now in use. The traditional method, counting right marks on a multiple-choice test, is the same as used on most classroom tests. The Rasch model method, used by many state education departments, converts test results to estimated measures of student ability and of item difficulty.

The value of multiple-choice test results depends upon how the test is administered. Both methods allow for two modes: forced choice (mark every answer) and student choice, in which students select the items they can use to report what they trust as a usable basis for further learning and instruction.

The following table relates the above four combinations to two software programs, the fixed, reproducible structures that produce score distributions. Power Up Plus (PUP) and Winsteps produce score distributions for classroom use and for standardized testing.

Three of the four modes produce traditional right count quantitative score distributions: Quantity Scores. KJS adds a quality score that is comparable to the full credit mode measure distribution.

The distribution of scores from a traditional multiple-choice test can be a good indicator of classroom performance (teacher and student). As a standardized test, only counting right marks places as much, if not more, emphasis on test performance as on student performance. Items are carefully selected to produce a predicted score distribution. This score distribution is expected to match some subjectively set standard (cut score) such as grade level or job readiness. But how the test is administered changes the value and meaning of key functions. Forced choice and student choice produce two different views from the same students.

For many historical reasons, including tradition and short-term accountability, NCLB has used the forced choice mode that only assesses and promotes the lowest levels of thinking. It is fast, cheap, and ineffective. Testing, and unfortunately as a result teaching, limited to the lowest levels of thinking is more counterproductive the longer students are exposed to it. This may be an underlying factor in the poor showing made by high school students, in general, in relation to the lower grades (the spread between the levels of thinking required and those that seniors possess may contribute to the current emphasis on senior attitude).

When students are allowed to report what they trust as a basis for further learning and instruction, a wealth of information becomes available for student counseling to direct student development. PUP allows students to switch from forced choice to reporting what they know when they are comfortable doing so.  Knowledge Factor is a patented instructional/assessment system that guarantees mastery learners. Development to use all levels of thinking is critical to success in school and in the workplace.

Many ways of operating standardized testing have been used in assessing students for NCLB. Multiple-choice was derided at first and then returned as the primary method. Almost everything that is not assessed by actual performance can be usefully measured with multiple-choice (A, B, C, D and omit). Traditional multiple-choice was crippled by dropping the option to omit (don’t know) early on. Just counting right marks was easier and gave a useable ranking for grading. How the rank relates to what a student knows or can do is still an open debate. Knowledge and Judgment Scoring settles this matter with a quality score.

A test maker (teacher or standardized item author) has all of the above structure and function options to consider when creating an operational test. The value of the final test results depends upon how the options are mixed and handled (a simple ranking or an assessment of what is known and can be done along with the judgment to use it well).

Test banking can be very simple. It can be a list of 25 questions that is edited each semester. The test is then scored by any one of the above four modes. The choice depends upon the use of the results. RMS ranks students and permits comparing your success from year to year. KJS and the partial credit Rasch model explore which students are still lingerers, followers and self-directed learners. The quality score can point out what each student knows or can do as the basis for further learning and instruction regardless of the test score.

Test banking can be very complicated, time consuming, and expensive.  Winsteps appears to be about the least complicated, the least time consuming and the least expensive way for standardized testing. It has been used by many states.

A test bank is created from items that have been calibrated by Winsteps. A high scoring sample will produce items with low difficulty. A low scoring sample will produce items with high difficulty. Equating, with the use of a set of common items, can bring these together if the two samples are believed to be from the same population. Winsteps does not know to do this on its own. When and how to equate requires an operational decision.
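The common-item idea can be illustrated with a small sketch. It assumes the simplest form of linking, a mean shift computed from the common items, which is one textbook approach and not necessarily what Winsteps users or any state department actually do. The function name, item labels, and difficulty values are all hypothetical.

```python
from statistics import mean

def equate_by_common_items(base, new, common):
    """Shift the 'new' calibration onto the 'base' scale.

    base, new: dicts mapping item id -> difficulty in logits (hypothetical data).
    common:    ids of the items that appear in both calibrations.
    Assumption: a single mean shift on the common items is adequate.
    """
    shift = mean(base[i] for i in common) - mean(new[i] for i in common)
    return {item: difficulty + shift for item, difficulty in new.items()}

# Hypothetical calibrations: the second sample scored higher, so every
# item looks about one logit easier than it did in the base calibration.
base_calibration = {"c1": -0.5, "c2": 0.0, "c3": 0.5, "a1": 1.0}
new_calibration = {"c1": -1.5, "c2": -1.0, "c3": -0.5, "b1": 0.2}
common_items = ["c1", "c2", "c3"]

equated = equate_by_common_items(base_calibration, new_calibration, common_items)
for item, difficulty in sorted(equated.items()):
    print(f"{item}: {difficulty:+.2f} logits")
```

Whether the shift should be applied at all is the operational decision described above: it only makes sense if the two samples are believed to come from the same population.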

However the operations are carried out, human intervention is needed to start them and thereafter at about every other step. Standardized testing is still a mix of art, science and politics.

A benchmark test is selected from the test bank. A range of item difficulties is selected to match the population to be assessed. A small common item set is included. The mean and standard deviation of the predicted distribution are calculated.  Time and money permitting, the benchmark test is administered one or more times. Now a known mean and standard deviation are in hand for the distribution. This ends research.

An application test is administered to the full population: every Algebra I student in the state, for example. This operational test also contains a set of common items used in creating the benchmark test. Winsteps scores the application test.

Resolution of the test results is not the same as equating items for a test bank. Winsteps can be used here in the same manner as in test banking, but the environment is now very different. A pre-application public declaration of cut scores is no longer recommended due to newly found (Feb 2011) sources of score instability. If the operational test has not performed as expected, the needed adjustment can favor the desired performance for the average score, the cut score, the scaled score, the percent passing, or the percent improvement. Public exposure of average scores has been requested by the Center on Education Policy (CEP), Open letter to the member states of PARCC and SBAC, May 3, 2011. Everyone can then know the starting point for whatever resolution adjustments are made. This would help reestablish public trust and increase the value of test results.

Test banking data can be liberally culled to obtain the best fit of data to the Rasch model because of the unique properties of the model. That same liberal attitude is, in my opinion, not justified when manipulating the operational test results.

The final step for Winsteps is the conversion of measures to expected raw scores. The conversion is a matter of changing log units to normal units when the test results are not manipulated. No human judgment is required. A normal bell curve distribution is again created.
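As a rough illustration of that conversion, the sketch below uses the standard dichotomous Rasch formula: the probability of a right mark depends only on the difference between the student measure and the item difficulty, both in logits, and the expected raw score is the sum of those probabilities over the items. This is a minimal sketch of the model's arithmetic, not Winsteps code; the function names and the item difficulties are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a right mark under the dichotomous Rasch model (logits in)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_raw_score(ability, difficulties):
    """Expected count of right marks for one student measure on a set of items."""
    return sum(rasch_probability(ability, d) for d in difficulties)

# Hypothetical five-item test with difficulties centered on zero logits.
item_difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

for measure in (-2.0, 0.0, 2.0):
    score = expected_raw_score(measure, item_difficulties)
    print(f"measure {measure:+.1f} logits -> expected raw score {score:.2f} of {len(item_difficulties)}")
```

A student whose measure equals the center of a symmetric set of item difficulties is expected to get half of the items right, which is why the middle line of the output reports 2.50 of 5.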

This brings this series of posts related to the high jinks exposed in several state education departments to an end. Over the past few years several states, including Texas and Illinois, have displayed marked deficiencies in their short-term competition for federal money and adequate yearly progress (AYP); this was part of the motivation for this yearlong investigation into Rasch model IRT test analysis. During this last year New York presented the worst example I know of. In my opinion the recent cheating scandals in Georgia will have done less damage to students, teachers and schools than the manipulation of New York state test results by state officials.

Arkansas, on the other hand, has posted almost perfect examples for AYP on NCLB tests for over a ten-year period: 2001-2011 End-of-course Comparison.

(The percent combined proficient and advanced is a derived value. Average test scores, and related cut scores, are based directly upon student marks on the test.)

This demonstrates exceptional skill in managing test performance. Such a performance has therefore invited suspicions of the test becoming more standardized on test performance (the test score) than on student performance (what students know and can do). Were that to be true, it would make Arkansas a good case of successful well-intentioned self-deception, created by instruction (curriculum), learning (level of thinking) and assessment (test items) being optimized for NCLB test results. These doubts are probably not valid given the awards won and leadership demonstrated by Arkansas. Comparison with NAEP also shows that two different views of the same students can vary a great deal. Both views may be validated with sufficient student performance information to clarify what each test is testing. Arkansas has also equated classroom and state test scores as part of their management of grade inflation (again, two views of the same students).

Replacing the national academic lottery conducted with right count scored tests with tests that actually assess what students know and can do, as the basis for further learning and instruction, is one way of clarifying this situation (Knowledge and Judgment Scoring and the partial credit Rasch model, for example). The same tests now used for ranking can also be used when upgrading classroom testing (to assess both quantity and quality) to better prepare students for whatever forms of questions are used on the new NCLB tests. There is a great increase in useful information for students and teachers to direct classroom assignments and activities at all levels of thinking. Or replace the classroom with a complete instruction/assessment package like Amplifire.

The spread of certified competency-based learning may help bring about the needed change in assessment methods. A test must measure what it claims it is measuring. The test results must not be subject to a variety of secretive factors that only delay the inevitable full disclosure. “You can fool part of the people (including yourself) part of the time, but not all of the people all of the time.” The software packages are honest. It is how they are used that is open to question.

Wednesday, August 10, 2011

Grading Clicker Data

The clicker data provided by GMW11 can be assigned grades in many ways. A traditional multiple-choice curve used by GMW11 produced 3 A, 1 B, 24 C, 24 D, and 69 F grades with an average score of 34%.

A typical Knowledge and Judgment Scoring (KJS) distribution, with letter grades set every ten percentage points, would be 1 B, 3 C, 13 D, and 104 F grades. A KJS curve comparable to a right mark scoring (RMS) curve yields 4 A, 3 B, 15 C, 31 D, and 68 F grades with an average score of 49%. Nearly the same number of students, 69 and 68, receive an F under each curve.
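Setting letter grades every ten percentage points can be sketched as follows. The cut points assume the scale mentioned in the earlier clicker post (60% pass/fail line, 70% C, 80% B, 90% A), with D inferred as the 60 to 69 band. The scores in the example are hypothetical and are not the GMW11 data, and no curving is applied.

```python
from collections import Counter

# Assumed cut points: 90% A, 80% B, 70% C, 60% D (pass/fail line), else F.
CUTS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def letter_grade(score_pct):
    """Return the letter grade for a percent score using fixed ten-point bands."""
    for cut, letter in CUTS:
        if score_pct >= cut:
            return letter
    return "F"

def grade_counts(scores):
    """Count how many scores fall in each letter grade, A through F."""
    counts = Counter(letter_grade(s) for s in scores)
    return {letter: counts.get(letter, 0) for letter in "ABCDF"}

# Hypothetical percent scores, for illustration only.
example_scores = [95, 82, 71, 65, 59, 34, 0]
print(grade_counts(example_scores))
```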

Comparable curving produced similar grade distributions. However, what is being assessed and rewarded is very different. An RMS curve is based on a student’s luck on test day (both in marking and in the selection of questions presented on the test). A KJS curve is based on each student’s self-assessment; it combines knowledge and judgment in selecting questions to use to report what is actually known. Top students earn the same grades by both methods, as do most poor students.

High quality, self-assessing, students earn a reward for reporting what they can trust as the basis for further learning and instruction. The sharper the incline connecting RMS and KJS scores on the chart the higher the quality. High quality students are teachable. KJS identifies them. RMS does not.

By scoring the clicker data by both methods and curving the scores in the same manner, the difference in student performance on the two scoring methods is clearly exposed. The task of the RMS student is to mark the best guess of a right answer for each question. Understanding, problem solving, and reading ability are secondary and even, at times, unnecessary. These are all crucial for a KJS student to determine if a question can be used to report something that is understood or which has sufficient relationships with other information or skills that a verifiable right answer can be marked.

In this day, all multiple-choice tests should offer both methods of scoring. Students can easily switch from lower to higher levels of thinking; from little responsibility to near full responsibility for learning. Successful implementation requires letting students make the switch. Forcing students into KJS is about as unproductive a thing to do as forcing them to mark an answer to every question on a test they cannot understand or at times even read. Power Up Plus scores both methods, as does Winsteps (full credit and partial credit Rasch IRT models). No additional preparation time or effort is needed beyond that required for creating any multiple-choice test.

To the student: Your highest score/grade is obtained by being honest in reporting what you know, understand, and can trust at any level of preparation.

To the teacher: You know what each student can do and understand as the basis for further learning and instruction.

To the administrator: You know the levels of thinking, for each student, and in classroom instruction, as passive pupils prepare to be independent learners (self-assessing, self-correcting scholars).

Knowledge and Judgment Scoring promotes student development when used on essay tests, multiple-choice tests, and I would suggest the same for clicker data.

Wednesday, August 3, 2011

Scoring Clicker Data

I was recently presented with some clicker data to examine (GMW11). It had been scored by traditional right count scoring. There were a number of scores below 20%. There were even four students with a score of zero. That is way below the average guessing score when using five options on each question.
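The guessing baseline can be checked with a short calculation. The sketch below assumes a student who marks every one of the 23 questions by blind guessing among five equally likely options, so the count of right marks follows a binomial distribution. That is a simplifying assumption for illustration; the function name is hypothetical, and real students neither guess blindly nor necessarily mark every question.

```python
from math import comb

def guess_score_distribution(n_items, n_options):
    """Probability of k right marks (k = 0..n_items) under blind guessing."""
    p = 1.0 / n_options
    return [comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
            for k in range(n_items + 1)]

N_ITEMS = 23    # questions on the clicker test
N_OPTIONS = 5   # answer options per question

dist = guess_score_distribution(N_ITEMS, N_OPTIONS)
expected_right = sum(k * pk for k, pk in enumerate(dist))
p_zero = dist[0]

print(f"expected right marks: {expected_right:.1f} of {N_ITEMS} "
      f"({100 * expected_right / N_ITEMS:.0f}%)")
print(f"chance of a zero score when guessing on every item: {100 * p_zero:.2f}%")
```

Under these assumptions a zero score from guessing on all 23 items would be rare, so four zero scores suggest something other than blind guessing on every question.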

A different score distribution was produced by scoring the data for both knowledge and judgment. This distribution looks very much like what one would expect from students on their first introduction to Knowledge and Judgment Scoring. Students earn the same scores by both methods of scoring when they fail to exercise their own judgment (mark an answer to every question). The top three students therefore obtained the same score with both methods of scoring.

Here is an opportunity to compare Right Mark Scoring (RMS) and Knowledge and Judgment Scoring (KJS) when used on any multiple-choice test. There is one catch: both methods of scoring are being applied to one and the same set of answer sheets.

Normally students would elect which method they felt comfortable using (and if time permits, on the first or second test, they may fill out two answer sheets, one for each method of scoring). The same test data can support a number of different stories. This story will assume that the test was presented with a choice of RMS and KJS, and further, that this was the first such test for the class. Most would be expected to select what they are most familiar with: RMS.

When quantity and quality are scatterplotted from RMS data, the result is a straight line. Only one dimension is being measured: a count of right marks.
KJS data are two-dimensional. A range of quality scores can yield the same test score. The test score of 46% was earned by students with a range of quality scores from zero to 44%.

Higher quality students are found above 50%. Lower quality students are found below 50%. Higher quality students get higher scores by marking more right answers. Only one student marked a perfect 100% quality score (no wrong marks).

Lower quality students get lower scores by marking more wrong answers. Four marked a zero quality score (every one of their 2 to 10 marks on the 23-question test was wrong).

Quantity and quality have been given equal value. The active test score then starts at 50%: 1 point for right, 1 point for good judgment (omit or right), and zero for wrong (poor judgment). [Back in the 1970s, when this work first began, the active test score started with zero. It was called net yield scoring; right minus wrong. The discovery of the quality score produced the second dimension that assesses student performance rather than defaulting to luck on test day.]
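The scoring rule just described can be written out as a short sketch. It assumes that each question carries two points (right = 2, omit = 1, wrong = 0), so that an all-omit answer sheet scores 50%, and that the quality score is the percent of marked answers that are right, as implied by the descriptions of 100% and zero quality scores above. The function name is hypothetical and this is not Power Up Plus code.

```python
def kjs_scores(right, wrong, n_items):
    """Return (test score %, quality score %) for one answer sheet.

    right, wrong: counts of right and wrong marks; the rest are omits.
    Quality is None when nothing was marked (no marks to judge).
    """
    if right < 0 or wrong < 0 or right + wrong > n_items:
        raise ValueError("right and wrong must be non-negative and fit the test length")
    omit = n_items - right - wrong
    points = 2 * right + omit            # right: knowledge + judgment; omit: judgment only
    test_score = 100 * points / (2 * n_items)
    marked = right + wrong
    quality = 100 * right / marked if marked else None
    return test_score, quality

N_ITEMS = 23  # length of the clicker test

# Two very different answer sheets that earn the same test score.
for right, wrong in [(0, 2), (7, 9)]:
    score, quality = kjs_scores(right, wrong, N_ITEMS)
    print(f"{right} right, {wrong} wrong: test score {score:.0f}%, quality {quality:.0f}%")
```

Both sheets round to a test score of 46%, with quality scores of zero and 44%, which matches the range reported for the 46% test score in the scatterplot discussion.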

The end result of training students to accurately report what they trust they know or can do is shown in the Fall88 scatterplot. After an initial test (such as the clicker data) where most students elect RMS, they change study habits, and voluntarily switch to KJS. Here most of the class show a quality score about one letter grade higher than their test score. There is a bit of a disconnect at the pass/fail line of 60% (70% C, 80% B, and 90% A). Experienced students feel more comfortable reporting what they know than guessing at answers on all items on the test. They are on the path to being independent learners (self-correcting scholars).

This is in contrast to traditional right mark scoring, where any score can range from one letter grade higher (with good luck) to one letter grade lower (with bad luck) than the student’s actual ability warrants. A grade of B one day may be a D on another day. And no one, including the student, knows what the student actually knows and can do as the basis for further learning and instruction.

Grading has an important effect on which scoring method students select: RMS (“I mark, you score”) or KJS (self-assessment, “I tell you”). RMS students tend to cram and to match. KJS students bring a rich web of relationships (from learning by questioning, answering, and verifying) that they can apply to questions they have not seen before. There is an operational difference between remembering and understanding that can be measured (RMS vs. KJS).