Wednesday, December 19, 2012

Nebraska Assessment Four Star Update


The 2012 NeSA Technical Report contains the information needed to complete (and make corrections on) the Grade 3 Reading Performance chart. The reported portion passing was 76% for Grade 3 (Nebraska Accountability/NeSA Reading/Grade 3).


The observed average score reported was 70%. The estimated expected score was about 66% [No calibration values were given for 10 fairly easy items that were at the beginning of the 2012 test].

The students did better in 2012 on a test that may have been more difficult than in 2011. [The lack of the calibration data on 10 easier items is critical to verifying what happened.]

[All students were presented with all 45 questions. Although the test was taken online, it was not a computer adaptive test (CAT). The test design item difficulty was 65%, which is 15 percentage points above the CAT design value of 50%.

“Experience suggests that multiple choice items are effective when the student is more likely to succeed than fail and it is important to include a range of difficulties matching the distribution of student abilities (Wright & Stone, 1979).” (2012 NeSA Technical Report, page 31)

The act of measuring should not alter the measurement. The Nebraska test seems to be a good compromise between what psychometricians want to optimize their calculations and what students are accustomed to in the classroom. CAT at 50% difficulty is not a good fit.]


Fifteen common (core) items were used in all three years: 2010, 2011, and 2012. Their results are remarkably stable. This testifies to the skill of the test creators in writing, calibrating, and selecting items that present a uniform challenge over the three years.

It also shows that little has changed in the entire educational system (teach, learn, assess) with respect to these items. [Individual classroom successes are hidden in a massive collection of several thousand test results.]

My challenge to Nebraska to include student judgment on standardized tests resulted in about the same number of hits on this blog as the letters mailed. No other contact occurred.

This means that standardized testing will continue counting right marks that may have very different meanings. At the lowest levels of thinking, good luck on test day will be an important contributing factor for passive pupils to pass a test where passing requires a score of 58% on a scale with a mean at 70%.

Students able to function at higher levels of thinking but with limited opportunity to prepare for the test will not be able to demonstrate the quality of what they do know or can do. Both groups will be ranked by right marks that have very different meanings.

The improvement in reading seen in the lower Nebraska grades (Nebraska Accountability/NeSA Reading) failed to carry over into the higher grades. Effective teachers can deliver better prepared students functioning at lower levels of thinking at the lower grades. [Student quality becomes essential at higher levels of thinking in the higher grades.]

Typically the rate of increase in test scores decreases with each year (average Nebraska Grade 3 scores of 65%, 68% and 69% on the 15 common items) where classrooms and assessments function at lower levels of thinking. Students and teachers need to break out of this short-term-success trap.

[And state education officials need to avoid the temptation many took in the past decade of NCLB testing to produce results that looked right. It is this troubled past that makes the missing expected item difficulty values for 10 of the easier 2012 test items so critical.]

The Common Core State Standards movement is planning to avoid the short-term-success trap. Students are to be taught to be self-correcting: question, answer, and verify. Students are to be rewarded for what they know and can do and for their judgment in using that knowledge and skill.

Over the long term students are to develop the habits needed to be self-empowering and self-assessing. These habits function over the long term, in school and in the workplace. They provide the quality that is ignored with traditional right count multiple-choice tests. In school, if you do not reward it, it does not count.

The partial credit Rasch model and Knowledge and Judgment Scoring allow students to elect to report what they trust they know and can do as the basis for further instruction and learning. Quantity and quality are both assessed and rewarded.

Nebraska can still create a five star standardized test.

Season's Greetings and a Happy New Year!

Wednesday, December 12, 2012

Pearson Computer Adaptive Testing (CAT)


The tools psychometricians favor are most sensitive when a question divides the class into two equal groups of right and wrong. This situation only exists when scoring traditional multiple-choice (TMC) at one point in a normal score distribution: at an item difficulty of 50%.

The invention of item response theory (IRT) made it possible to extend this situation (half right and half wrong) to the full range of item difficulties. IRT also allows expressing item difficulty and student ability on the same logit (log odds) scale.

IRT calibrated items make computer adaptive testing (CAT) possible. Items are grouped by the estimated difficulty that matches the estimated student ability needed to make a right response 1/2 of the time.

Typically, students must select one of the given options. Omit, or “I have yet to learn this”, is not an included option. The failure to include student judgment is a legacy from TMC (see previous posts).

Traditional CAT is therefore limited to ranking examinees. It is a very efficient way to determine if a student meets expectations based on a group of similar students. It is the solitary academic version of Family Feud.

The game is simple. Answer the first question. If right, you will be given a bit more difficult question. If wrong, you will be given a bit less difficult question.

If you are consistently right, you finish the test with a minimum of questions. The same can be said for being consistently wrong.

In between, the computer seeks a level of question that you get right half of the time. If an adequate number of selections fall within an acceptable range, you pass, and the test ends. Otherwise the test continues until a time limit or item count is reached and you fail.
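
To make the game concrete, here is a minimal sketch in Python of the loop just described. Everything in it (the item bank, the fixed step size, the stopping band) is a hypothetical simplification; operational CAT engines use maximum-likelihood or Bayesian ability estimates rather than a fixed step, but the up-harder, down-easier movement is the same.

```python
import math
import random

def simple_cat(bank, respond, max_items=20, step=0.5, window=5, band=0.6):
    """Hypothetical sketch of the CAT 'game': a right answer brings a slightly
    harder item, a wrong answer a slightly easier one, and the test ends early
    once the running ability estimate settles into a narrow band."""
    ability, used, estimates = 0.0, set(), []
    for _ in range(min(max_items, len(bank))):
        # Pick the unused item whose difficulty is closest to the current estimate.
        idx = min((i for i in range(len(bank)) if i not in used),
                  key=lambda i: abs(bank[i] - ability))
        used.add(idx)
        ability += step if respond(bank[idx]) else -step
        estimates.append(ability)
        # Crude stopping rule: the last few estimates stay within a narrow band.
        if len(estimates) >= window and \
           max(estimates[-window:]) - min(estimates[-window:]) <= band:
            break
    return ability

# Example: an examinee with a true ability of +1 logit, responding by chance
# according to the Rasch model (the harder the item, the less likely a right mark).
true_ability = 1.0
respond = lambda difficulty: random.random() < 1 / (1 + math.exp(difficulty - true_ability))
bank = [i / 10 for i in range(-30, 31)]   # item difficulties from -3 to +3 logits
print(round(simple_cat(bank, respond), 2))
```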

If doing paper tests for NCLB was considered the biggest bully in the school, CAT increases the pressure. You must answer each question as it is presented.

You are not permitted to report what you know. You are only given items that you can mark right about 1/2 of the time. You are in a world far different from a normal classroom. It is more like competing in the Olympics.

You are now CAT food. Originality, innovation, and creativity are not to be found here. Your goal is to feed the CAT the answer your peer group selected for you as the right answer 1/2 of the time (that is right, they did not know the right answer 1/2 of the time either).

Playing the game at the 1/2 right level is not reporting what you trust you know or can do. It is playing with rules set up to maximize the desired results of psychometricians. Your evaluation of what you know does not count.

Your performance on the test is not an indication of what you trust you know and can do, but it is generally marketed as such. This is not a unique regulatory situation.

Sheila Bair, Chairman of the Federal Deposit Insurance Corporation, 2006-2011, described the situation in NCLB in terms of bank regulators, “They confuse their public policy obligations with whether the bank is healthy and making money or not.” (Charlie Rose, Wed 10/31/2012 11:00pm, Public Broadcasting System)

Psychometricians confuse their public obligation to measure what students know and can do with their concern for item discrimination and test reliability. This has perpetuated TMC, OMC, and CAT using forced-choice tests. The emphasis has been on test performance rather than on student performance.

[Local and state school administrators further modify the test scores to produce an even more favorable end result, expressed as percent improvement and percent increase by level of performance, and at the same time they suppress the actual test scores. Just like big bankers gambling with derivatives!]

IRT bases item calibration on a set of student raw scores. Items are then selected to produce an operational test of expected performance from which expected student scores can be mapped. These expectations generally fail. Corrections are then needed to equate the average difficulty of tests from one year to the next.

The Nebraska and Alaska data show that the exact location of individual student ability is also quite blurred. An attempt to extract individual growth (2008) therefore understandably failed on a paper test, but showed promise using CAT.

CAT is now (2010) being promoted as a better way than using paper tests to assess individual growth far from the passing cut score. [Psychometricians have traditionally worked with group averages, not with individuals.]

Forced-choice CAT, at the 1/2 right difficulty level, is the most vicious form of naked multiple-choice. Knowledge Factor uses an even higher standard, but clothes items in an effective instructional system. Also all items assess student judgment.

The claims that CAT can pinpoint exactly what a student knows and does not know are clearly false. CAT can rank a student with respect to a presumably comparable group.

To actually know what a student knows or can do requires that you devise a way for the student to tell you. There is a proliferation of ways to do this, most of which require subjective scoring. Most are compatible with the classroom.

My favorite method was to visit with (listen to) a student answering questions on a terminal. It is only when fully engaged students share their thinking that you can observe and understand their complete performance. This practice may soon be computerized and even made interactive given the current development of voice recognition.

Judgment multiple-choice (JMC) allows an entire class of students to tell you what they trust they know and can do without subjective scoring. JMC can be added to CAT. This would produce a standardized accurate, honest, and fair test compatible with the classroom.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, December 5, 2012

Pearson Ordered Multiple-Choice (OMC)


Psychometricians are obsessed with item discrimination (producing a desired spread of student scores with the fewest number of items) and test reliability (getting the same average test score from repeated tests). Teachers and students need to know what has been mastered and what has yet to be learned. These two goals are not fully compatible.

In fact, mastery produces a score near 100%; what has yet to be learned, a score near 0%; but psychometricians want an average test score near 50% to maximize their favorite calculations. Traditional multiple-choice (TMC) generally produces a convenient average classroom test score of 75% (about 25 points from marking items that each have four answer options, plus 50 points from a mix of mastery and discriminating items).
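
A back-of-the-envelope sketch of that arithmetic (my own illustration, not a figure taken from any report): a student who truly knows a fraction of a four-option test and guesses blindly on the rest lands near these familiar averages.

```python
def expected_right_count_score(knows, options=4):
    """Expected traditional right-count score when a student truly knows a
    fraction `knows` of the items and guesses blindly on the rest."""
    return knows + (1 - knows) / options

# Knowing two thirds of a four-option test yields the familiar 75% average:
print(round(expected_right_count_score(2 / 3), 2))   # -> 0.75
# A 50% score only requires actually knowing one third of the items:
print(round(expected_right_count_score(1 / 3), 2))   # -> 0.5
```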

The TMC test ranks students by their performance on the test and their luck on test day. It does not ask them what they really trust they know, that is of value, that is the basis for further learning and instruction (the information needed for effective formative assessment).

Pearson announced a modification to TMC in 2004 (a distractor-rationale taxonomy). In 2010 Pearson reported on a study using ordered multiple-choice (OMC) that still forces students to mark an answer to every item rather than use the test to report what they actually trust they know or can do (the basis for further learning and instruction).

The first report introduced OMC. The second demonstrated that it can actually be done. OMC ranks item distractors by the level of understanding.

Other themes and counts of distractors can also be used. This method of writing distractors makes sense for any multiple-choice test. The big difference is in scoring the distractors.

An OMC test is carried out with the weight for each option determined prior to administering the test. This requires priming (field testing) to discover items that perform as expected by experts. With acceptable items in hand, the test is scored 1, 2, 3, and 4 for the four options, which represent four levels of understanding (Minimal, Moderate, Significant, and Correct).
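
A minimal sketch of how such pre-assigned weights turn marks into partial-credit scores. The item keys and weights below are hypothetical; only the 1-to-4 scale follows the description above.

```python
# Hypothetical OMC answer keys: each option carries a pre-assigned weight of
# 1 to 4 for the level of understanding it represents
# (Minimal, Moderate, Significant, Correct).
OMC_WEIGHTS = {
    "Q1": {"A": 1, "B": 3, "C": 2, "D": 4},   # D is the fully correct option
    "Q2": {"A": 4, "B": 1, "C": 2, "D": 3},   # A is the fully correct option
}

def omc_score(responses):
    """Sum the pre-assigned weights of the options a student actually marked."""
    return sum(OMC_WEIGHTS[q][choice] for q, choice in responses.items())

student = {"Q1": "B", "Q2": "A"}   # significant understanding on Q1, correct on Q2
print(omc_score(student), "out of", 4 * len(OMC_WEIGHTS))   # -> 7 out of 8
```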

TMC involves subjective item selection by a teacher or test expert with right/wrong scoring. This ranks students. OMC involves both subjective item and subjective distractor selection with partial credit model scoring. OMC is a refinement of TMC.

OMC student rankings include an insight into student understanding. How practical OMC is and how it can be applied in the classroom is left for further study. I would predict it will be used in standardized tests in a few years after online testing provides the needed data to demonstrate its usefulness.

The OMC answer options are sensitive to how well a test matches student preparation. This fitness, the expected average test score when students do not know the right answer and guess after discarding all the options they know are wrong, is calculated by PUP520 for each test. This value can range from the test design value (25% for a 4-option item test) to above 80% on a test that closely matches student preparation.
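
I do not know the internal details of PUP520, but the idea behind the fitness value can be sketched as the expected score from guessing after discarding the distractors a student recognizes as wrong. The per-item counts below are invented for illustration only.

```python
def expected_guess_score(options_left):
    """Expected score on one item when guessing at random among the answer
    options a student has not been able to rule out."""
    return 1 / options_left

# Hypothetical five-item, 4-option test; the list gives, per item, how many
# options a student still considers possible after discarding known-wrong ones.
options_remaining = [4, 3, 2, 2, 1]
fitness = sum(expected_guess_score(k) for k in options_remaining) / len(options_remaining)
print(round(fitness, 2))   # -> 0.52, well above the 25% test design value
```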

[All tests make a better fit to one small group of students, and a worse fit to another small group, than to the entire class. This is just one part of luck on test day. There is no way to know which students are favored or disfavored using forced-choice testing. Judgment Multiple-Choice (JMC) permits each student to control quality independently from quantity.]

Another factor to consider when using OMC is that the number of answer options could be reduced to three (Minimal, Moderate, and Correct) to increase the portion of distractors that work as expected. Knowledge Factor only uses three answer options and omit (JMC) in its patented instruction/assessment system that guarantees mastery.

My suggestion is to add one more option to OMC: omit. Then student judgment would also be measured along with that of the psychometricians and teachers. Judgment ordered multiple-choice (JOMC) would then be a refined, honest, accurate, and fair test.

We would know what students value as the basis for further learning and instruction by letting them tell us. This makes more sense than guessing what a student knows when 1/2 of the right marks may be just luck on test day.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, November 28, 2012

A Balanced Common Core State Standards Assessment


It is time for psychometricians, teachers, and students to get on the same track with the same unit of measurement (not motorcycles, bicycles, and tricycles). Psychometricians have been top dog, feared, secretive and their judgment unquestioned. Teachers have worked hard, but to my current knowledge, only in a case like Nebraska has their judgment made a meaningful improvement in test results. Students have been treated as inanimate commercial commodities.

Optimum test results can only be obtained when the playing field is leveled for all three stakeholders. It is currently optimized from the view of psychometricians who have been strongly influenced, at times, by political power, and more often silenced by golden handcuffs. The “anomalies” that have become public and then retracted (more than once in Florida) show us the fruits of one-stakeholder rule in student performance assessment.

And now we have the Common Core State Standards tests. Students would like an honest, accurate and fair test. Teachers and students would like to know what each student knows and can do and what each one has yet to learn. Psychometricians would like highly reproducible test results, which do not require (present the opportunity for) equating test results (exposing error in selecting test items of equal difficulty) from year to year, but do present the appearance of equal difficulty.

And then we have the secondary-level stakeholders who demand (and who fund with millions of dollars) that the test results be delivered only as a ranking that shows improvement each year. They also want this done at the lowest cost. To date the secondary-level stakeholders have held the field.

Why things are as they are is then not too difficult to understand if you ignore the marketing that often overstates what is actually being done. Assessments carried out as forced activities cannot produce a valid indicator of what students actually know and can do. Such tests can produce a valid statistical ranking for satisfying a state or federal law. And that is why and how the tests have been funded.

The Common Core State Standards movement suggests that the judgment of all three primary stakeholders is included and respected. No one party is to triumph over or manipulate the other two parties. This demands some changes in the way they interact.

Students should be given the option of exercising their judgment in responding to test elements. This is inherent in classroom folders. It is also present when students have the option to respond to 5 essay items out of the 7 to 10 suggested on a test. In the alternative form of multiple-choice (quantity and quality scoring), students select the questions on which to report what, in their judgment, they trust they know or can do.

Teachers should be given the option of exercising their judgment in writing test items that provide insight into what students are learning from what they are teaching. This includes both subject matter and skills, and student development. Teachers should be able to report, based on their judgment, which group each student best fits (below, meets, or exceeds standards), as in Nebraska. Taken together, these inputs capture in numbers the climate of the classroom.

Psychometricians must respect the needs of the other two stakeholders. The oversimplification of data collection and data reduction to obtain the highest possible (but questionable) test reliability needs to become a part of the history of a natural experiment (NCLB) that has gone on too long. What works nicely in the safety of the research laboratory cannot be directly applied to individual student performances and obtain meaningful results (other than a ranking).

IMHO the Common Core State Standards movement demands the inclusion of more of the classroom climate (instruction, learning, feedback) than forced student test performances yield. The student must be given the option to report what is meaningful, useful and empowering. The mechanics are simple for the student: know and don’t know; can or can’t do. Mark an option, select a question, or perform a task when in your mind you can trust what you are doing (and that this can be used as the basis for further learning and instruction).

Students want to succeed. Teachers want them to succeed. Psychometricians need to capture what students and teachers have accomplished by letting students report knowledge, skills, and judgment. Quantity and quality scoring captures all three. Forced performances capture only part of knowledge and skills.

This has been a long introduction to three charts that summarize the psychometrician’s view of a standardized test. The first view is the result of oversimplifying the classroom environment. Only right marks are counted on multiple-choice tests, or right stuff (generally restricted to rubrics) is counted on other forms of assessment. A raw score distribution is divided into three to five parts with cut scores. This is purely a statistical concept that works with any sample of anything. Once you have it in hand, the next job is to ascribe meaning to it based on each psychometrician’s judgment. The data from Alaska indicate that about 1/4 of the time students of equal abilities switch categories from year to year. This is a sizable measurement error related to right mark scoring.
The second view includes teacher judgment (see Nebraska posts). The single distribution is now teased apart into three. The average test score is no longer 50% but near 70%. The three score regions (below, meets, and exceeds standards) now have meaning based on teacher judgment (standard deviation of 20%, for example). 


The third view includes student judgment to report what is actually known and can be done that is the trusted basis for further learning and instruction. This is what the Common Core State Standards movement states is now needed. This chart is speculative. I have no actual data for it. I do know from working with over 3000 students that the portion of a test score distribution below 50% almost vanishes with quantity and quality scoring. Also the variation (the standard deviation) is lower, giving better separation of students grouped by performance (standard deviation of 10%, for example).


The psychometrician’s view is simple, cheap, and often illusory. The teacher’s view becomes more meaningful. The student’s view completes a balanced assessment system.

In summary, the Common Core State Standards movement now demands far better test scoring and analysis than was used in the past. In the case of multiple-choice tests, the switch from right count scoring to quantity and quality scoring only involves a change in test instructions that permits each student to elect which method should be used to score the test (see prior posts). The test then yields results that students, teachers, and psychometricians can all agree look right.

Software to do this has been in existence for over two decades. Winsteps (partial credit Rasch model IRT) and Power Up Plus (Knowledge and Judgment Scoring) are two examples. Winsteps has been a popular program for state departments of education during the NCLB decade (they only need to change test instructions to assess student judgment).

Power Up Plus (PUP) is a classroom friendly program developed to provide students a means to frequently report accurately, honestly, and fairly what they actually knew and could do that was of value to themselves. They used the test results to guide further learning. I used the test results to guide my instruction and their development (passive pupil to self-correcting high achiever).

What all of this comes down to is an inversion of the present hierarchy:
  1. Let students have the opportunity to earn a quality score of 80-90% regardless of the quantity score. Let students report what they really know and can do.
  2. Let teachers submit questions that have been shown in the classroom to meaningfully group students by their understanding, ability, skill, and development. These are questions that measure something important: mastery, misconceptions, reasoning errors, and the like. Also let teachers estimate student test performance (below, meets, and exceeds standards) as a part of each standardized test.
  3. Let psychometricians do their best with counts that are based on real students and classrooms rather than conducting an academic game show. The current statistical concept for ranking students is IMHO an even less perfect match to the Common Core State Standards movement than to the NCLB standards.
This is one way to produce a balanced assessment system. The standardized test items grow from all learning experiences. Students are free to make an accurate, honest, and fair report. Psychometricians are free to moderate a meaningful assessment process.


Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, November 21, 2012

Alaska Student Assessment Three Star Rating


The Alaska Reading Standards Based Assessments contain three features worthy of a star. In 2011, they show a matched comparison analysis that provides insight into the dynamic nature of student assessment. In 2001, they also contain traditionally set cut scores and questions that are easy enough to provide actual measurement of what students know and can do.

ONE STAR: Alaska recorded, in a matched comparison analysis, the scores of students who obtained an increased, decreased, or the same (stable) score this year as last year on the reading test for 2008-2009, 2009-2010, and 2010-2011. The charts present static and dynamic views.

The portion of students in the Far Below Proficient and Below Proficient Stable group remained the same for all three comparisons. The portion of students in the Proficient and Advanced Stable group shows a very small decline from year to year. The portion of students showing a decrease in performance matched the portion showing an increase. This is a static view.


The dynamic view shows that much more is going on in this assessment system. The two Stable views above were stable because about as many students who tested Below Proficient last year tested Proficient this year (improved in proficiency) as tested Proficient last year and Below Proficient this year (declined in proficiency).


This balanced exchange also took place between Proficient and Advanced levels of performance. In total, about 26% of all students changed proficiency levels each year (about 6% of the students crossed each of the two cut scores in both directions).

There are several reasons for this churning. The most obvious is variation in student preparation from year to year (any one set of questions will match one portion of the students better than the rest of the examinees). Another is how lucky each student was on test day. This brings up test design.

TWO STARS: The Alaska test compares student performance (norm-referenced). This is the most common and least expensive way to create a standardized test. It also forces students to mark answers even when they cannot read or understand the questions. This is called right count scoring, the traditional way of scoring classroom tests. It produces a score that can be used to validly rank student performance.

THREE STARS: The 2001 Alaska Technical Report, page 18, shows the average test scores for Reading ranged from 67% to 72% for grades 3, 6, and 8. Scores above 60% can indicate what students actually know and can do rather than their luck on test day. (The publication of average raw test scores is now considered essential to permit validation of the test results and comparison with other states using the same Common Core State Standards test.) [The Spring 2006 Alaska Standards Based Assessments, Chapter 8, did not list the average raw test scores: no star.]

SCORE VARIATION: The 2001 report, page 25, also shows the standard error of measurement (SEM), an estimate of where each student’s score would land on the cut-score-divided distribution if the student could repeat the test. The example for Reading grade level 3 shows that 2/3rds of the time the repeated test scores of student “A” would fall within the range of 388 to 442 scale score units (415 original score ± 27 SEM). That is 27/351, or 7.7% of the test mean, or 27/600, or 4.5% of the full-scale score. (The SEM is derived from the test reliability and the standard deviation in scale score units. A smaller, more desirable, SEM is produced by a higher test reliability and a lower standard deviation.)
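
That relationship can be sketched directly. The reliability and scale-score standard deviation below are placeholders chosen only to show how a band like 388 to 442 arises; they are not values taken from the Alaska report.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: the spread expected in a student's score on retesting."""
    return sd * math.sqrt(1 - reliability)

# Placeholder inputs for illustration only (not the report's actual values):
sem = standard_error_of_measurement(sd=85.0, reliability=0.90)
print(round(sem, 1))                              # about 26.9 scale score units
print(round(415 - sem), "to", round(415 + sem))   # the +/- 1 SEM band around 415
```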

The standard deviation of the raw scores and of the scale scores provides a more direct view of the variation in student test scores (page 18). It is obtained by squaring each student score’s deviation from the test mean, averaging the squared deviations over the number of scores (the variance), and then taking the square root to return to normal units (squaring makes all the deviations positive; otherwise they would add up to zero).
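
In code, a sketch of that computation with hypothetical raw scores:

```python
import math

def standard_deviation(scores):
    """Population standard deviation: the square root of the average squared
    deviation of each score from the test mean."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return math.sqrt(variance)

# Hypothetical raw scores on a 42-point test:
print(round(standard_deviation([21, 25, 29, 31, 34, 39]), 1))   # -> 5.8
```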

The average standard deviation for the nine grade 3, 6, and 8 raw test scores was 8.8/30.1, or 29% of the test means; that is, 2/3rds of the time a student with an average score of 30.1 would be expected to have repeated test scores fall between 30.1 ± 8.8, or 21.3 to 38.9, on a test with 42 points total. Converting all of this into the log ratio (logit) units used by psychometricians produces slightly different results.

The average standard deviation for the nine grade 3, 6, and 8 scale scores was 83/349, or 24% of the test means; that is, 2/3rds of the time a student with an average scale score of 349 would be expected to have repeated scale scores fall between 349 ± 83, or 266 to 432, on a scale score range of 500 points (100 to 600).

Both SEM and standard deviations show a large amount of uncertainty in test scores. The documentation of this churning is worth a third star. This inherent variation in an attempt to capture student performance in a number accounts for much of the churning observed from year to year. Scoring these tests for quantity and quality instead of just counting right marks would yield much more useful information in line with the philosophy of the Common Core State Standards.

THREE OTHER STARS: Alaska places emphasis on cut scores on a single score distribution (norm-referenced). Nebraska (see previous post) places emphasis on two other score distributions (two stars): it groups scores both by asking the questions needed to assess specific knowledge and skills (criterion-referenced) and by teacher judgment about which group each student they know well best fits. Cut scores fall where a student score has an equal probability of falling into either group.

Both Alaska and Nebraska have yet to include student judgment in their assessments (one star). When that is done, Alaska will have an accurate, honest, and fair test that better matches the requirements of the Common Core State Standards.

Most right marks will then represent right answers instead of luck on test day, and there will be less churning of student performance rankings. The level of thinking used by students on the test and in the classroom can also be obtained. All that is needed is to give students the option to continue guessing or to report what they trust they know.

*   Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Including student judgment will add student development (the ability to use all levels of thinking) to the Alaska test. The Common Core State Standards need students who know and can do, but who have also exercised judgment in applying their knowledge and skills.

Routine use of quantity and quality scoring in the classroom promotes student development. It promotes the sense of responsibility and reward needed to learn at all levels of thinking, a requirement of the Common Core State Standards.

Software to do quantity and quality scoring has been available for over two decades. Alaska is already using Winsteps. Winsteps contains the partial credit Rasch model routine that scores quantity and quality. 

Power Up Plus (PUP) scores multiple-choice tests by both methods: traditional right count scoring and Knowledge and Judgment Scoring. Students can elect which method they are most comfortable with in the classroom and in preparation for Alaska and Common Core State Standards standardized tests.

Since 2005, Knowledge Factor has offered a patented learning system that guarantees student development. High quality students generally pass standardized tests. All three programs promote the sense of responsibility and reward needed to learn at all levels of thinking, a stated requirement of the Common Core State Standards movement.


Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, November 14, 2012

Scoring Judgment and the Common Core State Standards


How student judgment is to be scored by Common Core State Standards assessments has yet to be finalized. How student judgment can be scored is related to time and cost. There is little additional cost when integrated into classroom instruction (in person or by way of software), as formative assessment, with an instant to one-day feedback. Weekly and biweekly classroom tests take additional time. Summative standardized tests take even more time.

Common Core State Standards tests will be summative standardized tests. The selection of questions for all types of tests is subjective. The easiest type of test to score is the multiple-choice or selected response test. All other types of tests require subjective scoring as well as subjective selection of items for the test.

The multiple-choice test is the least expensive to score. The traditional scoring by only counting right marks eliminates student judgment playing a part in the assessment. A simple change in the test instructions puts student judgment into the assessment where judgment can carry the same weight as knowing and doing.

*  Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Traditional right count scoring treats each student, each question, and each answer option with equal value. This simplifies the statistical manipulation of student marks. This is a common psychometric practice when you do not fully know what you are doing. It produces usable rankings based upon how groups of students perform on a test, which is something different from rankings based upon what individual students actually know or can do (what teachers and students need to know in the classroom).

This problem increases as the test score decreases. We have a fair idea of what a student knows with a test score of 75% (about 3/4 of the time a right mark is a right answer). At a test score of 50%, half of the right marks can be from luck on test day.

These two problems almost vanish when student judgment is included in the alternative multiple-choice assessment. Independent scores for knowledge and judgment (quantity and quality) indicate what a student knows and to what extent it can be trusted at every score level. This provides the same type of information as is traditionally associated with subjectively scored alternative assessments that all champion student judgment (short answer, essay, project, report, and folder).

Multiple-choice tests can cover all levels of thinking. They can be done in relatively short periods of time. They can be specifically targeted. Folders can cover long time periods and provide an appropriate amount of time for each activity (as can class projects and reports).

Standardized test exercises run into trouble when answering a question is so involved and time is so limited that the announced purpose of demonstrating creativity and innovation cannot take place in a normal way. My own experience with creativity and innovation is that it takes several days to years. These types of assessments IMHO then become a form of IQ test when students are forced to perform in a few hours.


Quantity and quality scoring can be applied to alternative assessments by counting information bits, in general, a simple sentence. It can also be a key relationship, sketch, diagram, or performance; any kernel of information or performance that makes sense. The scoring is as simple as when applied to multiple-choice.

Active scoring starts with one half of the value of the question (I generally used 10 points for essay questions which produced a range of zero to 20 points for an exercise taking about 10 minutes). Then add one point for each acceptable information bit. Subtract one point for each unacceptable information bit. Fluff, filler, and snow count zero. 
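
A sketch of that bookkeeping, reading the parenthesis above as a 20-point question that starts at 10 points (half its value):

```python
def active_essay_score(acceptable_bits, unacceptable_bits, max_points=20):
    """Active scoring as described above: start at half the question's value,
    add 1 per acceptable information bit, subtract 1 per unacceptable bit;
    fluff, filler, and snow count zero either way."""
    score = max_points // 2 + acceptable_bits - unacceptable_bits
    return max(0, min(max_points, score))   # keep the score within 0..max_points

# An answer with 6 acceptable bits and 2 unacceptable ones earns 10 + 6 - 2 = 14.
print(active_essay_score(6, 2))   # -> 14
```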

Quantity and quality scoring and rubrics merge when acceptable information bits become synonymous. Rubrics can place preconceived limits (unknown to the student) on what is to be counted. With both methods, possible responses that are not made count as zero. Possible responses that are made but are not included in a rubric are not counted; they are counted with quantity and quality scoring. In this way quantity and quality scoring is more responsive to creativity and innovation. The downside of quantity and quality scoring, applied to alternative assessments (other than multiple-choice), is that it involves the same subjective judgment as a scorer working with rubrics.

Standardized multiple-choice tests have been over marketed for years. The first generation of alternative and authentic tests also failed. This gave rise to folders and the return of right mark scored multiple-choice. The current generation of Common Core State Standards alternative tests appears to again be over marketed. 

We want to capture in numbers what students know and can do and their ability to make use of that knowledge and skill. Learning and reporting on any good classroom assignment is an authentic learning academic exercise. The idea that only what goes on outside the classroom is authentic is IMHO a very misguided concept. It directs attention away from the very problems created by an emphasis on teaching rather than on meeting each student’s need to catch up, to succeed each day, and to contribute to the class.

The idea that only a standardized test can provide needed direction for instruction is also a misguided concept. It belittles teachers. It is currently impossible to perform as marketed unless carried out online. Feedback must be within the critical time that positive reinforcement is achieved. At lower levels of thinking that feedback must be in seconds. At higher levels of thinking, with high quality students, feedback that takes up to several days can still be effective.

Common Core State Standards assessments must include student judgment. They must meet the requirements imposed by student development. Multiple-choice (that is not forced choice, but really is multiple-choice, such as the partial credit Rasch model IRT and Knowledge and Judgment Scoring) and all the other alternative assessments include student judgment.

All students are familiar with multiple-choice scoring (count right, wrong and omit marks). Few students are aware of the rubrics created to improve the reliability of subjectively scored tests. This again leaves the multiple-choice test as the fairest form of quick and directed assessment when students can exercise their judgment in selecting questions to report what they trust they actually know and can do.

For me, it gave a better sense of what the class and each student knew and could do (and, as importantly, did not know and could not do) than reading 100 essay tests did. Knowledge and Judgment Scoring does a superior job of highlighting misconceptions and grouping students by specific learning problems in classes of over 20 students.

Wednesday, November 7, 2012

Student Judgment and the Common Core State Standards


The Common Core State Standards require students to take more responsibility for learning and for reporting what they actually know and can do. This takes practice. It is much like changing from riding a tricycle to a bicycle. You can ride both on the same course but the bicycle requires that you learn to balance. And once you learn to ride the bicycle you will never go back to the tricycle.

Most standardized assessments assume a linear increase in knowledge and skills. This is acceptable at lower levels of thinking (with training wheels). But at some point, sooner for some students and tasks and later for other students and tasks, there is a large leap from rote memory to understanding. Students see this as an escape from a boring and seemingly never ending task, of following the teacher, to the freedom of realizing there are limits in which things are related in such a way they make sense and give a feeling of completeness, of mastery, of empowerment.

            I know my A, B, Cs. 

I can get a 90 degree angle from any 3, 4, 5 unit triangle.

I can increase my energy level by eating a good breakfast and not stuffing before going to bed.

Many highly marketed products are not worth buying.

What I want now and what I need are not the same thing.

The difference between traditional classroom multiple-choice and cafeteria multiple-choice is I must mark each question on the test but I don’t have to eat one of each product on the counter.

Getting from one place to another can be very complicated but at each intersection I have just three options.

Nebraska in 2009 created standardized tests that reflect the judgment of classroom teachers (see prior posts). The tests currently do not reflect the judgment of students. The chart estimates what part teacher judgment and student judgment each play in determining raw scores on the Nebraska tests. The contribution each makes as the scores increase is not a linear event.


Teachers must do the heavy lifting until students develop the sense of responsibility needed to learn and report at higher levels of thinking. There is a shift from being dependent to being independent. This is a basic tenet of the Common Core State Standards movement.

[The opposing view is that children naturally have the behavior of learning like a scientist: observe, question, answer, verify. This is how they learn to walk and talk and mimic adult behavior. If this natural behavior had not been suppressed in their early school years, there would be no need to bring it back into play later, in my case, working with underprepared college freshmen who often behaved like passive middle school pupils.]

Once students have observed that they can succeed, that they are self-empowered, they still need teacher judgment to guide them. We are then dealing with a student who knows and can trust what is known as the basis for further learning and instruction. The student is now free to be innovative and creative, to, in my case, elect to take part in voluntary projects and oral reports to the class. These students modeled for others in the class how to be successful: make sense of assignments, and their own questions, rather than memorize nonsense.

[The difference between expected and observed results in the sciences and engineering is referred to as error. That same variation in arts and letters is referred to as being innovative and creative. And in medicine and social affairs, life/death and promotion/prison.]

Children want to learn to ride a bicycle or a skateboard, and to swim. Learning is scary but instantly rewarding. The Educational Software Cooperative, a non-profit, was formed in 1994 to promote that same environment for students on computers. Now a more advanced environment, for the same software, is available on tablets and the Internet. When they feel prepared, students can report using software that measures both knowledge and judgment at all levels of thinking (Winsteps, partial credit Rasch model; and Power Up Plus, Knowledge and Judgment Scoring). A quantity and quality scored test can sort out which students are just repeating instruction and which students are reporting what they trust they know and can do.

In the future, I expect that assessment will be such an integrated function that it will be recorded as students learn at lower levels of thinking or in the classroom at all levels of thinking. Online courses are now doing this. The classroom of the future, IMHO, will still provide safe day care, teacher moderated group learning and assessment, and software learning and assessment for individual students. Equal emphasis will be placed on students learning and on their development as self-correcting, self-motivated high quality achievers. Success on Common Core State Standards tests will require such students.

Wednesday, October 31, 2012

An Assessment Worthy of the Common Core State Standards


The Common Core State Standards go beyond just knowing, believing, and guessing. They demand an assessment that includes the judgment of psychometricians, teachers, and students. For the past decade, psychometricians have dominated, making judgments from statistical information. The judgment of teachers was given equal weight in 2009 in Nebraska (see prior post).

The power of student judgment needs to be discussed, along with a way of adding students as the third primary stakeholder in standardized testing. Currently the old alternative and authentic assessment movements are being resurrected into elaborate, time-consuming exercises. The purpose is to allow students to display their judgment in obtaining information, in processing it, and in making an acceptable (creative and innovative) report.

Traditional multiple-choice scoring, which only counts right marks, is correctly not included. Students have no option other than to mark. A good example is a test administered to a class of 20 students marking four-option questions (A, B, C, and D). Five students mark each option on one question. That question has 5 right out of 20 students, or a difficulty of 25%. There is no way to know what these students know. A marking pattern with an equal number of marks on each answer option indicates they were marking because they were forced to guess. They could not use the question to report what they actually trusted they knew. Student judgment is given no value in traditional right count scored multiple-choice testing.

The opposite situation exists when multiple-choice is scored for quantity and quality. Student judgment has a powerful effect on an item analysis by producing more meaningful information from the same test questions. Student judgment is given equal weight to knowing by Winsteps (partial credit Rasch model IRT, the software many states use in their standardized testing programs) and by Power Up Plus (Knowledge and Judgment Scoring, a classroom oriented program). Scoring now includes A, B, C, D, and omit.


Continuing with the above example, eight different mark patterns related to student judgment are obtained, rather than the two obtained from traditional multiple-choice scoring. The first would be to again have the same number of marks and omits (4 right, 4 wrong, 4 wrong, 4 wrong marks, and 4 omits). This again looks like a record of student luck on test day. I have rarely seen such a pattern in over 100 tests and 3000 students. Experienced students know to omit for one point rather than to guess and get zero points when they cannot trust using a question to report what they actually know or can do.

The next set of three patterns omits one of the wrong options (4 right, 4 wrong, 4 wrong, and 8 omits). Students know that one option is not right. They cannot distinguish between the other two wrong options (B & C, B & D, and C & D). By omitting, they have uncovered this information, which is hidden in traditional test scoring where only right marks are counted.

In the second set of three patterns, students know that two options are not right but cannot reliably distinguish between the remaining right and wrong options. Instead of a meaningless distribution of marks across the four options, we now know which wrong option students believe to be a right answer (B or C or D). [Both student judgment and item difficulty are at 50% as they have equal value.]

The last answer pattern occurs when students either mark a right answer or omit. There is no question that they know the right answer when using the test to report what they trust they know or can do.

In summary, quantity and quality scoring allows students of all abilities to report and receive credit for what they know and can do, and also for their judgment in using their knowledge and skill. The resulting item analysis then specifically shows which wrong options are active. Inactive wrong options are not buried under a random distribution of marks produced by forced-choice scoring.

All four sets of mark patterns contain the same count of four right marks (any one of the options could be the right answer). Both scoring methods produce the same quality score (student judgment) when all items are marked (25%). When student judgment comes into play, however, the four sets of mark patterns require different levels of student judgment (25%, 33%, 50% and 100%).

Right count scoring item difficulty is obtained by adding up the right (or wrong) marks (5 out of 20 or 25%). Quantity and quality scoring item difficulty is obtained by combining student knowledge (right counts, quantity) and student judgment (quality). Both Winsteps and Power Up Plus (PUP) give knowledge and judgment equal value. The four sets of mark patterns then indicate item difficulties of 30%, 40%, 50% and 60%.
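
Here is a sketch that reproduces the quality and difficulty figures above. The point values are my reading of the description (right = 2, omit = 1, wrong = 0, so knowledge and judgment carry equal weight); they are not quoted from the Winsteps or PUP documentation.

```python
def kjs_item_stats(right, wrong, omit):
    """Knowledge and Judgment Scoring for one item, assuming right = 2 points,
    omit = 1, wrong = 0. Quality is the share of marks that are right;
    difficulty is the share of possible points the class actually earned."""
    students = right + wrong + omit
    quality = right / (right + wrong)
    difficulty = (2 * right + omit) / (2 * students)
    return quality, difficulty

# The four sets of mark patterns above, as (right, wrong, omit) of 20 students:
for pattern in [(4, 12, 4), (4, 8, 8), (4, 4, 12), (4, 0, 16)]:
    quality, difficulty = kjs_item_stats(*pattern)
    print(f"quality {quality:.0%}  difficulty {difficulty:.0%}")
# -> quality 25%, 33%, 50%, 100%; difficulty 30%, 40%, 50%, 60%
```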

[Abler students always make questions look easier. Measuring student quality makes questions look easier than when just counting right marks and ignoring student judgment. The concept of knowledge and judgment is combined into one term, the location on a logit scale (natural log of the ratio of right to wrong marks), for person ability (and the natural log of the ratio of wrong to right marks for item difficulty) with Rasch model IRT using Winsteps. The normal scale of 0 to 50% to 100% is replaced with a logit scale of about -5 to zero to +5.]
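
The conversion itself is easy to sketch. (The operational Rasch estimate in Winsteps is iterative, but the scale it reports is this log-odds scale.)

```python
import math

def to_logits(percent_right):
    """Natural log of the odds of a right mark; 50% maps to 0 logits."""
    p = percent_right / 100
    return math.log(p / (1 - p))

for pct in (1, 10, 25, 50, 75, 90, 99):
    print(pct, "% ->", round(to_logits(pct), 2), "logits")
# 50% sits at 0 logits; 1% and 99% sit near -4.6 and +4.6 (roughly the -5 to +5 range)
```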

Quantity and quality scoring provides specific information about which answer options are active, the level of thinking students are using, and the relative difficulty of questions that have the same number of right marks. IMHO this qualifies it as the method of choice for scoring Common Core State Standards multiple-choice items (and for preparation for such tests).

Forced guessing is no longer required to obtain results that look right. Experienced students prefer quantity and quality scoring. It is far more meaningful than playing the traditional role of an academic casino gambler.

Wednesday, October 24, 2012

Nebraska Student Assessment Four Star Rating


Nebraska is now poised for a fifth star. It has a standardized assessment that ranks, that measures, that produces meaningful student test scores, and is deeply influenced by the judgment of experienced classroom teachers. Nebraska is the first state I have found that has transparently documented that it is this close to an accurate, honest, and fair assessment.

Five critical features can rate the standardized tests used by state departments of education over the past ten years. These tests have evolved from just ranking students, teachers, and schools based solely on the judgment of psychometricians to including the judgment of teachers.

A five star rating would include the judgment of students to report what they know accurately, honestly, and fairly instead of guessing at the best answer to each multiple-choice question.

A test can earn three stars based on the judgment of psychometricians, one star on the judgment of teachers, and one star on the judgment of students. These are the three main stakeholders in doing a standardized multiple-choice test.

There are other stakeholders who make use of and market the test results. These secondary stakeholders often do not market the true nature of the standardized test in hand. Their claims may not match the test results.

ONE STAR: Any standardized multiple-choice test earns one star. The norm-referenced test compares one student with another. Raw test scores are plotted on a distribution. A judgment is then made where to make the cut scores. Many factors can be used in making this judgment. It can be purely statistical. It can attempt to match historical data. It can be a set portion for passing or failing. It can be whatever looks right. The cut score is generally marketed with exaggerated importance.

TWO STARS: A criterion-referenced test earns two stars. This test contains questions that measure what needs to be measured. It does not compare one student with another. It groups students with comparable abilities. Nebraska uses below standard, meets standard, and exceeds standard. This divides the score distribution into three regions. Cut scores fall at the point where a student has an equal chance of falling into either region. The messy nature of measuring student knowledge, skill, and judgment is transparent. Passing means preparing to meet the standard set for the median of the meets-standard region, not preparing to be just one point above the cut score.

THREE STARS: The much-decried right count scored multiple-choice test performs better with higher test scores than with lower ones. Right marks on tests scored below 60% are questionable. Tests scored below 50% are as much a product of luck on test day as of student ability. We can know what students do not know. Psychometricians like test scores near 50% as they lend stability to the test data. Nebraska designed its test for an average test score of 65%, plus questions needed to cover the blueprint requirements for a criterion-referenced test. The Nebraska standardized 2010 Grade 3 Reading test produced an average score of 72%. Nebraska can know what students do know about 3/4 of the time: three stars.

FOUR STARS: Nebraska earns a fourth star for including teacher judgment in writing questions, in reviewing questions, and in setting the criterion-referenced standards. The three regions (below, meets, and exceeds standards) have meaning beyond purely statistical relationships. It was teacher judgment that moved the test design from an average score of 50% to 72%. The scores now look very much like those produced by any good classroom test. They can be interpreted and used in the same way.

FIVE STARS: Nebraska has yet to earn a fifth star. That requires student judgment to be included in the assessment system. When that is done, Nebraska will have an accurate, honest, and fair test that also meets the requirements of the Common Core State Standards.

Most right marks will also represent right answers instead of luck on test day (less churning of individual test scores from year to year). The level of thinking used by students on the test and in the classroom can also be obtained. All that is needed is giving students the option to continue guessing or to report what they trust they know.

*   Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Including student judgment will add student development (the ability to use all levels of thinking) to the Nebraska test. Students need to know and be able to do, but they also need to have exercised judgment in applying knowledge and skills in situations different from those in which they learned.

Routine use of quantity and quality scoring in the classroom (be it multiple-choice, short answer, essay, project, or report) promotes student development. It promotes the sense of responsibility and reward needed to learn at all levels of thinking (passive pupils become active self-correcting learners). IMHO if students fail to develop this sense of responsibility, the Common Core State Standards movement will also fail.

Software to do quantity and quality scoring has been available for over two decades. Nebraska is already using Winsteps. Winsteps contains the partial credit Rasch model routine that scores quantity and quality. 

Power Up Plus (PUP) scores multiple-choice tests by both methods: traditional right count scoring and Knowledge and Judgment Scoring. Students can elect which method they are most comfortable with in the classroom and in preparation for standardized tests.

Since 2005, Knowledge Factor has offered a patented learning system that guarantees student development. High quality students generally pass standardized tests. All three programs promote the sense of responsibility and reward needed to learn at all levels of thinking, a requirement for meeting the Common Core State Standards.

Newsletter (posted 23 OCT 2012): http://www.nine-patch.com/newsletter/nl5.htm


General References:

Roschewski, Pat (online 25 OCT 2005) History and Background of Nebraska’s School-based Teacher-led Assessment and Reporting System (STARS). Educational Measurement: Issues and Practice, Volume 23, Issue 2, pages 9-11, June 2004. http://onlinelibrary.wiley.com/doi/10.1111/j.1745-3992.2004.tb00153.x/abstract# (accessed online 6 Oct 2012)

Rotherham, Andrew J. (July 2006) Making the Cut: How States Set Passing Scores on Standardized Tests. http://castle.eiu.edu/dively/documents/evaluatingstudentachievements/EXPCutScores.pdf (accessed online 6 Oct 2012)