Wednesday, December 21, 2011

True Score Diviner


The previous few posts listed weaknesses in traditional multiple-choice right mark scoring (RMS). Aside from producing a rank of increasingly questionable value as test scores decrease, RMS results are seriously flawed for use in current formative assessments. Quality and quantity are still locked together. They are not locked together in projects, reports, and essay tests. Even on a failing project there can still be a note: “Great use of color”; “Great idea; another bit of editing and this is a great paper.”

Knowledge and Judgment Scoring (KJS) does the same thing with multiple-choice tests: “You got a quality score of 90% on the questions you selected to mark. Now apply the same preparation to more of the assignment and you will have a passing score. We know you can do it!”

RMS test scores are always suspect and often meaningless. The True Score Diviner can help you find your true score, or if your score is your true score, the range of test scores you may have gotten with the same preparation. 

At 100%, your test score and true score are one and the same. With a test score of 25% on a 4-option-question test, your true score could range from 25 - 25, or zero, to 25 + 25, or 50%. Half of the time RMS cheats you and half of the time it teases or lies to you. You have a lucky day or an unlucky day. There is no way to know which, or by how much, from a single test. Statistical procedures say very little about single events strongly related to luck. They could help if you took about five versions of the test and calculated an average test score. You do not do that.
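
The arithmetic behind the Diviner can be sketched in a few lines. This is only an illustration of the luck bracket on a 4-option test, not the Diviner itself; the function name and the simple plus-or-minus-chance bracket are my own simplification.

    # Rough sketch of the luck range behind a right mark score (RMS).
    # Assumption: a 4-option test, so blind guessing contributes about
    # one option's worth of score (25 percentage points) either way.

    def rms_luck_range(test_score_pct, options=4):
        """Bracket the possible true score for a given RMS test score."""
        chance = 100.0 / options                     # expected chance contribution, 25%
        low = max(0.0, test_score_pct - chance)
        high = min(100.0, test_score_pct + chance)
        return low, high

    print(rms_luck_range(25))    # (0.0, 50.0) -- the zero to 50% range described above
    print(rms_luck_range(50))    # (25.0, 75.0)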

Knowledge and Judgment Scoring (KJS) solves this problem by letting you report what you know and trust. You are, in effect, scoring your own test based on your own preparation. Each student gets a customized test. Guessing is not required.

Now both student and teacher know what every high quality student knows and can do, and can trust it as the basis for further instruction, learning, and application, regardless of the test score. Quantity and quality each generate a separate score.

We need to promote Knowledge and Judgment Scoring (KJS). Power Up Plus (PUP) does this by offering students both RMS and KJS. They make the switch when they have matured enough in a supportive classroom that places equal emphasis on knowing and on the skills required by the successful independent achiever. I know. I don’t know. I know how to know.

RMS today makes as much sense as selling gasoline at $3 per gallon from a pump that averages one gallon for each $3. It may deliver less than ½ gallon to over two gallons for each $3. But it does deliver an average of $3 per gallon if you sum all the customers for the day. That is a range of less than $1.50 to over $6.00 per gallon. Such a situation in academic measurement still goes, for the most part, unquestioned.

Wednesday, December 14, 2011

Is Student Debriefing Hacking?


“How did the test go?”  “Fine.” This common exchange is heard after every standardized test. It does not disclose the content of the test, the questions on the test, or the score. It is more wish than fact. It reveals nothing meaningful to student or teacher, a frequent end result of NCLB standardized testing.

The current trend in revising the Elementary and Secondary Education Act (ESEA) is to add tests within the course to the final test. This is promoted as formative testing. Unfortunately, formative testing requires timely feedback. Computers can provide non-judgmental, timely feedback. This gave rise to the non-profit Educational Software Cooperative, Inc. Learning at higher levels of thinking (question/answers/verify) provides effective, self-motivating feedback. A standardized test that only returns a test score several weeks later has little if any formative content.

The new within-course tests are actually an expansion of predatory testing. Predatory testing crowds out instructional/learning time. It unfortunately encourages lengthy test preparation at lower levels of thinking by the very schools that most need higher levels of thinking instructional/learning time. It encourages a short-term fix rather than a long-term solution (rote over understanding).

The classroom teacher has several options:
  1. Devote little, if any, time to test preparation. Conduct the classroom in such a manner that the standardized test is, as knowledgeable students put it, “No big deal.”
  2. Prepare students to take the test at higher levels of thinking by using Knowledge and Judgment Scoring (KJS) on projects and classroom essay and multiple-choice tests.
  3. Continue lengthy test preparation at lower levels of thinking (which, in my opinion, should be outlawed and recognized as a trait of incompetent school administration).

One way of making ESEA standardized tests function as formative assessments is to debrief students shortly after the test. High scoring classes can do this very informally for the first teacher option above.

Less successful classes, at higher levels of thinking, can collect the topics students find puzzling. High quality students have good judgment in determining what they know and what they have yet to learn.

At lower levels of thinking, students and teachers are most interested in the right answer for each question: A or B or C or D. Debriefing at this level, in my opinion, is as meaningless as reading off the answers to an in-class test.

Each of the above levels penetrates closer to the actual question stem and answer options. The concept of “fair use”, when applied to standardized test questions, requires that whatever is done must not reduce the market value of the test. It must not be for profit. It must only benefit the participating students. The actual test questions must not be discussed. They must remain secret. Debriefing is then restricted to a one-time affair. Debriefing decreases in value from students performing at higher levels of thinking down to those at lower levels of thinking.

Student debriefing is hacking:

  1. It is a violation of copyright. (Fair use of copyrighted material does not include disclosing or direct copying of a standardized test question. A standardized test question is used to make comparative assessments [the common items must be protected]. By its very nature, it must be kept secret or its market value is affected. What portion can be copied or referenced is open to interpretation*.)
  2. It promotes the sale of test question answers. (Informal debriefing, and debriefing at higher levels of thinking, require neither the exact question stems nor the answer options. Any attempt to recall exact question stems and answers is of limited use, as good standardized tests scramble the answer options, edit the question stems, and replace a portion of the questions between each test. Computer adaptive tests [CAT] do much of this during each student administration – no two students even get the same test.)

Student debriefing is not hacking:

  1. It makes a formative assessment out of predatory testing.
  2. Debriefing with a test company provided summary lesson plan, listing topics with model test questions, would not be hacking.  For a test of 30 questions covering 6 topics, the 6 topics could be listed with a model question for each topic.  The model questions could be ones released from past tests. In-class scoring of this summary test would provide immediate feedback for students and teachers. This formative assessment lesson plan would increase the test’s market value.

*At one extreme, the Georgia Professional Standards Commission bans any mention of, reference to, or discussion of test questions. Students take the test and close the booklets. The closed booklets are collected and returned.

At the other extreme, parents of students who have learning problems can view the test booklets. This is justified as “fair use” as it provides parents some idea of what the student should have been able to do.  It is of help in educating the student. It is not for profit. This one time use applies to no one other than to the parent/school/student relationship. It is therefore not a breach of security.

Wednesday, December 7, 2011

Is Wallpapering Hacking?

Hacking, in the beginning, was an honorable tradition of learning how to control and use a computer for something useful without having access to machine and language manuals. It was playing (question/answers/verify, just as is done in putting a puzzle together). It was pioneering. It was empowering. It was fun. Over time, “hacking” came to mean all of the above, but with malicious intent. A few bad apples tarnished the image of the bright and the bold.

Dumb wallpapering, marking the same option (“C” for example) when you do not know, does not improve test results or student scores. Smart wallpapering, creating a unique answer pattern PRIOR to seeing the test, yields improved KJS results. It can rather uniformly alter student scores.


When the wallpaper contains a right answer, everyone who uses the wallpaper mark gets a right answer. This holds for low quality and high quality students. This is fair. The class, the team, wins or loses together. This is the same level at which standardized Dumb Testing data make sense in ranking classes and schools.

Wallpapering reduces test stress by reducing the time and effort wasted on trying to find the “best answer” to a question you cannot read or understand, let alone one for which you have nothing in mind as an answer.

Wallpapering is hacking:

  1. It restricts a wrong mark to one option per question. (The mathematical model for Dumb Testing assumes that a student randomly marks wrong answers. This is not true. The model also sets the starting test value at zero. This is not true. On a 4-option-question test, the starting value is 25%, on average.)
  2. Students are acting in collusion. (It makes no difference whether individual students decide before the test or during the test which option to mark when forced to mark. Wallpapering requires the selection to be made BEFORE seeing the test.)

Wallpapering is not hacking:

  1. It only formalizes the advice students have been given for decades: “Mark ‘C’ if you cannot select a ‘best’ answer”.
  2. It does not change Dumb Testing standardized test scores. 

Wednesday, November 30, 2011

Smart Wallpaper Testing

The idea for wallpaper came from a simple fact. Students need protection from predatory testing. Whether they know or not, they must mark an answer to each question. Birds fly in flocks and fish swim in schools. They do the same thing at the same time to avoid predators. Wallpaper lets students mark the same option when they cannot use the test to report what they know.

Two wallpaper patterns can be used to extract higher levels of thinking (Smart Testing) information. Dumb wallpaper is based on one of the answer options. Smart wallpaper can be based on the most frequent wrong mark for each question, for example.  Dumb wallpaper pays no attention to student performance. Smart wallpaper is based on expected student performance.

Wallpaper extracts higher levels of thinking (Smart Testing) information using Knowledge and Judgment Scoring (KJS). The assumption is that students omit or use the wallpaper pattern when not using the question to report what is known and trusted. This can be seen in the progression from KJS without wallpaper (Table 3bST), to KJS with Dumb wallpaper (Table 3bSD), and to KJS with Smart wallpaper (Table 3bSS).

The student counseling mark matrix analysis (the test taker view of the test) changes from nonsense, to a better performance with Dumb wallpaper, to a typical Knowledge and Judgment Scoring (KJS) printout with Smart wallpaper.

Test scores increase as the simulated quality increases. The distributions (standard deviations) of scores and of item difficulties decrease. Test reliability declines!  Oops!  “Houston, we have a problem!” Test companies optimize (brag about) their test reliability based on poor quality data. KJS optimizes student judgment to produce accurate, honest, and fair data.


This table clearly captures this conflict in numbers. High test reliability is needed to obtain similar consecutive average test scores. It follows that the lower the quality of student scores and the lower the average test score, the more chance determines the average test score. It is also known that the normal curve is highly reproducible by chance alone. High test reliability can become an artifact of test design rather than of student performance.
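
That relationship between score spread and reliability can be checked with a standard internal-consistency coefficient. A minimal sketch, assuming dichotomously scored (0/1) items and KR-20 as the estimate (these posts do not say which coefficient the test companies report); the small data sets are made up.

    # KR-20 sketch: the coefficient rises and falls with the spread (variance)
    # of total scores, which is the conflict described above.

    def kr20(item_matrix):
        """item_matrix: one row per student, each row a list of 0/1 item scores."""
        n_items = len(item_matrix[0])
        n_students = len(item_matrix)
        totals = [sum(row) for row in item_matrix]
        mean_total = sum(totals) / n_students
        var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
        if var_total == 0:
            return 0.0
        pq_sum = 0.0
        for i in range(n_items):
            p = sum(row[i] for row in item_matrix) / n_students
            pq_sum += p * (1 - p)
        return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

    wide   = [[1]*8 + [0]*2, [1]*5 + [0]*5, [1]*2 + [0]*8]   # spread-out total scores
    narrow = [[1]*5 + [0]*5, [1]*6 + [0]*4, [1]*4 + [0]*6]   # bunched total scores
    print(round(kr20(wide), 2), round(kr20(narrow), 2))      # 0.86 0.37

When the total scores bunch together, the coefficient drops sharply, which is why a test optimized for reliability rewards a wide spread of scores rather than high quality answer sheets.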

To the fact that the starting score on a multiple-choice test is 1/(number of options) rather than zero, we can now add a second form of self-deception (psychometricians refer to these as simplifications). They made some sense when everything was done with paper and pencil. Today there is no need to keep quality and quantity locked together on a multiple-choice test, especially now that one method (KJS) can measure what students actually know and trust rather than just rank students (RMS).

The misconceptions in Table 3bST are artifacts created by forcing students to mark when they have no answer of their own. They were not given the option to omit (to mark an accurate, honest and fair answer sheet). Table 3bSS, using Smart wallpaper, shows all four groups of questions (expected, discriminating, guessing, and misconception – EDGM). Higher quality students earn higher test scores that are more accurate, honest and fair.
                       
The scores in Table 3bSS are only obtainable if students omitted instead of marking the most frequent wrong mark for each question. This simulation fails to capture what students would actually do if given the opportunity to mark only when marking reports something they know and trust (can confirm). Given that opportunity, some quality scores would be higher and some lower. Also, there is no way to know in advance which wrong mark will be the most frequently marked for each question. Wallpaper must be created BEFORE the test, not after the test.

This simulation again demonstrates that there is no way of equating RMS and KJS results from one set of data. To know what students actually know, they must be given the opportunity to report what they know that is meaningful and useful as the basis for further learning, instruction, and use on the job. Traditional RMS only does this when test scores are near 90%. Knowledge and Judgment Scoring (Smart Testing) yields a valid quality score (%RT) for every test score, and a valid test score for every high quality (%RT) student performance.

Saturday, November 26, 2011

Wallpaper Modified Testing

The minimum requirement for traditional multiple-choice tests is to mark one option on each question, right mark scoring (RMS). The student is not given the option to omit. The test score indicates luck, guessing, and what the student may know. The score only ranks the student. After experiencing Knowledge and Judgment Scoring (KJS) my students called traditional testing Dumb Testing. Dumb Testing is easy and fast. Reading all the test questions is optional.

Smart Testing (KJS) requires that each question stem is read and visualized (a web of relevant relationships) before looking at the answer options. If the student’s answer matches one of the answer options, that option is probably the right answer for the question. The student has brought something to the test that can be reported using this question.

Knowledge and judgment can be given equal value. The test score is a combination of the knowledge and judgment scores (the quantity and quality scores). Forced guessing is not required. The result is an accurate, honest and fair test score.

Changing from Dumb Testing to Smart Testing requires some experience. This is much like changing from a tricycle to a bicycle. It is scary the first few times. After that it is fun. Over 90% of students voluntarily switch from Dumb Testing to Smart Testing after two experiences.

Until Smart Testing is offered on NCLB standardized tests, there is a way to modify Dumb Testing to obtain Smart Testing information. It comes from wallpapering the answer sheet. It requires a third key (WP KEY) for the wallpaper.

The trick is to assign one answer option on each question as the “omit” option BEFORE seeing the test. Students mark only if they can trust the answer to be correct. Instead of, “mark a best answer on each question”, now students only, “mark answers you can use to report what you trust you know or can do”. Near the end of the test, they fill in the remaining marks following the wallpaper design.
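
A minimal sketch of the bookkeeping, not PUP's actual code: right mark scoring ignores the wallpaper, while the Smart Testing side converts any mark that matches the wallpaper key into an omit before Knowledge and Judgment Scoring is applied. The five-question answer key, wallpaper pattern, student marks, and names below are made up for illustration.

    # Wallpaper trick, sketched. RMS counts right marks as usual; for KJS the
    # marks matching the posted wallpaper key are treated as omits (None).

    ANSWER_KEY = list("BDABD")      # hypothetical 5-question answer key
    WALLPAPER  = list("CCCCC")      # "mark C if you cannot use the question"

    def rms_percent(marks):
        right = sum(m == k for m, k in zip(marks, ANSWER_KEY))
        return 100.0 * right / len(ANSWER_KEY)

    def apply_wallpaper(marks):
        """Convert wallpaper marks to omits for Smart Testing (KJS) analysis."""
        return [None if m == w else m for m, w in zip(marks, WALLPAPER)]

    student = list("BCACD")          # questions 2 and 4 were wallpapered
    print(rms_percent(student))      # 60.0 -- wallpaper marks are simply wrong marks
    print(apply_wallpaper(student))  # ['B', None, 'A', None, 'D'] -- ready for KJS

The RMS score is untouched by the conversion, while the wallpapered sheet now also carries the omit information Smart Testing needs.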

The simplest design is the age-old advice: “If you do not know an answer, just mark C”. Any letter option can be selected for the class PRIOR to seeing the test. 

The next most frequent design students have used is the “Christmas tree”: ABCDABCD . . . or AABABCABCDAAB . . . Random designs can be used if the pattern is posted for all the students to use at the end of the test.

Wallpapering does not change Dumb Testing (RMS) test scores. Changing a wrong mark to a wallpapered omit is still “wrong” with traditional right mark scoring (RMS).  

Right Mark Scoring clicker data with no wallpaper.

Right Mark Scoring clicker data with Dumb wallpaper (based on any single answer option).

Right Mark Scoring clicker data with Smart wallpaper (based on student judgment).

Commercial testing companies can still score the tests to produce traditional Dumb Testing student and school rankings.

Wallpapering does change Smart Testing (KJS) test scores. Power Up Plus (PUP) then extracts quantity and quality Smart Testing values (including test maker and test taker views). (See next post.)

Wednesday, August 24, 2011

Standardized Testing - Structure, Function, and Operation

The structure, function and operation of standardized testing must all be considered when evaluating the usefulness of test results. Standardized test results are not always what they are claimed to be. When mixed with politics, they usually have even less value, as will be discussed near the end of this post.

Standardized testing involves test score distributions (statistical tea leaves). Their two most easily recognized characteristics are the average score, or mean, and the spread of the distribution, or standard deviation (SD).

Two methods of obtaining score distributions are now in use. The traditional method, counting right marks on a multiple-choice test, is the same as used on most classroom tests. The Rasch model method, used by many state education departments, converts test results to estimated measures of student ability and of item difficulty.
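
For readers new to the Rasch model, the conversion it performs can be sketched in a couple of lines. This is only an illustration of the dichotomous model, not Winsteps itself: the probability of a right mark depends only on the difference between a student's ability measure and an item's difficulty measure, both expressed in logits.

    import math

    # Dichotomous Rasch model: P(right) depends only on (ability - difficulty),
    # both in logits. Illustration only; not Winsteps code.

    def p_right(ability, difficulty):
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    print(round(p_right(0.0, 0.0), 2))   # 0.5  -- ability equals item difficulty
    print(round(p_right(1.0, 0.0), 2))   # 0.73 -- one logit above the item
    print(round(p_right(-1.0, 0.0), 2))  # 0.27 -- one logit below the item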

The value of multiple-choice test results depends upon how the test is administered. Both methods allow for two modes: forced choice, in which students mark every answer, and student choice, in which students mark only the items they can use to report what they trust as usable for further learning and instruction.

The following table relates the above four combinations to two software programs, the fixed, reproducible, structures that produce score distributions. Power Up Plus (PUP) and Winsteps produce score distributions for classroom use and for standardized testing. 

Three of the four modes produce traditional right count quantitative score distributions: Quantity Scores. KJS adds a quality score that is comparable to the full credit mode measure distribution.


The distribution of scores from a traditional multiple-choice test can be a good indicator of classroom performance (teacher and student). As a standardized test, only counting right marks places as much, if not more, emphasis on test performance as on student performance. Items are carefully selected to produce a predicted score distribution. This score distribution is expected to match some subjectively set standard (cut score) such as grade level or job readiness. But how the test is administered changes the value and meaning of key functions. Forced choice and student choice produce two different views from the same students.


For many historical reasons, including tradition and short-term accountability, NCLB has used the forced choice mode that only assesses and promotes the lowest levels of thinking. It is fast, cheap, and ineffective. Testing, and unfortunately as a result teaching, limited to the lowest levels of thinking is more counterproductive the longer students are exposed to it. This may be an underlying factor in the poor showing made by high school students, in general, in relation to the lower grades (the spread between the levels of thinking required and those seniors possess may contribute to the current emphasis on senior attitude).

When students are allowed to report what they trust as a basis for further learning and instruction, a wealth of information becomes available for student counseling to direct student development. PUP allows students to switch from forced choice to reporting what they know when they are comfortable doing so.  Knowledge Factor is a patented instructional/assessment system that guarantees mastery learners. Development to use all levels of thinking is critical to success in school and in the workplace.


Many ways of operating standardized testing have been used in assessing students for NCLB. Multiple-choice was derided at first and then returned as the primary method. Almost everything that is not assessed by actual performance can be usefully measured with multiple-choice (A, B, C, D and omit). Traditional multiple-choice was crippled by dropping the option to omit (don’t know) early on. Just counting right marks was easier and gave a useable ranking for grading. How the rank relates to what a student knows or can do is still an open debate. Knowledge and Judgment Scoring settles this matter with a quality score.

A test maker (teacher or standardized item author) has all of the above structure and function options to consider when creating an operational test. The value of the final test results depends upon how the options are mixed and handled (a simple ranking or an assessment of what is known and can be done along with the judgment to use it well).

Test banking can be very simple. It can be a list of 25 questions that is edited each semester. The test is then scored by any one of the above four modes. The choice depends upon the use of the results. RMS ranks students and permits comparing your success from year to year. KJS and the partial credit Rasch model explore which students are lingerers, followers, or self-directed learners. The quality score can point out what each student knows or can do as the basis for further learning and instruction, regardless of the test score.

Test banking can be very complicated, time consuming, and expensive.  Winsteps appears to be about the least complicated, the least time consuming and the least expensive way for standardized testing. It has been used by many states.

A test bank is created from items that have been calibrated by Winsteps. A high scoring sample will produce items with low difficulty. A low scoring sample will produce items with high difficulty. Equating, with the use of a set of common items, can bring these together if the two samples are believed to be from the same population. Winsteps does not know to do this on its own. When and how to equate requires an operational decision.
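
The common-item step can be sketched with a simple mean shift: move the new calibration so the shared items have the same average difficulty they carry in the bank, then carry the rest of the new items along with them. This is my simplified illustration of the idea, not the procedure Winsteps actually runs, and the logit values are invented.

    # Simplified common-item equating by mean shift. Operational equating
    # involves more checks; the numbers here are made up for illustration.

    bank_difficulties = {"q1": -0.50, "q2": 0.20, "q3": 1.10}      # logits in the bank
    new_difficulties  = {"q1": -0.10, "q2": 0.60, "q3": 1.50,      # same common items
                         "q7": 0.00,  "q8": 0.90}                  # plus new items

    common = set(bank_difficulties) & set(new_difficulties)
    shift = (sum(bank_difficulties[q] for q in common)
             - sum(new_difficulties[q] for q in common)) / len(common)

    equated = {q: round(d + shift, 2) for q, d in new_difficulties.items()}
    print(round(shift, 2))   # -0.4
    print(equated)           # the common items land back on the bank's scale

Whether the shift should be applied at all is the operational decision mentioned above: it only makes sense if the two samples are believed to come from the same population.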


However the operations are carried out, human intervention is needed to start it and thereafter at about every other step. Standardized testing is still a mix of art, science and politics.  

A benchmark test is selected from the test bank. A range of item difficulties is selected to match the population to be assessed. A small common item set is included. The mean and standard deviation of the predicted distribution are calculated.  Time and money permitting, the benchmark test is administered one or more times. Now a known mean and standard deviation are in hand for the distribution. This ends research.

An application test is administered to the full population: every Algebra I student in the state, for example. This operational test also contains a set of common items used in creating the benchmark test. Winsteps scores the application test.

Resolution of the test results is not the same as equating items for a test bank. Winsteps can be used here in the same manner as in test banking, but the environment is now very different. A pre-application public declaration of cut scores is no longer recommended due to newly found (Feb 2011) sources of score instability. If the operational test has not performed as expected, the needed adjustment can favor the desired performance for the average score, the cut score, the scaled score, the percent passing, or the percent improvement. Public exposure of average scores has been requested by the Center on Education Policy (CEP), Open letter to the member states of PARCC and SBAC, May 3, 2011. Everyone can then know the starting point for whatever resolution adjustments are made. This would help reestablish public trust and increase the value of test results.

Test banking data can be liberally culled to obtain the best fit of data to the Rasch model because of the unique properties of the model. That same liberal attitude is, in my opinion, not justified when manipulating the operational test results.

The final step for Winsteps is the conversion of measures to expected raw scores. The conversion is a matter of changing log units to normal units when the test results are not manipulated. No human judgment is required. A normal bell curve distribution is again created.
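
That conversion can also be sketched. Under the Rasch model, a student's expected raw score on a fixed set of calibrated items is the sum of the item-by-item success probabilities; the code below is an illustration with invented values, not Winsteps output.

    import math

    # Convert an ability measure (logits) into an expected raw score on a
    # test whose item difficulties (logits) are already calibrated.

    def expected_raw_score(ability, item_difficulties):
        return sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in item_difficulties)

    items = [-1.0, -0.5, 0.0, 0.5, 1.0]      # a hypothetical 5-item test
    for theta in (-1.0, 0.0, 1.0):
        print(theta, round(expected_raw_score(theta, items), 2))
    # -1.0 -> 1.45 right, 0.0 -> 2.5 right, 1.0 -> 3.55 right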

This brings to an end this series of posts on the high jinks exposed in several state education departments. Over the past few years several states have displayed marked deficiencies in their short-term competition for federal money and adequate yearly progress (AYP), including Texas and Illinois (part of the motivation for this year-long investigation into Rasch model IRT test analysis). During this last year New York presented the worst example I know of. In my opinion, the recent cheating scandals in Georgia will have done less damage to students, teachers, and schools than the manipulation of New York state test results by state officials.

Arkansas, on the other hand, has posted almost perfect examples for AYP on NCLB tests for over a ten-year period: 2001-2011 End-of-course Comparison.

(The percent combined proficient and advanced is a derived value. Average test scores, and related cut scores, are based directly upon student marks on the test.)

This demonstrates exceptional skill in managing test performance. Such a performance has therefore invited suspicions of the test becoming more standardized on test performance (the test score) than on student performance (what students know and can do). Were that to be true, it would make Arkansas a good case of successful well-intentioned self-deception, created by instruction (curriculum), learning (level of thinking) and assessment (test items) being optimized for NCLB test results. These doubts are probably not valid given the awards won and leadership demonstrated by Arkansas. Comparison with NAEP also shows that two different views of the same students can vary a great deal. Both views may be validated with sufficient student performance information to clarify what each test is testing. Arkansas has also equated classroom and state test scores as part of their management of grade inflation (again, two views of the same students).

Replacing the national academic lottery conducted with right count scored tests with tests that actually assess what students know and can do, as the basis for further learning and instruction, is one way of clarifying this situation (Knowledge and Judgment Scoring and the partial credit Rasch model, for example). The same tests now used for ranking can also be used when upgrading classroom testing (to assess both quantity and quality) to better prepare students for whatever forms of questions are used on the new NCLB tests. There is a great increase in useful information for students and teachers to direct classroom assignments and activities at all levels of thinking. Or replace the classroom with a complete instruction/assessment package like Amplifire.

The spread of certified competency-based learning may help bring about the needed change in assessment methods. A test must measure what it claims it is measuring. The test results must not be subject to a variety of secretive factors that only delay the inevitable full disclosure. “You can fool part of the people (including yourself) part of the time, but not all of the people all of the time.” The software packages are honest. It is how they are used that is open to question.

Wednesday, August 10, 2011

Grading Clicker Data

The clicker data provided by GMW11 can be assigned grades in many ways. A traditional multiple-choice curve used by GMW11 produced 3 A, 1 B, 24 C, 24 D, and 69 F grades with an average score of 34%.

A typical Knowledge and Judgment Scoring (KJS) distribution, with letter grades set every ten percentage points, would be 1 B, 3 C, 13 D, and 104 F grades. A KJS curve comparable to a right mark scoring (RMS) curve yields 4 A, 3 B, 15 C, 31 D, and 68 F grades with an average score of 49%. The same number, 69 and 68, are passing on each test.
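
The ten-point grading scale is easy to state directly. A small sketch, assuming the cut points noted in the August 3 post below (60% D, 70% C, 80% B, 90% A) and a made-up score list:

    from collections import Counter

    # Ten-point grading bins. Cut points follow the 60/70/80/90 scale noted
    # in the August 3 post; the scores below are invented for illustration.

    def letter_grade(score_pct):
        for cut, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
            if score_pct >= cut:
                return grade
        return "F"

    scores = [95, 88, 72, 66, 61, 58, 44, 39, 31, 22]
    print(Counter(letter_grade(s) for s in scores))
    # Counter({'F': 5, 'D': 2, 'A': 1, 'B': 1, 'C': 1})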

Comparable curving produced similar grade distributions. However, what is being assessed and rewarded is very different. An RMS curve is based on a student’s luck on test day (both in marking and in the selection of questions presented on the test). A KJS curve is based on each student’s self-assessment; it combines knowledge and judgment in selecting questions to use to report what is actually known. Top students earn the same grades by both methods, as do most poor students.

High quality, self-assessing, students earn a reward for reporting what they can trust as the basis for further learning and instruction. The sharper the incline connecting RMS and KJS scores on the chart, the higher the quality. High quality students are teachable. KJS identifies them. RMS does not.

By scoring the clicker data by both methods and curving the scores in the same manner, the difference in student performance under the two scoring methods is clearly exposed. The task of the RMS student is to mark the best guess of a right answer for each question. Understanding, problem solving, and reading ability are secondary and even, at times, unnecessary. All of these are crucial for a KJS student in determining whether a question can be used to report something that is understood, or something with sufficient relationships to other information or skills that a verifiable right answer can be marked.

In this day, all multiple-choice tests should offer both methods of scoring. Students can easily switch from lower to higher levels of thinking; from little responsibility to near full responsibility for learning. Successful implementation requires letting students make the switch. Forcing students into KJS is about as unproductive a thing to do as forcing them to mark an answer to every question on a test they cannot understand or at times even read. Power Up Plus scores both methods, as does Winsteps (full credit and partial credit Rasch IRT models). No additional preparation time or effort is needed beyond that required for creating any multiple-choice test.

To the student: Your highest score/grade is obtained by being honest in reporting what you know, understand, and can trust at any level of preparation.

To the teacher: You know what each student can do and understand as the basis for further learning and instruction.

To the administrator: You know the levels of thinking, for each student, and in classroom instruction, as passive pupils prepare to be independent learners (self-assessing, self-correcting scholars).

Knowledge and Judgment Scoring promotes student development when used on essay tests, multiple-choice tests, and I would suggest the same for clicker data.

Wednesday, August 3, 2011

Scoring Clicker Data

I was recently presented with some clicker data to examine (GMW11). It had been scored by traditional right count scoring. There were a number of scores below 20%. There were even four students with a score of zero. That is way below the average guessing score when using five options on each question.
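
A zero is indeed far below chance for a student who marks everything. A quick check, assuming 23 independent 5-option questions, as on this clicker test; students who marked only a few items (see below) have a much better chance of an all-wrong record.

    # Blind guessing on a 23-question, 5-option test: expected number right
    # and the probability of getting every one of 23 guesses wrong.

    n, p_right = 23, 1 / 5
    expected_right = n * p_right
    p_all_wrong = (1 - p_right) ** n

    print(round(expected_right, 1))   # 4.6 right, about a 20% score
    print(round(p_all_wrong, 4))      # 0.0059 -- roughly a 0.6% chance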

A different score distribution was produced by scoring the data for both knowledge and judgment. This distribution looks very much like what one would expect from students on their first introduction to Knowledge and Judgment Scoring. Students earn the same scores by both methods of scoring when they fail to exercise their own judgment (mark an answer to every question). The top three students therefore obtained the same score with both methods of scoring.

Here is an opportunity to compare Right Mark Scoring (RMS) and Knowledge and Judgment Scoring (KJS) when used on any multiple-choice test. There is one catch: both methods of scoring are being applied to one and the same set of answer sheets.

Normally students would elect which method they felt comfortable using (and if time permits, on the first or second test, they may fill out two answer sheets, one for each method of scoring). The same test data can support a number of different stories. This story will assume that the test was presented with a choice of RMS and KJS, and further, that this was the first such test for the class. Most would be expected to select what they are most familiar with: RMS.


When quantity and quality are scatterplotted from RMS data, the result is a straight line. Only one dimension is being measured: a count of right marks.
 
KJS data are two-dimensional. A range of quality scores can yield the same test score. The test score of 46% was earned by students with a range of quality scores from zero to 44%.

Higher quality students are found above 50%. Lower quality students are found below 50%. Higher quality students get higher scores by marking more right answers. Only one student marked a perfect 100% quality score (no wrong marks).

Lower quality students get lower scores by marking more wrong answers. Four marked a zero quality score (every one of their two to 10 marks on the 23 question test was wrong).

Quantity and quality have been given equal value. The active test score then starts at 50%: 1 point for right, 1 point for good judgment (omit or right), and zero for wrong (poor judgment). [Back in the 1970s, when this work first began, the active test score started with zero. It was called net yield scoring; right minus wrong. The discovery of the quality score produced the second dimension that assesses student performance rather than defaulting to luck on test day.]
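
A minimal sketch of that point scheme, rendered from the description above rather than taken from PUP: two points for a right mark (knowledge plus judgment), one point for an omit (judgment), zero for a wrong mark, with quality taken as right marks out of marks made and quantity as right marks out of all questions.

    # Knowledge and Judgment Scoring sketch: 2 points right, 1 point omit,
    # 0 points wrong. Quality (%RT) here is right / marked; quantity is
    # right / total questions. My rendering, not PUP's code.

    def kjs_scores(right, wrong, omit):
        n = right + wrong + omit
        test_score = 100.0 * (2 * right + omit) / (2 * n)    # all-omit sheet starts at 50%
        quality = 100.0 * right / (right + wrong) if (right + wrong) else None
        quantity = 100.0 * right / n
        return round(test_score, 1), quality, round(quantity, 1)

    print(kjs_scores(0, 0, 23))    # (50.0, None, 0.0)   -- blank sheet, the 50% start
    print(kjs_scores(0, 2, 21))    # (45.7, 0.0, 0.0)    -- near 46% with zero quality
    print(kjs_scores(10, 0, 13))   # (71.7, 100.0, 43.5) -- perfect judgment, partial knowledge

With this weighting every wrong mark costs a point of judgment, which is what separates the quality dimension from a simple right count.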

The end result of training students to accurately report what they trust they know or can do is shown in the Fall88 scatterplot. After an initial test (such as the clicker data) where most students elect RMS, they change study habits, and voluntarily switch to KJS. Here most of the class show a quality score about one letter grade higher than their test score. There is a bit of a disconnect at the pass/fail line of 60% (70% C, 80% B, and 90% A). Experienced students feel more comfortable reporting what they know than guessing at answers on all items on the test. They are on the path to being independent learners (self-correcting scholars).

This is in contrast to traditional right mark scoring where any score can be one letter grade higher with good luck to one letter grade lower with bad luck than a student’s actual ability. A grade of B one day may be a D on another day. And no one, including the student, knows what the student actually knows and can do as the basis for further learning and instruction.

Grading has an important effect on which scoring method students select: RMS (“I mark, you score”) or KJS (self-assessment, “I tell you”). RMS students tend to cram and to match. KJS students bring a rich web of relationships (from learning by questioning, answering, and verifying) that they can apply to questions they have not seen before. There is an operational difference between remembering and understanding that can be measured (RMS vs. KJS).