Wednesday, January 14, 2015

Meaningful Multiple-Choice Test Scores

The meaning of a multiple-choice test score is determined by several factors in the testing cycle, including test creation, test instructions, and how far the responsibility for learning and reporting has shifted from teacher to student. Luck-on-test-day is assumed, in this discussion, to have similar effects on all of the scoring methods that follow.

[Luck-on-test-day includes but is not limited to: the test blueprint, question author, item calibration, test creator, teacher, curriculum, and standards; the classroom, home, and in-between environments; and a little bit of random chance (the act of God that psychometricians must smooth out of their data).]

There are three ways of obtaining test scores: open-ended short answer, closed-ended right-count four-option multiple-choice, and knowledge and judgment scoring (KJS) of both short answer and multiple-choice. These range from familiar manual scoring to what is now easily done with KJS computer software. Each method of scoring has a different starting score with a different meaning. The customary average classroom score of 75% is assumed (60% passing).

Chart 1/4

Open-ended short answer scores start at zero and increase with each acceptable answer. There may be several acceptable answers to a single short answer question. The level of thinking required depends upon the stem of the question. A question may have acceptable answers at both lower and higher levels of thinking. These properties carry over into KJS below.

The teacher or test maker is responsible for scoring the test (Mastery = 60%, Wrong = 0%, for a passing quantity score of 60% in Chart 1/4). The quality of the answers can be judged by the scorer and may influence which ones are counted as right answers.

The open-ended short answer question is flexible (multiple right answers) and somewhat subjective, yet different scorers are expected to produce similar scores. The average test score is controlled by selecting a set of items expected to yield an average of 75%. The student test score is a rank based on the items included in the test: items that survey what students were expected to master, items that separate students who know from those who do not, and items that fail to show either mastery or discrimination (unfinished items, for a host of reasons including luck-on-test-day above).

The open-ended short answer question can also be scored as a multiple-choice item. First tabulate the answers. Then sort the answers from high to low count. The most frequent answer, on a normal question, will be the right answer option. The next three answers in rank will be real, student-supplied wrong answer options (rather than wrong options created by the test writer). This pseudo-multiple-choice item can now be printed as a real question on your next multiple-choice test (with the answers scrambled).
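Here is a minimal sketch of that tabulate-and-rank step. The question and the student responses are made up for illustration; only the procedure comes from the description above.

```python
# Build a pseudo-multiple-choice item from tabulated short-answer responses.
from collections import Counter
import random

# Hypothetical short-answer responses collected from one class.
responses = [
    "mitochondria", "mitochondria", "mitochondria", "mitochondria",
    "ribosome", "ribosome", "nucleus", "nucleus", "chloroplast", "cell wall",
]

counts = Counter(responses)                              # tabulate the answers
ranked = [answer for answer, _ in counts.most_common()]  # sort high to low

key = ranked[0]            # most frequent answer becomes the right option
distractors = ranked[1:4]  # next three are student-supplied wrong options

options = [key] + distractors
random.shuffle(options)    # scramble the options before printing the item

print("Which organelle produces most of the cell's ATP?")
for letter, option in zip("ABCD", options):
    print(f"  {letter}. {option}")
print("Key:", "ABCD"[options.index(key)])
```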

A high quality student could mark only right answers on a first pass through the above test (Chart 1/4) and then finish by just marking (guessing) on a second pass to earn a score of 60%. A lower quality student could just mark each item in order, as is usually done on multiple-choice tests, mixing right and wrong marks, to earn the same score of 60%. Using only a score obtained after the test, we cannot see what took place during the test. Turning a short answer test into a traditional multiple-choice test hides student quality, the very thing the CCSS movement is now promoting.

Chart 2/4

Closed-ended right-count four-option multiple-choice scores start at zero and increase with each right mark. Not really!! That is only how this method of scoring has been marketed for a century, by considering only a count of right marks after the test is completed. In the first place, traditional multiple-choice is not really multiple-choice but forced-choice (it lacks the one option discussed below). Forcing a choice injects a 25% bonus (on average) at the start of the test (Chart 2/4). This evil flaw in test design was countered, over 50 years ago, by a now defunct “formula scoring”. After forcing students to guess, psychometricians wanted to remove the effect of just marking! It took the SAT until March 2014 to drop this “score correction”.

[Since there was no way to tell which right marks were merely lucky guesses, the correction made no sense to anyone other than psychometricians wanting to optimize their data reduction tools, with little regard for its effect on the students taking such a test. Now that 4-option questions have become popular on standardized tests, a student who can eliminate one option can guess from the remaining three for better odds of getting a right mark (which is not necessarily a right answer that reflects recall, understanding, or skill).]
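For readers who want the arithmetic, here is a small sketch of the 25% on-average bonus and the old formula-scoring correction. The 40-item test length is an assumption for illustration only.

```python
# Expected effect of guessing on a forced-choice test, and the old
# "formula scoring" correction (right minus wrong divided by options - 1).
items, options = 40, 4     # assumed test length; four answer options

# Blind guessing: one chance in four per item, a 25% bonus on average.
expected_right = items / options
print(f"Blind guessing: {expected_right / items:.0%} of items right on average")

# Eliminating one option first raises the odds to one in three.
print(f"After eliminating one option: {1 / (options - 1):.0%} of items right")

# Formula scoring pushes a pure guesser's expected score back to zero.
right = expected_right
wrong = items - right
print(f"Formula-scored pure guesser: {right - wrong / (options - 1):.1f} points")
```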

The closed-ended right-count four-option multiple-choice question is inflexible (one right answer) and has no scoring subjectivity; all scorers produce the same count of right marks. Again, the average test score is controlled by selecting a set of items expected to yield 75% on average (60% passing). However, this 75% is not the same as the 75% on the open-ended short answer test. As a forced-choice test, the multiple-choice test will be easier; it starts with a 25% on-average advantage. (That average means one student may start with a 15% advantage and a classmate with 35%.) To further confound things, the level of thinking used by students can also vary. A forced-choice test can be marked entirely at lower levels of thinking.
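That 15% to 35% spread is easy to see in a quick simulation of blind guessing. The test length, class size, and random seed below are assumptions for illustration.

```python
# Simulate the guessing bonus for individual students on a forced-choice test.
import random

random.seed(1)                        # fixed seed so the example repeats
items, options, students = 20, 4, 10  # assumed test length and class size

bonuses = []
for _ in range(students):
    # Count lucky right marks from blind guessing on every item.
    right = sum(random.randrange(options) == 0 for _ in range(items))
    bonuses.append(right / items)

print("Guessing bonus per student:", ", ".join(f"{b:.0%}" for b in bonuses))
print(f"Class average: {sum(bonuses) / len(bonuses):.0%}")   # near 25%
```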

[Standardized tests control part of the above problems by eliminating almost all mastery and unfinished items. The game is to use the fewest items that will produce a desired score distribution with acceptable reliability. A score of 60% on a traditionally scored multiple-choice standardized test is therefore a much more difficult accomplishment than the same score on a classroom test.]

A forced-choice test score is a rank of how well a student did on a test. It is not a report of what a student actually knows or can do that can serve as the basis for further instruction and learning. The reasoning is rather simple: the forced-choice score is counted up AFTER the test is finished; it is the final game score. How the game started (25% on average) and how it was played are not observed (yet that is exactly what sports fans pay for). It is also what students and teachers need to know so students can take responsibility for self-corrective learning.

Chart 3/4

[Three student performances that all end up with a traditional multiple-choice score of 60% are shown in Chart 3/4. The highest quality student used two passes: “I know or can do this, or I can eliminate all the wrong options,” then “I don’t have a clue.” The next lower quality student used three passes: “I know or can do this”; “I can eliminate one or more answer options before marking”; and “I am just marking.” The lowest level of thinking student just marks answers in one pass, right and wrong, as most low quality, lower level of thinking students do. But what takes place during the test is not seen in a score made after the test (see the sketch below). The lowest quality student must review all past work (if tests are cumulative) or carry the extra burden of continuing on as a low quality student. A high quality student needs only to check on what has not been learned.]
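A tiny sketch, using made-up answer sheets, shows how the after-the-test right count collapses these three ways of playing the game into one number.

```python
# Three hypothetical 10-item answer sheets: 'R' = right mark, 'W' = wrong mark.
# The patterns differ, but the right-count score made after the test does not.
sheets = {
    "two passes (sure answers, then blind marking)": "RRRRRRWWWW",
    "three passes (eliminate options, then mark)":   "RRRRWRWRWW",
    "one pass (marked straight through)":            "RWRRWRRWRW",
}

for label, marks in sheets.items():
    print(f"{label}: {marks.count('R') / len(marks):.0%}")   # 60% each time
```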

Chart 4/4

Knowledge and Judgment scores start at 50% for every student, plus one point for each acceptable answer and minus one point for each unacceptable answer (right/wrong on traditional multiple-choice). (Lower level of thinking students prefer the equivalent tally: Wrong = 0, Omit = 1, and Right = 2.) Omitting an answer is good judgment: it reports what has yet to be learned or understood, and it keeps the one point for good judgment. An unacceptable or wrong mark is poor judgment and loses one point.

Now what is hidden with forced-choice scoring is visible with Knowledge and Judgment Scoring (KJS). Each student can show how the game is played. There is a separate student score for quantity and for quality. A starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge Factor sets the starting score near 75%. Judgment is far more important than knowledge in high-risk occupations.]

KJS includes a fifth answer option: omit (good judgment to report what has yet to be learned or understood). When this option is not used, the test reverts to forced-choice scoring (marking one of the four answer options for every question).
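A minimal sketch of the rule, using the two-points-per-item tally above (Right = 2, Omit = 1, Wrong = 0) and taking quality (judgment) as the percent of marks that are right, looks like this. The mark string is hypothetical.

```python
# Knowledge and Judgment Scoring: Right = 2, Omit = 1, Wrong = 0 per item,
# which is the same as starting at 50% and adding or subtracting one point.
def kjs_score(marks):
    """marks: a string of 'R' (right), 'W' (wrong), and 'O' (omit)."""
    points = {"R": 2, "O": 1, "W": 0}
    score = sum(points[m] for m in marks) / (2 * len(marks))
    quantity = marks.count("R") / len(marks)                  # right marks
    marked = marks.count("R") + marks.count("W")
    quality = marks.count("R") / marked if marked else 1.0    # judgment
    return score, quantity, quality

# With no omits the score collapses back to the right-count percent,
# the "reverts to forced-choice scoring" point made above.
print(kjs_score("R" * 6 + "W" * 4))   # (0.6, 0.6, 0.6)
```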

A high quality student marked 10 right out of 10 marked and then omitted the remainder (two passes through the test), or picked up a few more items in matched pairs of one right and one wrong (three passes), for a passing score of 60% in Chart 4/4. A student of less quality did not omit but just marked everything, for a score of less than 50%. A lower level of thinking, low quality student marked 10 right and just marked the rest (two passes) for a score of less than 40%. KJS yields a score based on student judgment (60%) or on the lack of that judgment (less than 50%).
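Recomputed with the same Right = 2, Omit = 1, Wrong = 0 rule, the three Chart 4/4 performances come out as follows. A 50-item test is assumed, and the guessing outcomes for the second and third students are illustrative numbers chosen to land under 50% and under 40%.

```python
# Percent score for a given mix of right, omitted, and wrong items.
def kjs_pct(right, omit, wrong):
    return (2 * right + omit) / (2 * (right + omit + wrong))

print(f"High quality, 10 right and 40 omitted: {kjs_pct(10, 40, 0):.0%}")  # 60%
print(f"Less quality, just marked, 24 right:   {kjs_pct(24, 0, 26):.0%}")  # 48%
print(f"Low quality, 10 known plus guesses:    {kjs_pct(19, 0, 31):.0%}")  # 38%
```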

In summary, the current assessment fad is still oriented toward right marks rather than toward student judgment (and development). Students with practiced good judgment develop the sense of responsibility needed to learn at all levels of thinking. They do not have to wait for the teacher to tell them they are right. Learning becomes stimulating and exhilarating. It is fun to learn when you can question, get answers, and verify a right answer or a new level of understanding; when you can build on your own trusted foundation.

Low quality students learn by repeating the teacher. High quality students learn by making sense of an assignment. Traditional multiple-choice (TMC) assesses and rewards lower levels of thinking. KJS assesses and rewards all levels of thinking. TMC requires little sense of responsibility. KJS rewards (encourages) the sense of responsibility needed to learn at all levels of thinking.

1. A hand-scored short answer test score is an indicator of student ability and class ranking based on the scorer’s judgment. The scorer can make a subjective estimate of student quality.

2. A TMC score is only a rank on a completed test, with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.

3. A KJS test score is based on a student’s self-reported estimate of what the student knows and can do on a completed test (quantity) and of the student’s ability to make use of that knowledge during the test (judgment, or quality). The score carries student judgment and quality, not scorer judgment and quality.

In short, students who know that they can learn (from rapid feedback on quantity and quality), and who want to learn, enjoy learning (see Amplifire below). All testing methods fail to promote these student development characteristics unless the test results are meaningful, easy for students and teachers to use, and timely. Student development requires student performance, not just talking about it or labeling something formative assessment.

Power Up Plus (PUP or PowerUP) scores both TMC and KJS. Students have the option of selecting the method of scoring they are comfortable with. Standardized tests scored both ways can estimate the level of thinking used in the classroom and by each student. Lack of information, misinformation, misconceptions, and cheating can be detected by school, teacher, classroom, and student.

Power Up Plus is hosted at TeachersPayTeachers to share what was learned over a nine-year period with 3,000 students at NWMSU. The free download below supports individual teachers who want to upgrade their multiple-choice tests for formative, cumulative, and exit-ticket assessment. Good teachers, working within the bounds of accepted standards, do not need to rely on expensive assessments. They (and their students) do need fast, easy-to-use test results to develop successful, high quality students.

I hope your students respond with the same positive enthusiasm that over 90% of mine did. We need to assess students to promote their abilities. We do not need to assess students primarily to promote the development of psychometric tools that yield far less than what is marketed.

A Brief History:

Geoff Masters (1950-    )   A modification of traditional multiple-choice test scoring.

Created partial credit scoring for the Rasch model (1982) as a scoring refinement for traditional right-count multiple-choice. It gives partial credit for near-right marks. It does not change the meaning of the right-count score (since quantity and quality have the same value by default [both wrong marks and blanks are counted as zeros], only quantity is scored). The routine is free in Ministep software.

Richard A. Hart (1930-    )   Promotes student development by student self-assessment of what each student actually knows and can do, AFTER learning, with “next class period” feedback.

Knowledge and Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to reduce the time needed for students to write, and for me to score, short answer and essay questions. I created software (1981) to score multiple-choice both ways, right-count and knowledge and judgment, to encourage students to take responsibility for what they were learning at all levels of thinking in any subject area. Students voted to give knowledge and judgment equal value. The right-count score retains the same meaning (quantity of right marks) as above. The knowledge and judgment score is a composite of the judgment score (quality, the “feel good” score AFTER learning) and the right-count score (quantity). Power Up Plus (2006) is classroom friendly (for students and teachers) and a free download: Smarter Test Scoring and Item Analysis.

Knowledge Factor (1995-    )   Promotes student learning and retention by assessing student knowledge and confidence, DURING learning, with “instant” feedback to develop “feeling good” during learning.

Knowledge Factor was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment program, Amplifire, places much more weight on confidence than on knowledge (a wrong mark may reduce the score by three times as much as a right mark adds). The software leads students through the steps needed to learn easily, quickly, and in a depth that is easily retained for more than a year. Students do not have to master the study skills and the sense of responsibility needed to learn at all levels of thinking, as they must for mastery with KJS. Amplifire is student friendly, online, and so commercially successful in developed topics that it is not free.


[Judgment and confidence are not the same thing. Judgment is measured by performance (percent of right marks), AFTER learning, at any level of student score. Confidence is a good feeling that Amplifire skillfully uses, DURING learning and self-assessment, to promote rapid learning to a mastery level. Students can take confidence in their practiced and applied self-judgment. The KJS and Amplifire test scores reflect the complete student. IMHO standardized tests should do this also, considering their cost in time and money.]