The meaning of a multiple-choice test score is determined by several factors in the testing cycle, including test creation, test instructions, and the shift of responsibility for learning and reporting from teacher to student. In this discussion, luck-on-test-day is assumed to have similar effects on all of the following scoring methods.
[Luck-on-test-day includes, but is not limited to: the test blueprint, question author, item calibration, test creator, teacher, curriculum, and standards; the classroom, home, and in-between environments; and a little bit of random chance (the act of God that psychometricians need to smooth out of their data).]
There are three ways of obtaining test scores: open-ended short answer, closed-ended right-count four-option multiple-choice, and knowledge and judgment scoring (KJS) for both short answer and multiple-choice. These range from familiar manual scoring to what is now easily done with KJS computer software. Each method of scoring has a different starting score with a different meaning. The customary average classroom score of 75% is assumed (60% passing).
Open-ended short answer scores start at zero and increase with each acceptable answer. There may be several acceptable answers for a single short answer question. The level of thinking required depends upon the stem of the question. A question may have acceptable answers at both lower and higher levels of thinking. These properties carry over into KJS below.
The teacher or test maker is responsible for scoring the test (mastery = 60%, plus wrong = 0%, for a passing quantity score of 60% in Chart 1/4). The quality of the answers can be judged by the scorer and may influence which ones are considered right answers.
The open-ended short answer question is flexible (multiple right answers) and somewhat subjective; still, different scorers are expected to produce similar scores. The average test score is controlled by selecting a set of items expected to yield an average test score of 75%. The student test score is a rank based on the items included in the test: items that survey what students were expected to master, items that separate students who know each item from those who do not, and items that fail to show either mastery or discrimination (unfinished items, for a host of reasons including luck-on-test-day above).
The open-ended short answer question can also be scored as a multiple-choice item. First tabulate the answers. Sort the answers from high to low count. The most frequent answer, on a normal question, will be the right answer option. The next three ranking answers will be real, student-supplied wrong answer options (rather than test-writer-created wrong answer options). This pseudo-multiple-choice item can now be printed as a real question on your next multiple-choice test (with answers scrambled).
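A minimal sketch of that tabulation (the answers, counts, and function name here are hypothetical, purely for illustration):

```python
from collections import Counter

def short_answer_to_item(responses, n_distractors=3):
    """Turn one question's short-answer responses into a pseudo-multiple-choice item.

    Assumes a 'normal' question: the most frequent response is the right answer,
    and the next most frequent responses become student-supplied wrong options.
    Scramble the options before printing the item on a real test.
    """
    counts = Counter(answer.strip().lower() for answer in responses)
    ranked = [answer for answer, _ in counts.most_common()]
    return ranked[0], ranked[1:1 + n_distractors]

# Hypothetical tabulation of 25 answers to one question:
responses = ["mitosis"] * 14 + ["meiosis"] * 6 + ["osmosis"] * 3 + ["fission"] * 2
print(short_answer_to_item(responses))
# ('mitosis', ['meiosis', 'osmosis', 'fission'])
```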
A high-quality student could also mark only right answers on the first pass through the above test (Chart 1/4) and then finish by just marking on the second pass, to earn a score of 60%. A lower-quality student could just mark each item in order, as is usually done on multiple-choice tests, mixing right and wrong marks, to earn the same score of 60%. Using only a score obtained after the test, we cannot see what took place during the test. Turning a short answer test into traditional multiple-choice hides student quality, the very thing that the CCSS movement is now promoting.
Chart 2/4
Closed-ended right-count four-option multiple-choice scores start at zero and increase with each right mark. Not really! That is only how this method of scoring has been marketed for a century: by considering only a right-count score after the test is completed. In the first place, traditional multiple-choice is not multiple-choice but forced-choice (it lacks one option, discussed below). This injects a 25% bonus (on average) at the start of the test (Chart 2/4). This evil flaw in test design was countered, over 50 years ago, by the now defunct “formula scoring”. After forcing students to guess, psychometricians wanted to remove the effect of just marking! It took the SAT until March of this year, 2014, to drop this “score correction”.
[Since there was no way to tell which right answer must be changed for the correction, it made no sense to anyone other than psychometricians wanting to optimize their data reduction tools, with disregard for the effect of the correction on the students taking such a test. Now that 4-option questions have become popular on standardized tests, a student who can eliminate one option can guess from the remaining three for better odds of getting a right mark (which is not necessarily a right answer that reflects recall, understanding, or skill).]
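The arithmetic behind the built-in bonus, the better odds after eliminating an option, and the defunct correction can be sketched as follows (a minimal illustration using the classical correction-for-guessing formula R − W/(k−1); the function names are hypothetical, not from any scoring package):

```python
def chance_of_right_mark(options_left=4):
    """Probability of a lucky right mark when blindly guessing among the
    options a student could not eliminate."""
    return 1 / options_left

def formula_score(right, wrong, k=4):
    """Classical correction for guessing ("formula scoring"): each wrong mark
    cancels 1/(k-1) of a right mark, so blind guessing nets zero on average."""
    return right - wrong / (k - 1)

print(chance_of_right_mark())                  # 0.25 -> the 25% on-average bonus
print(chance_of_right_mark(options_left=3))    # 0.333... after eliminating one option
print(formula_score(right=10, wrong=30))       # 0.0 -> 40 blind guesses average out to nothing
```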
The closed-ended right-count four-option multiple-choice question is inflexible (one right answer) and has no scoring subjectivity; all scorers yield the same count of right marks. Again, the average test score is controlled by selecting a set of items expected to yield 75% on average (60% passing). However, this 75% is not the same as that for the open-ended short answer test. As a forced-choice test, the multiple-choice test will be easier; it starts with a 25% on-average advantage. (That means one student may start with 15% and a classmate with 35%.) To further confound things, the level of thinking used by students can also vary. A forced-choice test can be marked entirely at lower levels of thinking.
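A rough sketch of why that spread happens (assuming, for illustration only, a hypothetical 50-item test answered entirely by blind guessing, modeled as binomial lucky right marks):

```python
from math import sqrt

def guessing_spread(n_guessed=50, p=0.25):
    """Mean and standard deviation of lucky right marks when n_guessed
    four-option items are answered by blind guessing (binomial model)."""
    mean = n_guessed * p
    sd = sqrt(n_guessed * p * (1 - p))
    return mean, sd

mean, sd = guessing_spread()
print(mean, round(sd, 1))  # 12.5 right marks, +/- about 3.1
print(round((mean - 2 * sd) / 50, 2), round((mean + 2 * sd) / 50, 2))
# ~0.13 to ~0.37 of the test: one guesser lands near 15%, a classmate near 35%
```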
[Standardized tests control some of the above problems by eliminating almost all mastery and unfinished items. The game is to use the fewest items that will produce a desired score distribution with an acceptable reliability. A score of 60% on a traditionally scored multiple-choice standardized test is a much more difficult accomplishment than the same score on a classroom test.]
A forced-choice test score is a rank of how well a student did on a test. It is not a report of what a student actually knows or can do that will serve as the basis for further instruction and learning. The reasoning is rather simple: the forced-choice score is counted up AFTER the test is finished; this is the final game score. How the game started (25% on average) and how it was played are not observed (though that is what sports fans pay for). Yet that is what students and teachers need to know so students can take responsibility for self-corrective learning.
Chart 3/4
Chart 4/4
Knowledge and Judgment scores start at 50% for every student, plus one point for each acceptable answer and minus one point for each unacceptable answer (right/wrong on traditional multiple-choice). (Lower-level-of-thinking students prefer the equivalent form: Wrong = 0, Omit = 1, and Right = 2.) Omitting an answer is good judgment: it reports what has yet to be learned or understood. Omitting keeps the one point for good judgment. An unacceptable or wrong mark is poor judgment; it loses one point.
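Here is a minimal sketch of that point scheme (an illustration only, not the Power Up Plus code; None stands for an omitted item):

```python
def kjs_score(responses, key):
    """Knowledge and Judgment Scoring, sketched with the Wrong = 0, Omit = 1,
    Right = 2 form, which is equivalent to starting at 50% and adding or
    subtracting one point per marked item."""
    points = 0
    for mark, right in zip(responses, key):
        if mark is None:        # omit: good judgment, keep the one point
            points += 1
        elif mark == right:     # right mark: knowledge plus judgment, two points
            points += 2
        # wrong mark: poor judgment, zero points
    return 100 * points / (2 * len(key))
```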
Now what is hidden with forced-choice scoring is visible with Knowledge and Judgment Scoring (KJS). Each student can show how the game is played. There is a separate student score for quantity and for quality. A starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge Factor sets the starting score near 75%. Judgment is far more important than knowledge in high-risk occupations.]
KJS includes a fifth answer option: omit (good judgment to
report what has yet to be learned or understood). When this option is not used,
the test reverts to forced-choice scoring (marking one of the four answer
options for every question).
A high-quality student marked 10 right out of 10 marked and then omitted the remainder (two passes through the test), or picked up a few more items at one right for each wrong (three passes), for a passing score of 60% in Chart 4/4. A student of less quality did not omit but just marked everything, for a score of less than 50%. A lower-level-of-thinking, low-quality student marked 10 right and just marked the rest (two passes) for a score of less than 40%. KJS yields a score based on student judgment (60%) or on the lack of that judgment (less than 50%).
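Using the sketch above, and assuming Chart 4/4 describes a 50-item test (the test length is inferred from the arithmetic, not stated in the chart):

```python
# Hypothetical 50-item test with a placeholder key: 'R' marks are right,
# 'W' marks are wrong, None is an omitted item.
key = ['R'] * 50

high_quality = ['R'] * 10 + [None] * 40   # 10 right, 40 honest omits (two passes)
low_quality  = ['R'] * 20 + ['W'] * 30    # 10 known + ~25% lucky guesses, no omits

print(kjs_score(high_quality, key))  # 60.0 -> passing, built on judgment
print(kjs_score(low_quality, key))   # 40.0 -> just marking; judgment not shown
```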
In summary, the current assessment fad is still oriented toward right marks rather than toward student judgment (and development). Students with practiced good judgment develop the sense of responsibility needed to learn at all levels of thinking. They do not have to wait for the teacher to tell them they are right. Learning becomes stimulating and exhilarating. It is fun to learn when you can question, get answers, and verify a right answer or a new level of understanding; when you can build on your own trusted foundation.
Low-quality students learn by repeating the teacher. High-quality students learn by making sense of an assignment. Traditional multiple-choice (TMC) assesses and rewards lower levels of thinking. KJS assesses and rewards all levels of thinking. TMC requires little sense of responsibility. KJS rewards (encourages) the sense of responsibility needed to learn at all levels of thinking.
1. A short answer, hand-scored test score is an indicator of student ability and class ranking based on the scorer's judgment. The scorer can make a subjective estimate of student quality.
2. A TMC score is only a rank on a completed test, with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.
3. A KJS test score is based on a student's self-reported estimate of what the student knows and can do on a completed test (quantity) and an estimate of the student's ability to make use of that knowledge (judgment) during the test (quality). The score carries student judgment and quality, not scorer judgment and quality.
In short, students who know that they can learn (from rapid feedback on quantity and quality), and who want to learn, enjoy learning (see Amplifire below). All testing methods fail to promote these student development characteristics unless the test results are meaningful, easy for students and teachers to use, and timely. Student development requires student performance, not just talking about it or labeling something formative assessment.
Power Up Plus (PUP or PowerUP) scores both TMC and KJS. Students have the option of selecting the method of scoring they are comfortable with. Such standardized tests have the ability to estimate the level of thinking used in the classroom and by each student. Lack of information, misinformation, misconceptions, and cheating can be detected by school, teacher, classroom, and student.
Power Up Plus is hosted at TeachersPayTeachers to share what was learned over a nine-year period with 3,000 students at NWMSU. The free download below supports individual teachers who want to upgrade their multiple-choice tests for formative, cumulative, and exit-ticket assessment. Good teachers, working within the bounds of accepted standards, do not need to rely on expensive assessments. They (and their students) do need fast, easy-to-use test results to develop successful, high-quality students.
I hope your students respond with the same positive enthusiasm that over 90% of mine did. We need to assess students to promote their abilities. We do not need to assess students primarily to promote the development of psychometric tools that yield far less than what is marketed.
A Brief History:
Geoff Masters (1950- )
A modification of traditional multiple-choice test performance.
Created partial credit scoring for the Rasch model (1982) as a scoring refinement for traditional right-count multiple-choice. It gives partial credit for near-right marks. It does not change the meaning of the right-count score (as quantity and quality have the same value by default [both wrong marks and blanks are counted as zeros], only quantity is scored). The routine is free in Ministep software.
Richard A. Hart (1930- )
Promotes student development by student self-assessment of what each
student actually knows and can do, AFTER learning, with “next class period”
feedback.
Knowledge and Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to reduce the time needed for students to write, and for me to score, short answer and essay questions. I created software (1981) to score multiple-choice tests by both right-count and knowledge and judgment, to encourage students to take responsibility for what they were learning, at all levels of thinking, in any subject area. Students voted to give knowledge and judgment equal value. The right-count score retains the same meaning (quantity of right marks) as above. The knowledge and judgment score is a composite of the judgment score (quality, the “feel good” score AFTER learning) and the right-count score (quantity). Power Up Plus (2006) is classroom friendly (for students and teachers) and a free download: Smarter Test Scoring and Item Analysis.
Knowledge Factor
(1995- ) Promotes student learning and retention by assessing student
knowledge and confidence, DURING learning, with “instant” feedback to develop “feeling
good” during learning.
Knowledge Factor was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment program, Amplifire, places much more weight on confidence than on knowledge (a wrong mark may reduce the score by three times as much as a right mark adds). The software leads students through the steps needed to learn easily, quickly, and at a depth that is easily retained for more than a year. Students do not have to master the study skills and the sense of responsibility needed to learn at all levels of thinking, as is needed for mastery with KJS. Amplifire is student-friendly, online, and so commercially successful in developed topics that it is not free.
[Judgment and confidence are not the same thing. Judgment is measured by performance (percent of right marks), AFTER learning, at any level of student score. Confidence is a good feeling that Amplifire skillfully uses to promote rapid learning, DURING learning and self-assessment, up to a mastery level. Students can take confidence in their practiced and applied self-judgment. The KJS and Amplifire test scores reflect the complete student. IMHO standardized tests should do this also, considering their cost in time and money.]