Two recent news items highlight the problems produced by
faulty communication among the assessment, education, and political communities. A
test may be inadequate to deliver the requested information, and this has played
out in two different ways: in the state of Washington, the test was used anyway;
in Texas, a test was replaced with a Projection Measure that “was so silly that
it was killed” after a brief use.
Large amounts of money are involved in such exercises. A
satisfied customer in this area must understand the limits of what
is being purchased. We do not want to show up for dinner at 7:00 pm only to find it
was served at 12:00 noon; “dinner” and “lunch” can refer to the same meal or to
different ones, depending upon the culture.
Psychometricians have been lax in communicating what they do,
in an understandable form, to the cultures that finance them and to those who
attempt to make valid use of their work. In 50 years of experience, I have
not found a unified expression of common education statistics, or a way of
accomplishing that feat, that is meaningful and therefore useful. The personal
computer, the interactive spreadsheet, and the Internet should now make this
possible.
This set of posts is designed so that anyone interested in
the topic of multiple-choice testing can see inside six commonly used
education statistics. The series will also include Excel what-if engines to
animate them. You only understand
after you have experienced. It is
only when several statistics are combined that their interactions and limits
become visible. Combining statistics interactively also simplifies the naming
of variables, as only one name is needed where several might otherwise be used.
I will attempt to produce an understandable graphic for each
of six common education statistics that I have encountered in use with
traditional multiple-choice tests (TMC); a short computational sketch follows the list:
- count
- average or mean
- standard deviation, or the spread of the distribution of scores about the mean
- test reliability or the ability to reproduce the same scores
- standard error of measurement or the range in which a student’s score may fall
- item discrimination or the ability of a question to group students into one group that knows (and is lucky) and one group that does not know (and is unlucky).
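For readers who want to see the arithmetic before the Excel engines arrive, here is a minimal sketch in Python of one reasonable formulation (my assumption, not the spreadsheet code itself): KR-20 for reliability and the point-biserial correlation for item discrimination, computed on a toy 0/1 mark matrix. Textbook conventions for the variance denominator in KR-20 vary slightly.

```python
import numpy as np

# Toy data (hypothetical): one row per student, one column per item; 1 = right mark.
marks = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])

scores = marks.sum(axis=1)           # 1. count: right marks per student (row sums)
difficulty = marks.mean(axis=0)      # item difficulty: percent right (column means)
mean = scores.mean()                 # 2. average or mean score
sd = scores.std(ddof=1)              # 3. standard deviation: spread of the scores

k = marks.shape[1]                   # number of items on the test
kr20 = (k / (k - 1)) * (1 - (difficulty * (1 - difficulty)).sum()
                        / scores.var(ddof=1))        # 4. test reliability (KR-20)
sem = sd * np.sqrt(1 - kr20)         # 5. standard error of measurement

# 6. item discrimination: correlation of each item column with the total score
disc = [np.corrcoef(marks[:, j], scores)[0, 1] for j in range(k)]
```

Each quantity above corresponds to one of the six statistics in the list; the what-if engines animate the same relationships interactively.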
If you are comfortable with traditional education
statistics, you may want to skip to the first spreadsheet: Test Reliability
Engine. If you are interested in the findings summary of this audit, skip to
[to be posted]. If you are interested in the details as I work through this
project, please read on.
Your comments will be appreciated, especially about errors and
omissions (corrections are easily made on a blog). I want the facts to be
readily seen and understood, rather than have you rely on me as one more authority
(“trust me”, from The Jungle Book, and from any number of commercial, education, and
political organizations).
Please practice with your students using Break Out (free) to learn the difference between traditional multiple-choice
(TMC) and Knowledge and Judgment Scoring (KJS). The Common Core State Standards
(CCSS) movement demands that passive pupils become engaged, active,
self-correcting, high-quality achievers.
The student mark data from the Nursing124.ANS file contains
the right marks made by 22 students on 21 questions. Extreme scores and difficulties
(100%) were eliminated from the original 24 by 24 matrix when I was working on my audit
of the Rasch model.
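For readers who want to reproduce that trimming step, here is a hedged sketch (my reconstruction, not the code actually used in the audit): students and items with perfect or zero scores carry no usable information for the Rasch model, so they are dropped, and the pass repeats because removing a row can create a newly extreme column.

```python
import numpy as np

def trim_extremes(marks: np.ndarray) -> np.ndarray:
    """Drop all-right or all-wrong rows (students) and columns (items)."""
    changed = True
    while changed:  # repeat: trimming rows can create newly extreme columns
        keep_rows = (marks.sum(axis=1) > 0) & (marks.sum(axis=1) < marks.shape[1])
        marks = marks[keep_rows]
        keep_cols = (marks.sum(axis=0) > 0) & (marks.sum(axis=0) < marks.shape[0])
        marks = marks[:, keep_cols]
        changed = not (keep_rows.all() and keep_cols.all())
    return marks
```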
Statistic One: Right mark counts yield student scores (rows) and item difficulties (columns).
The value of each student score mark (1 or 0) is not affected by item
difficulty or the level of thinking used in making the mark. The value of each
item difficulty mark, or item score (1 or 0), is not affected by student score or
student ability. A right mark is a right mark (1). “The more right marks you
get, the better” is meaningful to everyone using traditional multiple-choice
(TMC).
[The above remarks are prompted by my audit of the Rasch
IRT model. The claim
(see Number of IRT Parameters) is made that, under the one-parameter IRT model,
student abilities are independent of item difficulties and item difficulties are
independent of student abilities. I am willing to believe that
theory, but I have yet to see it; I do not know or understand it based only on how
estimates of student ability and item difficulty are made.]
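For context, the one-parameter (Rasch) model referred to above reduces to a single formula: the probability of a right mark depends only on the difference between a student's ability and an item's difficulty, both on the same logit scale. A minimal illustration (the formula is standard; the variable names are mine):

```python
import math

def rasch_p(ability: float, difficulty: float) -> float:
    """Probability of a right mark under the one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(rasch_p(0.0, 0.0))  # 0.5: when ability equals difficulty, a coin flip
```

It is this single-difference form that underlies the independence claim; whether estimates made from real mark tables live up to it is the question the audit pursues.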
Counts are typically listed in a mark, or item score, table.
Student scores are entered at the ends of rows. Item difficulties are listed at
the bottoms of columns. This looks very clean and simple (1 and 0),
especially when compared with what is actually being measured. A mark of 1
or 0 may result from many factors related to the item, or to the
student, or to factors indirectly related to the test environment (race, religion, parenting, etc.).
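A toy table (four hypothetical students, three hypothetical items) makes that layout concrete:

```
            Item 1  Item 2  Item 3 | Score
Student A      1       1       0   |   2
Student B      1       0       0   |   1
Student C      1       1       1   |   3
Student D      0       1       0   |   1
-----------------------------------------
Right marks    3       3       1   |
```

Row sums give the student scores; column sums (or column means, as percents) give the item difficulties.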
A good analogy is a test plot of corn kernels from several
ears of corn (rows) planted in several types of soil (columns). The scoring is
based on the seedlings. Several factors can be scored: color; development of
leaf, stem, and roots; size of plant, stem, and root; sturdiness; and so on. But
in education, with traditional multiple-choice (TMC), there would be only two
scores: 1 for a seedling, and 0 for none. A 1 would be recorded for both a corn
seedling and a weed seedling. A weed corresponds to good luck in marking a
right answer. All the other factors that influence student marks are ignored.
Even in Table 2, all right answers have been replaced with a
single symbol to make the chart easier to view. That symbol will become a 1
using TMC. Each wrong mark, regardless of the answer option, will become a
0.
But one factor, other than right/wrong, can be obtained
directly from the answer sheets. That factor is student judgment. Student
judgment is as important as knowing and doing in moving students from lower to
higher levels of thinking. The CCSS movement demands the development of student
judgment.
Counting right marks is simple. However, each mark does not
report exactly the same thing. Forcing students to mark “the best answer” and
counting right marks produces a quantitative score locked to a qualitative
score (that is why only one score is reported using TMC: the two scores are
identical). That deficiency is easily corrected by the Rasch IRT partial credit model
(PCM) or by Knowledge and Judgment Scoring
(KJS).
KJS yields independent scores of quantity (1 or 0) and of
quality (a score of the student’s judgment in reporting what is actually known or
can be done, which is the basis for further learning and instruction). Weeds can
be differentiated from corn.
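This post does not give the KJS point scheme itself, so the following is only a hypothetical illustration of how two independent scores can fall out of one answer sheet, assuming students may omit an item rather than guess:

```python
def kjs_scores(marks):
    """marks: 1 = right, 0 = wrong, None = omitted by the student's judgment."""
    attempted = [m for m in marks if m is not None]
    quantity = sum(attempted)  # right marks: what the student reported knowing
    # quality: how well judgment matched performance on the items actually marked
    quality = quantity / len(attempted) if attempted else 0.0
    return quantity, quality

print(kjs_scores([1, 1, None, 0, 1, None]))  # (3, 0.75): 3 right, 75% accurate
```

Under this illustrative scheme, a student who marks only what he or she trusts, and is right, earns a high quality score even with a modest quantity score, while a guesser’s quality score collapses toward chance.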
With KJS both teachers and high quality students know what
is known and can be done during the
test as well as afterwards. By scoring for knowledge and judgment (quantity and
quality) we can reduce the weeds in the corn. We can identify and correct
misconceptions. Instruction can be more effective.
The most important thing that can be said at this point is
that what you count, and how you count, determines the value of everything that
follows. TMC, with right-mark
scoring, extracts the least amount of information, with the least value, from a
multiple-choice test. You get the least return for the time and money invested:
a ranking.
Tradition seems to be the main reason TMC is still used. KJS and the PCM both shift the
responsibility for learning and reporting from the teacher to the student. This
shift is now a key element in the Common Core State Standards (CCSS) movement.
It promotes the change from a classroom of passive followers to an active
classroom of self-correcting, high-quality, successful achievers. Assessing judgment may now
become acceptable, and even required, when using multiple-choice tests (as it already is in
most other assessments).
Students like to be free to report what they trust they know
and can do. But this must be experienced to be understood, appreciated, and accepted.
After two tests, over 90% of my 3,000 students switched from guessing at
answers on a multiple-choice test to using it to report what they trusted they
knew or could do. Teachers also need to experience it before they understand (scoring
judgment with multiple-choice tests is still a new professional development
topic).
The CCSS movement demands doing, not talking and listening.
To make the most of this series of posts, download Break Out. (It
is entirely free, open-source code.) Use it to help break out of an antiquated,
failing tradition that emphasizes one right answer instead of the CCSS requirement of developing the
ability and mindset to apply what is known to a range of questions or tasks.