Adding 22 balanced items to the 21-item Table 33, in the
prior post, resulted in a similar average test score (Table 36) and the same
item information functions (the added items were duplicates of those in the
first Nurse124 data set of 21 items). What happens if an unbalanced set of 6
items is added instead? I just deleted the 16 high-scoring additions from Table 36.
Both the balanced additions (Table 36) and the unbalanced additions (Table 39) had the
same extended range of item difficulties (5 to 21 right marks, or 23% to 95%
difficulty).
Table 33 | Table 36
Adding a balanced set of items to the Nurse124 data set kept
the average score about the same: 80% and 79% (Table 36). Adding a set of more
difficult items to the Nurse124 data decreased the average score to 70% (Table
39) and decreased individual student scores. Traditionally, a student’s overall score is
then the average of the three test scores: 80%, 79%, and 70%, or 76% for an
average student (Tables 33, 36, and 39). An estimate of a student’s “ability”
is thus directly dependent upon his test scores, which are dependent upon the
difficulty of the items on each test. This score is accepted as a best estimate
of the student’s true score, and that value serves as a best guess of future test scores.
This makes common sense: past performance is a predictor of future performance.
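A minimal sketch of the traditional calculation, using the scores from Tables 33, 36, and 39 (Python is used here only for illustration; the blog’s own tools are Excel-based):

```python
# Traditional scoring: the overall "ability" estimate is just the average
# of the raw test scores, so it rises and falls with the difficulty of the
# items on each test.  The scores below are from Tables 33, 36, and 39.
scores = {"Table 33": 0.80, "Table 36": 0.79, "Table 39": 0.70}

overall = sum(scores.values()) / len(scores)
print(f"Overall score: {overall:.0%}")   # about 76%
```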
[Again, a
distinction must be made between what is being measured by right mark scoring
(0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a
test the student may not be able to read or understand. The other also
indicates the quality of each student’s knowledge: the ability to make meaningful
use of knowledge and skills. Both methods of analysis can use the exact same
tests. I continue to wonder why people are still paying full price but harvesting
only a portion of the results.]
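As a rough illustration of the difference, here is a minimal sketch that scores the same answer sheet both ways. The 2/1/0 point weights for right/omit/wrong are an assumption made for this sketch; the actual knowledge and judgment scoring weights used by PUP may differ.

```python
# Right-mark scoring (0,1): only right marks count, yielding a rank.
# Knowledge and judgment scoring (0,1,2): a student who omits an item
# (reports "I have yet to learn this") earns judgment credit.
# The 2/1/0 weights below are an assumption for illustration only.

def right_mark_score(marks):
    # marks: "R" = right, "W" = wrong, "O" = omitted
    return sum(1 for m in marks if m == "R") / len(marks)

def kjs_score(marks):
    points = {"R": 2, "O": 1, "W": 0}
    return sum(points[m] for m in marks) / (2 * len(marks))

answer_sheet = list("RRRRRROOWW")      # the exact same test, scored two ways
print(right_mark_score(answer_sheet))  # 0.6 -- a rank only
print(kjs_score(answer_sheet))         # 0.7 -- rank plus judgment credit
```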
The Rasch model IRT takes a very different route to
“ability”. The very same student mark data sets can be used. Expected IRT student
scores are based on the probability that half of all students at a given
ability location will correctly mark a question at a comparable difficulty
location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.) [The location starts from the natural
log of a ratio: right/wrong for the score and wrong/right for the difficulty. A convergence
of score and difficulty yields the final location. The 50% test score becomes
the zero logit location, the only point where right mark scoring and IRT scores are
in full agreement.]
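A minimal sketch of those starting locations and the resulting probability, using the standard Rasch conventions (not the full Winsteps convergence procedure); the example counts are made up for illustration:

```python
import math

def ability_logit(right, n_items):
    # Initial person location: natural log of right/wrong.
    return math.log(right / (n_items - right))

def difficulty_logit(right_marks, n_students):
    # Initial item location: natural log of wrong/right.
    return math.log((n_students - right_marks) / right_marks)

def p_right(ability, difficulty):
    # Rasch probability of a right mark on the single logit scale.
    return 1 / (1 + math.exp(-(ability - difficulty)))

b = ability_logit(16, 21)      # hypothetical: 16 of 21 items right
d = difficulty_logit(11, 22)   # hypothetical: 11 of 22 students right
print(p_right(b, d))           # about 0.76

# A 50% score (right = wrong) gives ln(1) = 0: the zero logit location,
# where ability equals difficulty and p_right is exactly 0.5.
```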
The Rasch model IRT converts student scores and item
difficulties [in the marginal cells of student data] into the probabilities of
a right answer (Table 33b). [The probabilities replace the marks in the central
cell field of student data.] It also yields expected raw student scores and their conditional
standard errors of measurement (CSEMs) (Tables 33c, 36c, and 39c) based on the probabilities of a right answer rather
than the count of right marks. (For
more see my Rasch Model Audit blog.)
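A minimal sketch of how an expected score and its CSEM can be built from those probabilities rather than from the count of right marks; the item difficulties below are hypothetical, and the tables’ actual values come from Winsteps, not from this sketch:

```python
import math

def p_right(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

def expected_score_and_csem(ability, item_difficulties):
    probs = [p_right(ability, d) for d in item_difficulties]
    expected = sum(probs)                        # expected raw score
    variance = sum(p * (1 - p) for p in probs)   # sum of binomial variances
    return expected, math.sqrt(variance)         # CSEM in raw-score units

# Hypothetical: a student at 1.0 logits on a seven-item calibration.
difficulties = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
score, csem = expected_score_and_csem(1.0, difficulties)
print(round(score, 2), round(csem, 2))
```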
Student ability becomes fixed and separated from the student
test score; a student with a given ability can obtain a range of scores on
future tests without affecting his ability location. A calibrated item can yield a range
of difficulties on future tests without affecting its calibrated difficulty location. This makes
sense only in relation to the trust you can have in the person interpreting IRT
results; that person’s skill, knowledge, and (most important) experience at all
levels of assessment: student performance expectations, test blueprint, and politics.
In practice, student data that do not fit well (do not “look
right”) can be eliminated from the data set. Also, the same data set (Table 33,
Table 36, and Table 39) can be treated differently if it is classified as a field
test, operational test, benchmark test, or current test.
At this point, states recalibrated and creatively
equilibrated test results to optimize federal dollars during the NCLB era by
showing gradual, continuing improvement. It is time to end the ranking of students by right mark
scoring (0,1 scoring) and include KJS
or PCM (0,1,2 scoring) [which about every state education department already has:
Winsteps], so that standardized testing yields the results needed to guide
student development: the main goal of the CCSS movement.
The need to equilibrate a test is an admission of failure.
The practice has become “normal” because failure is so common. It opened the
door to cheating at state and national levels. [To my knowledge no one has been
charged and convicted of a crime for this cheating.] Current computer adaptive
testing (CAT) hovers around the 50% level of difficulty. This optimizes the
psychometric tools (see the sketch below). Having a disinterested party outside of the educational
community do the assessment analysis, and delivering CAT online,
reduce the opportunity to cheat. They do not, IMHO, optimize the usefulness of
the test results. End-of-course tests are now molding standardized testing into
an instrument to evaluate teacher effectiveness rather than to assess student
knowledge and judgment (student development).
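The 50% target is not arbitrary: under the Rasch model the information an item adds is p(1 - p), which peaks when the probability of a right answer is 0.5, that is, when item difficulty matches student ability. A minimal sketch of that selection logic (a hypothetical rule, not any vendor’s actual CAT algorithm):

```python
# Item information under the Rasch model is p * (1 - p); it is largest
# when p = 0.5, so a CAT keeps picking items whose difficulty is closest
# to the student's current ability estimate.
def item_information(p):
    return p * (1 - p)

for p in (0.9, 0.7, 0.5, 0.3, 0.1):
    print(f"p = {p:.1f}  information = {item_information(p):.2f}")
# p = 0.5 gives the maximum information (0.25).
```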
- - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - -
The Best of the Blog - FREE
The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.
This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.
Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice testing to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.