(Continued from the prior two posts.)
The past two posts have established that there is little
difference between classical test theory (CTT) and item response theory (IRT)
with respect to test reliability and conditional standard error of measurement (CSEM)
estimates (other than the change in scales). IRT is now the analysis of choice
for standardized tests. The Rasch model is the easiest IRT model to use and also works
well with small data sets, including classroom tests. How the two normal scales for
student scores and item difficulties are combined onto one IRT logit scale is
no longer a concern to me, other than that the same method must be used throughout the
duration of an assessment program.
Table 33
Chart 75
Chart 76
The item with an 18-right count (92% right difficulty, Table 35) did not
contribute the most information individually, but it did as part of a group of three items
(9.17). The three highest scoring,
easiest, items (right counts of 19, 20, and 21) are just too easy for a standardized
test, but they may be important survey items needed to verify knowledge and skills for
this class of high-performing students. None of these three items reached the maximum
information level of 1/4. [It now becomes apparent how items can be selected to produce a
desired test information function.]
More information is interpreted as greater
precision, or less error (a smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q))
in Table 33c. p*q is at a maximum when p = q, that is, when right = wrong: (RT x WG)/(RT
+ WG)^2 = (3 x 3)/36 = 1/4.]
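A minimal Python sketch of that CSEM calculation may help; it is not the VESEngine, and the p-values below are hypothetical item difficulties for one ability level, not the Nurse124 data. It shows item information as p*q and CSEM as one over the square root of the summed information.

```python
import math

def csem(p_values):
    """CSEM from item p-values: each item contributes information p*q
    (q = 1 - p); test information is the sum; CSEM = 1/sqrt(information)."""
    info = sum(p * (1.0 - p) for p in p_values)
    return 1.0 / math.sqrt(info)

# p*q peaks at 1/4 when p = q = 0.5 (right = wrong):
# (RT x WG)/(RT + WG)^2 = (3 x 3)/36 = 1/4
print(0.5 * 0.5)          # 0.25, the maximum item information

# Hypothetical 21-item test with mixed difficulties:
p = [0.5] * 7 + [0.7] * 7 + [0.92] * 7
print(round(csem(p), 3))  # CSEM shrinks as total information grows
```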
Each item information function spans the range of student
scores on the test (Chart 76). Each item information function measures student
ability most precisely near the point where item difficulty and student ability
match (50% right) along the IRT S-curve. [The more difficult an item, the more
ability students must have to mark it correctly 50% of the time. Student ability
is the number correct on the S-curve. Item difficulty is the number wrong on the
S-curve (see more at Rasch Model
Audit).]
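A minimal sketch of an item information function, assuming the dichotomous Rasch model with P(theta) = 1/(1 + exp(-(theta - b))), where b is the item difficulty in logits; the difficulty used below is hypothetical, not a value from Table 33. Information is p*q, and it peaks at 1/4 where student ability matches item difficulty (50% right).

```python
import math

def rasch_p(theta, b):
    """Probability of a right answer for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Item information p*q at ability theta."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

# Information across a span of abilities for an item of difficulty b = 1.0:
for theta in [-2, -1, 0, 1, 2, 3]:
    print(theta, round(item_information(theta, 1.0), 3))
# The values peak at 0.25 when theta = 1.0, where ability equals difficulty.
```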
Extracting item information functions from a data table
provides psychometricians with a powerful tool (a test information function) for
customizing a test (page 127, Maryland
2010). A test can be adjusted for maximum precision (minimum
CSEM) at a desired cut point.
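As one possible illustration of that adjustment, the sketch below sums the Rasch item information functions from above into a test information function and picks the items that contribute the most information at a chosen cut point. The item bank difficulties and the cut score (in logits) are hypothetical, not values from the Maryland 2010 report.

```python
import math

def rasch_p(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def test_information(theta, difficulties):
    """Sum of item informations p*q at ability theta."""
    return sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in difficulties)

def csem(theta, difficulties):
    return 1.0 / math.sqrt(test_information(theta, difficulties))

item_bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.3, 0.5, 0.8, 1.0, 1.5, 2.0]
cut_point = 0.5  # desired cut score in logits (hypothetical)

# Pick the 6 items whose difficulties sit closest to the cut point,
# which maximizes test information (and minimizes CSEM) there.
chosen = sorted(item_bank, key=lambda b: abs(b - cut_point))[:6]
print("chosen difficulties:", chosen)
print("CSEM at cut point:", round(csem(cut_point, chosen), 3))
```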
The bright side of this is that the concept of “information”
(not applicable to CTT), and the ability to put student ability and item
difficulty on one scale, give psychometricians powerful tools. The dark side
is that the form in which the test data is obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the
NCLB era, as psychometrics has made marked improvements, the student mark data
it is supplied has remained in the casino arena: mark an answer to each
question (even if you cannot read or understand the question), do not guess,
and hope for good luck on test day.
The concepts of information, item discrimination, and computer-adaptive testing (CAT) all demand values hovering about the 50% point for peak
psychometric performance. Standardized testing has migrated away from letting
students report what they know and can do, toward a lottery that compares their
performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day.
The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the fewer the students who fall just below that point on the score distribution who might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without ever showing the actual test questions or the test blueprint.]
It does not have to be that way. Knowledge and Judgment Scoring (classroom
friendly) and the partial
credit Rasch model (which is included in the software states use) can both
update traditional multiple-choice to the levels of thinking required by the
Common Core State Standards (CCSS) movement. We need an accurate, honest, and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT.
- - - - - - - - - - - - - - - - - - - - -
The Best of the Blog - FREE
The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.
This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.
Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (Knowledge and Judgment Scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and Judgment Scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.