(Continued from prior post.)
Table 32a contains two estimates (red) of the test standard
error of measurement (SEM) that are in full agreement. One estimate, 1.75, is the average
of the conditional standard errors of measurement (CSEM, green) for each
student raw score. The traditional estimate, 1.74, uses the traditional test
reliability, KR20. No problem here.
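A minimal sketch, in Python, of how the two estimates line up, using only the numbers quoted in this post; the KR20 value is backed out from the quoted SEM and SD rather than taken from the Nurse124 data:

```python
# Sketch: the two classical estimates of the test SEM quoted above.
# All numbers come from this post (Table 32a); nothing here is new data.
import math

sd = 2.07                        # test standard deviation, in counts
kr20 = 1 - (1.74 / 2.07) ** 2    # reliability implied by SEM = SD*sqrt(1-KR20)

sem_traditional = sd * math.sqrt(1 - kr20)   # reproduces the 1.74 counts
print(f"KR20 implied by the quoted numbers: {kr20:.2f}")        # ~0.29
print(f"Traditional SEM = SD*sqrt(1-KR20) = {sem_traditional:.2f}")
```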
The third estimate of the test SEM, in Table 32c, is different.
It is based on CSEM values expressed in logits (natural log units, base e ≈ 2.718) rather
than on the normal scale. The values are also inverted in relation to the
traditional values in Table 32 (Chart 74). There is a small but important
difference: the IRT CSEM values are much more linear than the CTT CSEM values.
Also, the center of this plot is the mean of the number of items (Chart 30,
prior post), not the mean of the item difficulties or student scores. [Also,
most of this chart was calculated, as most of these relationships do not require
actual data to be charted. Only nine score levels came from the Nurse124 data.]
Chart 74 shows the binomial CSEM values for CTT (normal) and
for IRT (logit), the latter obtained by inverting the CTT values: “SEM(Rasch Measure in
logits) = 1/(SEM(Raw Score))” (2007).
I then adjusted each of these so the corresponding curves, on the same scale,
crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The
extreme values for no right and all right were not included; CSEM values for
extreme scores go to zero (CTT) or to infinity (IRT), with the following result:
“An apparent paradox is that extreme scores have perfect
precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm
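A minimal sketch of both scales, assuming one common form of the binomial CSEM (Lord's error model, sqrt(x(n−x)/(n−1)) for raw score x on n items) and the 21-item length cited in this post; the chosen score values are illustrative:

```python
# Sketch: binomial CSEM on the raw-score (CTT) scale and its logit (IRT)
# counterpart by inversion, as in Chart 74.
import math

n = 21  # items on the test (from this post)

def csem_raw(x, n):
    """Binomial CSEM in counts for raw score x (Lord's error model)."""
    return math.sqrt(x * (n - x) / (n - 1))

for x in [1, 5, 10, 16, 20]:
    ctt = csem_raw(x, n)
    irt = 1 / ctt              # "SEM(logits) = 1/(SEM(raw score))"
    print(f"score {x:2d}: CTT CSEM = {ctt:.2f} counts, IRT CSEM = {irt:.2f} logits")

# At the extremes the paradox in the quote appears: x = 0 or x = n gives
# CTT CSEM = 0 (perfect precision in counts) while the IRT CSEM, 1/0,
# goes to infinity (perfect imprecision in logits).
```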
Precision, then, is not constant across the range of student
scores for either method of analysis. The test SEM of 0.64 logits is comparable
to 1.74 counts on the normal scale.
The estimate of precision, CSEM, serves three different purposes. For CTT and IRT it narrows down the range
in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values
estimates the test SEM as 1.75 counts on a test of 21 items. This is less
than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and
useful.
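As a quick worked example of purpose (1), a score bracketed by the test-average CSEM; the score of 17 is hypothetical, while the 1.75 counts is the average quoted in this post:

```python
# Sketch: using the CSEM to bracket an observed raw score.
score, csem = 17, 1.75   # score is hypothetical; CSEM is the test average
print(f"68% band: {score - csem:.2f} to {score + csem:.2f} counts")
print(f"95% band: {score - 1.96 * csem:.2f} to {score + 1.96 * csem:.2f} counts")
```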
For IRT analysis, the CSEM indicates the degree to which the data
fit the perfect Rasch model (3). A
better fit also results in more believable and useful results.
“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation
of an imagined error distribution representing the possible distribution of
observed values around their “true” theoretical value. This precision is based
on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the
measures or estimates correspond to a reference standard outside the data, in
this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm
Precision also has some very practical limitations when
delivering tests by computer adaptive testing (CAT). Linacre (2006) has prepared two very
neat tables showing the number of items that must be on a test to obtain a
desired degree of precision, expressed in logits and in confidence limits. The
closer the test “targets” an average score of 50%, the fewer items are needed for a
desired precision.
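Linacre's tables are not reproduced here, but a sketch of the underlying arithmetic shows why targeting matters, assuming the standard Rasch result that an item's information is p(1−p); the target SE values below are illustrative, not Linacre's published figures:

```python
# Sketch: items needed for a target precision (SE in logits), and the
# penalty for off-target items, under the Rasch item information p*(1-p).
import math

def items_needed(target_se, offset=0.0):
    """Items required for SE(measure) <= target_se when every item sits
    `offset` logits from the student (offset 0 = targeted at a 50% score)."""
    p = 1 / (1 + math.exp(-offset))   # expected proportion correct per item
    info_per_item = p * (1 - p)       # Rasch item information
    return math.ceil(1 / (info_per_item * target_se ** 2))

for se in (1.0, 0.5, 0.25):
    print(f"SE = {se:>4} logits: {items_needed(se):3d} items on target, "
          f"{items_needed(se, offset=2.0):3d} items if 2 logits off target")
```

On target, this reduces to the familiar rule of thumb of about 4/SE² items (16 items for a 0.5-logit SE); two logits off target, more than twice as many are needed.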
The two top students, with the same score of 20, missed items with different
difficulties, yet they both yield the same CSEM. The CSEM ignores the pattern of
marks and the difficulty of the items; a CSEM value obtained in this manner is
related only to the raw score. Absolute values for the CSEM are sensitive to
item difficulty (Tables 23a and 23b).
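A sketch of why the raw score alone drives the Rasch CSEM: the ability estimate solves sum(p_i) = raw score, so which items were missed never enters. The item difficulties below are made up for illustration; they are not the Nurse124 values:

```python
# Sketch: in the Rasch model the raw score is sufficient -- two students who
# both score 20 of 21 get the same ability estimate and the same CSEM no
# matter which item each one missed.
import math

difficulties = [-2 + 0.2 * i for i in range(21)]   # hypothetical, in logits

def p_correct(theta, b):
    return 1 / (1 + math.exp(-(theta - b)))

def ability_and_csem(raw_score, bs, lo=-6.0, hi=6.0):
    """Solve sum(p_i) = raw_score for theta by bisection, then
    CSEM = 1/sqrt(information) = 1/sqrt(sum p_i*(1 - p_i))."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if sum(p_correct(mid, b) for b in bs) < raw_score:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2
    info = sum(p_correct(theta, b) * (1 - p_correct(theta, b)) for b in bs)
    return theta, 1 / math.sqrt(info)

theta, csem = ability_and_csem(20, difficulties)
print(f"score 20: theta = {theta:.2f} logits, CSEM = {csem:.2f} logits")
# The estimate never asked WHICH item was missed -- only the count of 20.
```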
The precision of a cut score has received increasing
attention during the NCLB era. In part, court actions have made the work of
psychometricians more transparent. The technical report for a standardized test
can now exceed 100 pages. There has been a shift of emphasis from test SEM, to
individual score CSEM, to IRT information
as an explanation of test precision.
“(Note that the
test information function and the
raw score error variance at a given level of proficiency [student score], are analogous for the Rasch
model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The
conditional standard error of measurement is the inverse of the information
function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.
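The two quotes reduce to one line of algebra. For the Rasch model the test information at ability theta is the sum of p_i(1−p_i), which is also the raw-score error variance at that ability; the logit CSEM is then 1/sqrt(information) (the Maryland quote's "inverse" glosses over the square root). A sketch with hypothetical difficulties:

```python
# Sketch: test information, raw-score error variance, and CSEM in one place.
import math

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical, in logits
theta = 0.8                                  # hypothetical ability

ps = [1 / (1 + math.exp(-(theta - b))) for b in difficulties]
information = sum(p * (1 - p) for p in ps)   # also raw-score error variance

print(f"test information I(theta) = {information:.3f}")
print(f"raw-score CSEM = sqrt(I)  = {math.sqrt(information):.3f} counts")
print(f"logit CSEM     = 1/sqrt(I) = {1 / math.sqrt(information):.3f} logits")
```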
I cannot end this without repeating that this discussion of
precision is based on traditional multiple-choice (TMC) tests that only rank
students, a casino operation. Students are not given the opportunity to include
their judgment of what they know or can do that is of value to themselves, and
their teachers, in future learning and instruction, as is done with essays,
problem solving, and projects. This is easily done with knowledge and judgment
scoring (KJS) of multiple-choice tests.
(Continued)
- - - - - - - - - - - - - - - - - - - - -
Table26.xlsm is now available free by request.
The Best of the Blog - FREE
The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.
This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.
Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.