The test standard error of measurement (SEM) can be
calculated in two ways. The traditional way relates the variance between
student scores to the variance within item difficulties; that is, it relates an external
score column to the internal cell columns of the spreadsheet.
The second way harvests the variance conditioned on each student
score and then averages the CSEM (the SQRT of the conditional student score error variance)
over the test. The first method links two properties: student ability and item
difficulty. The second uses only one property: student ability.
I set up a model with 12 students and 11 items (see the previous
post and Table26.xlsm below). Extreme scores of zero and 100% were excluded.
Four samples with average test scores of 5, 6, 7 (Table 29), and 8 were created
with the standard deviation of student scores (1.83) and the variance within item difficulties (1.83)
held constant. This allowed the SEM to vary between methods.
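For anyone who wants to follow the arithmetic outside the spreadsheet, here is a minimal Python sketch of this setup. The 0/1 mark matrix below is a made-up stand-in for the actual patterns in Table26.xlsm, so the printed values will not match the post's 1.83 figures:

```python
import numpy as np

# Hypothetical 12-student x 11-item mark matrix (a stand-in for Table26.xlsm).
marks = np.random.default_rng(1).integers(0, 2, size=(12, 11))

scores = marks.sum(axis=1)                    # student right counts (0 and 11 would be excluded)
p = marks.mean(axis=0)                        # item difficulties (proportion right per item)

test_variance = ((scores - scores.mean()) ** 2).mean()   # MSS over student scores
test_sd = test_variance ** 0.5                # held at 1.83 in the post's four samples
within_item_variance = (p * (1 - p)).sum()    # sum of item p*q, held at 1.83 in the post
print(test_sd, within_item_variance)
```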
The calculation of the test SEM (1.36) by way of reliability
(KR20) is reviewed on the top level of Chart 73. The test SEM remained the same
for all four tests.
My first calculation of the test SEM by way of the conditional
standard error of measurement (CSEM) began with the deviation of each mark from
the student score (Table 29, center). I squared the deviations and summed them to get
the conditional error variance for each score. The individual student CSEM is
the square root of that conditional variance (the conditional SD). The test SEM
(1.48) is then the average of the student CSEM values.
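A short Python sketch of this conditional route, again on a hypothetical mark matrix rather than the Table 29 data (each row's conditional variance works out to n*p*q, where p is that student's proportion right):

```python
import numpy as np

marks = np.random.default_rng(1).integers(0, 2, size=(12, 11))   # hypothetical 0/1 marks

p = marks.mean(axis=1, keepdims=True)     # each student's score as a proportion right
deviations = marks - p                    # deviation of each mark (0 or 1) from that score
cvar = (deviations ** 2).sum(axis=1)      # conditional error variance per student (= n*p*q)
csem = np.sqrt(cvar)                      # conditional SEM per student
test_sem = csem.mean()                    # the post reports 1.48 for its samples
print(test_sem)
```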
[My second calculation was based on the binomial standard
error of measurement given in Crocker, Linda, and James Algina, 1986,
Introduction to Classical & Modern Test Theory, Wadsworth Group, pages
124-127.
By including the “correction for obtaining unbiased
estimates of population variance,” n/(n – 1), the SEM value increased from
1.48 to 1.55 (Table 29). This is a perfect match to the binomial SEM.]
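The arithmetic behind the match: multiplying each conditional variance by n/(n – 1) = 11/10 multiplies each CSEM, and therefore their average, by SQRT(11/10), about 1.05, and 1.48 x 1.05 is about 1.55.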
The two SEMs are thus based on different sample sizes and
different assumptions. The traditional SEM (1.36) is based on the raggedly
distributed small sample in hand. The binomial SEM (1.55) assumes a
perfectly normally distributed, large, theoretical population.
[Variance calculations (variance is additive):
- Test variance: Score deviations from the test mean (as counts), squared and summed = a sum of squares (SS). SS/N = MSS or variance: 3.33. {Test SD = SQRT(Var) = 1.83. Test SEM = 1.36.}
- Conditional error variance: Deviations of each mark from the student score (as a percent), squared and summed = the conditional error variance (CVar) for that student score. {Test SEM = average SQRT(CVar) = 1.48 (n) and 1.55 (n – 1).}
- Conditional error variance: Variance within the score row (Excel VAR.P or VAR.S) x n = the CVar for that student score; see the sketch below. {Test SEM: VAR.P gives 1.48 and VAR.S gives 1.55.}]
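The row-variance shortcut for a single hypothetical student row (the marks below are illustrative, not taken from Table 29):

```python
import numpy as np

row = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])   # hypothetical marks: X = 7 right of n = 11
n = row.size

var_p = row.var(ddof=0)       # Excel VAR.P: population variance within the row
var_s = row.var(ddof=1)       # Excel VAR.S: sample variance within the row

cvar_n  = var_p * n           # n*p*q             (the 1.48 route)
cvar_n1 = var_s * n           # X*(n - X)/(n - 1) (the 1.55 route)
print(cvar_n, 7 * 4 / 11)     # both 2.5454...
print(cvar_n1, 7 * 4 / 10)    # both 2.8
```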
Squaring values produces curved distributions (Chart 73). The
curves represent the possible values. They do not represent the
number of items or student scores having those values.
The True MSS = Total MSS – Error MSS = 3.33 – 1.83 = 1.50. This
involves subtracting a concave distribution that peaks at the
maximum value of 0.25 (not at the average item difficulty) from a
convex distribution centered on the average test score.
The student score MSS is at a maximum when the item error SS
is at a minimum. The error MSS is at a maximum (0.25) when the student score
MSS is at a minimum (0.00). This makes sense: the error maximum falls where an
item is perfectly aligned with the student score distribution, at the point
where a score does not differ from the average test score.
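A short sketch of the two curves, with an assumed mean score of 6 used only for illustration:

```python
import numpy as np

mean_score = 6                            # assumed average test score for illustration
scores = np.arange(0, 12)                 # possible scores on an 11-item test
convex = (scores - mean_score) ** 2       # squared score deviations: convex, minimum 0 at the mean

p = np.linspace(0, 1, 101)                # possible item difficulties
concave = p * (1 - p)                     # item error variance: concave, maximum 0.25 at p = 0.5
```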
The KR20 is then a ratio of the True MSS/Total MSS,
1.50/3.33 = 0.45. [KR20 ranges from 0 to 1, not reproducible to fully
reproducible.] The test SEM is then a portion, SQRT(1 – KR20), of the SD [also
1.83 in this example, SQRT(3.33)]: SQRT(1 – 0.45) * 1.83 = 1.36.
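The same chain in code, using the post's rounded figures (the rounded inputs land a touch below the 1.36 that the workbook's unrounded values give):

```python
total_mss = 3.33                      # test variance (MSS) over student scores
error_mss = 1.83                      # variance within item difficulties (sum of item p*q)

true_mss = total_mss - error_mss      # 1.50
kr20 = true_mss / total_mss           # 0.45
sd = total_mss ** 0.5                 # 1.83
sem = (1 - kr20) ** 0.5 * sd          # about 1.35 here; 1.36 with unrounded values
print(kr20, sem)
```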
I was able to set the test SEM estimate from the KR20 to 1.36
for all four tests by holding the SD of student scores and the item error MSS
at constant values, switching a 0 and 1 pair in the student mark patterns as
needed. [The SD and the item error MSS do not have to be the same value.]
All possible individual student score binomial CSEM
values for a test with 11 items are listed in Table 30. The CSEM is given as
the SQRT(conditional variance). The conditional variance is (X * (n – X))/(n –
1), or n*p*q * (n/(n – 1)). There is then no need to administer a test to calculate a student score binomial CSEM
value. There is a need to administer a test to find the test SEM. The test SEM
(Table 29) is the average of these values over the students tested, 1.55.
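A sketch that generates the Table 30 listing from the formula above (scores of 0 and 11 are the excluded extremes):

```python
# Binomial CSEM for every possible score on an 11-item test (the Table 30 listing).
n = 11
for X in range(n + 1):
    csem = (X * (n - X) / (n - 1)) ** 0.5
    print(X, round(csem, 2))
```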
The student CSEM and thus the test SEM values are derived
only from student mark patterns. They differ from the test SEM values derived
from the KR20 (Table 31). With the KR20-derived values held constant, the
binomial-CSEM-derived SEM values decreased with higher test scores. This makes
sense: there is less room for chance events, so precision increases with higher
test scores.
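A rough check of that trend, using two hypothetical score sets rather than the post's four samples:

```python
n = 11

def binomial_test_sem(scores):
    # Average of the binomial CSEMs over a set of student scores.
    return sum((X * (n - X) / (n - 1)) ** 0.5 for X in scores) / len(scores)

low  = [3, 4, 5, 5, 6, 6, 7]     # hypothetical score set centered near 5
high = [6, 7, 8, 8, 9, 9, 10]    # hypothetical score set centered near 8
print(binomial_test_sem(low), binomial_test_sem(high))   # about 1.69 vs 1.46
```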
Given a choice, a testing company would select the KR20
method, using CTT analysis, to report test SEM results; it yields the smaller
value (1.36 rather than 1.55 in this example).
[The same SEM values for the tests with 5 right and 6 right
result from the fact that the midpoint score for 11 items is 5.5. Scores of 5
and 6 fall an equal distance from this midpoint on either side, so X * (n – X),
and therefore the CSEM, is the same for both: 5 x 6 = 6 x 5 = 30.]
I positioned the green curve on Chart 73 using the above
information.
A CSEM value is independent of the average test score and
item difficulties. (Swapping paired 0s and 1s in student mark patterns to
adjust the item error variance made no difference in the CSEM values.) The
average of the CSEM values, the test SEM, does depend on the number of students
earning each score value. If all scores are the same, the CSEMs and the SEM
will be the same (Tables 30 and 31).
I hope at this stage to have a visual mathematical model
that is robust enough to make meaningful comparisons with the Rasch IRT model. I
would like to return to this model and do two things (or have someone volunteer
to do it):
- Combine all the features that have been teased out, in Chart 72 and Chart 73, into one model.
- Animate the model in a meaningful way with change gages and history graphs.
Now to return to the Nursing data that represent the real
classroom, filled with successful instruction, learning, and assessment.
- - - - - - - - - - - - - - - - - - - - -
Table26.xlsm is now available free by request. (Files hosted at nine-patch.com are also being relocated now that Nine-Patch Multiple-Choice, Inc. has been dissolved.)
The Best of the Blog - FREE
The Visual Education Statistics Engine (VESEngine) presents
the common education statistics on one traditional two-dimensional Excel
spreadsheet. The post includes definitions. Download
as .xlsm or .xls.
This blog started five years ago. It has meandered through
several views. The current project is visualizing
the VESEngine in three dimensions. The observed student mark patterns (on their
answer sheets) are on one level. The variation in the mark patterns (variance)
is on the second level.
Power Up Plus (PUP) is classroom-friendly software used to
score and analyze what students guess (traditional multiple-choice) and what
they report as the basis for further learning and instruction (knowledge and
judgment scoring multiple-choice). It is a quick way to update your
multiple-choice testing to meet Common Core State Standards (promoting understanding as
well as rote memory). Knowledge and judgment scoring originated as a classroom
project, starting in 1980, that converted passive pupils into self-correcting,
highly successful achievers in two to nine months. Download as .xlsm or .xls.