[[Second Pass, 8 July 2014. Equation 6.3 (cited below) in Statistical Test Theory for the Behavioral Sciences by Dato N.M. de Gruijter and Leo J. Th. van der Kamp, 2008, is the same as the calculation used in Table 29, in my 9 July 2014 post. On the following page they mention that the error variance is higher in the center and lower at the extremes. That distribution is the green curve on Chart 73. I did not see this relationship in the equation when this post was first posted, but do now in the visualized mathematical model (Chart 73).
Also, the discussion of Table 24 has been updated to match the terms and values in that table.]]
Working on the conditional standard error of measurement
(CSEM) is new territory for me. I always associated the CSEM with the Rasch
model IRT analysis commonly used by state departments of education when scoring
NCLB tests. I first had to Google for basic information.
If you are interested in the details, please check out these sources for sample (n − 1) equations. (Equation 6.14, which corrects the relative error variance, was not included in the 2005 edition but appears in the current 2008 edition. This represents significant progress in applying test precision.)
Absolute Error Variance: Equation 5.39, p. 73
Relative Error Variance: Equation 6.3, p. 83
Corrected Relative Error Variance: Equation 6.14, p. 91, or GED Equation 3, p. 9
My first surprise was to find I had already calculated the CSEM for the Nursing124 data when I put up Post 5 of this series (in Table 8, Interactions with Columns [Items] Variance, MEAN SS = 3.33) as I discovered five ways to harvest the variance [mean sum of squares (MSS)]. Equation 6.3 with n, Table 22, produces the same result (test SEM = 1.75) when it divides by n [unknown population] rather than n − 1 [observed sample].
[n = the item count. Test SEM = AVERAGE(CSEM).]
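Equation 6.3 and the n-versus-n − 1 divisor can be sketched in Python. This is a minimal sketch assuming the binomial error model form of Equation 6.3; the list of raw scores is hypothetical, not the actual Nursing124 marks:

```python
import math

def csem(x, n, divisor="n-1"):
    """Conditional SEM for raw score x on an n-item test (Equation 6.3).

    Dividing by n - 1 treats the marks as an observed sample;
    dividing by n treats them as the unknown population.
    """
    d = n - 1 if divisor == "n-1" else n
    return math.sqrt(x * (n - x) / d)

n = 21  # item count on this test

# The test SEM is the average of the student CSEM values.
# Hypothetical raw scores, for illustration only:
scores = [12, 14, 15, 16, 17, 17, 18, 19, 20]
test_sem = sum(csem(x, n, "n") for x in scores) / len(scores)
```

The curve peaks at the middle score (x = n/2) and falls to zero at the two extremes, the shape de Gruijter and van der Kamp describe and the green curve on Chart 73 shows.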
I then used what I learned in the last post to table data to obtain the conditional error variance for student scores (Table 23a). The 21 items in Table 22 became the number of right marks on each of 11 item difficulties in Table 23a. The values in this tabulation were then converted into frequencies conditional on the student scores; the sum of these added to one for each score (Table 23b).
The absolute error variance for each score was computed by Excel (=Var.P). Multiplying the absolute error variance (0.14382) by the square of the item count (21^2) yields the relative error variance (63.42). [Equation 5.39 (0.14382) * n^2 = Equation 6.3 (63.42)] The square root of the relative error variance of each score yields the CSEM for that score. [An alternate calculation of the absolute error variance is shaded in Table 23b. Here the variance was calculated first and that value divided by the squared score to obtain the absolute error variance. This helps explain multiplying the absolute error variance by the squared item count to obtain the relative error variance for each score.]
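The scaling between the two error-variance scales can be checked directly. The 0.14382 and 63.42 values are the ones quoted above; the per-score frequency tabulation itself is not reproduced here:

```python
import math

n = 21                   # item count
abs_error_var = 0.14382  # absolute error variance for one score (Excel =VAR.P)

# Equation 5.39 works on the proportion scale; multiplying by n^2
# rescales it to the raw-score scale of Equation 6.3.
rel_error_var = abs_error_var * n ** 2  # about 63.42
csem_for_score = math.sqrt(rel_error_var)
```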
The conditional frequency estimated test SEM was 1.68 (Table 23b). The conditional frequency CSEM values were different for students with the same score, so they had to be averaged to get results comparable with the other analyses. These values generated an irregular curve, unlike the smooth curves for the other analyses (Chart 61). The conditional frequency CSEM analysis is sensitive to the number of items with the same difficulty (yellow bars alternate for each change in value, Table 23b). The other analyses are not sensitive to item difficulty (yellow bars, in Table 22, include all students with the same score).
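The averaging step can be sketched as below; the (score, CSEM) pairs are hypothetical stand-ins for the Table 23b values:

```python
from collections import defaultdict

# Hypothetical (raw score, CSEM) pairs: students with the same score
# can have different conditional frequency CSEM values.
student_csems = [(17, 1.62), (17, 1.71), (18, 1.55), (18, 1.49), (19, 1.20)]

by_score = defaultdict(list)
for score, value in student_csems:
    by_score[score].append(value)

# One averaged CSEM per score, comparable with the other analyses.
avg_csem = {score: sum(vals) / len(vals) for score, vals in by_score.items()}
```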
Complete curves were generated from Equation 6.3 for n − 1 and for GED n − 1 (Table 24). The GED n − 1 analysis includes a correction factor (cf) for the range of item difficulties on the test [cf = (1 − KR20)/(1 − KR21)]. This factor is equal to one if all items are of equal difficulty. For the Nursing124 data it was 1.59; the difficulties ranged from 45% to 95%, from the middle of the total possible distribution to one extreme.
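The correction factor can be sketched from the KR-20 and KR-21 formulas. This is a hedged sketch: the function names are mine, and only the property stated above is asserted, that cf = 1 when all items share one difficulty (so KR-20 equals KR-21):

```python
def kr20(sum_item_variances, test_variance, n):
    """Kuder-Richardson 20: uses each item's variance p*(1 - p)."""
    return n / (n - 1) * (1 - sum_item_variances / test_variance)

def kr21(mean_score, test_variance, n):
    """Kuder-Richardson 21: assumes all items have the same difficulty."""
    return n / (n - 1) * (1 - mean_score * (n - mean_score) / (n * test_variance))

def correction_factor(kr20_value, kr21_value):
    """cf = (1 - KR20)/(1 - KR21), as used in the GED n - 1 analysis.

    When every item has the same difficulty, KR20 == KR21 and cf == 1.
    """
    return (1 - kr20_value) / (1 - kr21_value)
```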
The CSEM values from the six analyses are listed in Table 24. Five are fairly close to one another. The GED n − 1 analysis, with a correction for the range of item difficulties, is far different from the other five (Chart 61). Values could not be created for the full curve for conditional frequencies, as you must actually have student marks to calculate conditional frequency CSEM values. The gray area shows the values calculated from an equation for which there were no actual data. Equations produce nice-looking, “look right” reports.
The CSEM improves the reportable precision on this test over using the test SEM. Good judgment (best practice) is to correct the CSEM values as done in the GED n − 1 analysis.
[I did not transform the raw test score mean of 16.8, or 79.8%, to a scale score of 50% as was done by Setzer, 2009, GED, p. 6 and Tables 2 and 3. The GED n − 1 raw score cut point was 60%, which is comparable to most classroom tests. If 25% of the score is from luck on test day, that leaves 35% for what a student marked right as something known or could be done, as a worst case. If half of the lucky marks were also something the student knew or could do, the split would be about 10% for luck on test day and 50% for student ability.]
In Table 24, the GED n − 1 analysis test SEM of 2.98 for the Nursing124 data is, as a range, 2.98/21 or 14.19%. For the uncorrected Equation 6.3 n − 1 analysis, 1.79, the range is 1.79/21 or 8.52%. The n SEM was 1.75 or 8.33%. The n SEM range, 1.75, fits within the uncorrected n − 1 test SEM value, 1.79. The corrected GED n − 1 test SEM value, 2.98, exceeds it.
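Expressing a SEM as a percentage of the 21-item score range, as done in this paragraph, is one division:

```python
n = 21  # item count

def sem_as_percent(sem, n):
    """A SEM expressed as a percentage of the raw-score range."""
    return 100 * sem / n

ged_range = sem_as_percent(2.98, n)          # about 14.19%
uncorrected_range = sem_as_percent(1.79, n)  # about 8.52%
```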
Student score CSEM values are even more sensitive than the test SEM values. The maximum range for the GED n − 1 analysis is 3.73, or 3.73/21 or 17.76%, and for the Equation 6.3 n − 1 analysis, 2.35 or 11.19%. Both are beyond the maximum n CSEM value of 2.29 or 10.90%. This low-quality set of data fails to qualify as a means of setting classroom grades or a standardized test cut score.
[However, the classroom rule of 75% for passing the course and the rule for grades set at 10 percentage points overrule these statistics. Here is a good example that test statistics have meaning only in relation to how they are used. If the process of data reduction and reporting is not transparent, the resulting statistics are suspect and can produce extended debates over a passing score in the classroom.]
The CSEM for each student score does improve test precision. It can be calculated in several ways with close agreement. But it cannot improve the quality of the student marks on the answer sheets made under traditional, forced-choice, multiple-choice rules. These tests only rank students by the number of right marks. They do not ask students, or allow students to report, what they really know or can do, or their judgment in using what they know or can do.
The CCSS movement is now promoting learning at higher levels of thinking (problem solving) with, from what I have learned, some de-emphasis on lower levels of thinking that are the foundation for higher levels of thinking. A successful student cycles through all levels of thinking, as needed. Yet half of the CCSS testing will be at the lowest levels of thinking, traditional multiple-choice scoring. The other half will be as much of an overkill as traditional multiple-choice is an underkill in assessing student knowledge, skills, and student development to learn and apply their abilities. Others have this same concern that centralized politics (and dollars) will continue to overshadow the reality of the classroom.
There is a middle ground that makes every question function at higher levels of thinking, allows students to report what is meaningful, of value, and empowering, and has the speed, low cost, and precision of traditional multiple-choice. Knowledge and Judgment Scoring and partial credit Rasch model IRT are two examples. They both accommodate students functioning at all levels of thinking. Lower ability students do not have to guess their way through a test. With routine use, both can turn passive pupils into self-correcting, highly successful achievers in the classroom. If you are really into mastery learning, you can also try something like Knowledge Factor.
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):