## Wednesday, October 2, 2013

### Visual Education Statistics - Conditional Standard Error of Measurement

21

[[Second Pass, 8 July 2014.  Equation 6.3 (cited below) in Statistical Test Theory for the Behavioral Sciences by Dato N.M. de Gruijter and Leo J. Th. van der Kamp, 2008, is the same as the calculation used in Table 29, in my 9 July 2014 post. On the following page they mention that the error variance is higher in the center and lower at the extremes. That distribution is the green curve on Chart 73. I did not see this relationship in the equation when this post was first posted, but do now in the visualized mathematical model (Chart 73).

Also the discussion of Table 24 has been updated to match the terms and values in Table 24.]]

Working on the conditional standard error of measurement (CSEM) is new territory for me. I always associated the CSEM with the Rasch model IRT analysis commonly used by state departments of education when scoring NCLB tests. I first had to Google for basic information.

If you are interested in the details, please check out these sources for sample (n-1) equations. (Equation 6.14, which corrects the relative variance, was not included in the 2005 version but is in the current 2008 version. This represents significant progress in applying test precision.)

• Absolute Error Variance: Equation 5.39, p. 73
• Relative Error Variance: Equation 6.3, p. 83
• Corrected Relative Variance: Equation 6.14, p. 91, or GED Equation 3, p. 9

My first surprise was to find I had already calculated the CSEM for the Nursing124 data when I put up Post 5 of this series (in Table 8. Interactions with Columns [Items] Variance, MEAN SS = 3.33) as I discovered five ways to harvest the variance [mean sum of squares (MSS)]. Equation 6.3 n, Table 22, produces the same result (test SEM = 1.75) when it divides by n [unknown population] rather than n-1 [observed sample].

[n = the item count. Test SEM = AVERAGE(CSEM).]

I then used what I learned in the last post to table data to obtain the conditional error variance for student scores (Table 23a). The 21 items in Table 22 became the number of right marks on each of 11 item difficulties on Table 23a. The values in this tabulation were then converted into frequencies conditional on the student scores; the sum of which added to one, for each score (Table 23b).

The absolute error variance for each score was computed by Excel (=Var.P). Multiplying the absolute error variance (0.14382) by the square of the item count (21^2) yields the relative error variance (63.42). [Equation 5.39 (0.14382) * n^2 = Equation 6.3 (63.42)] The square root of the relative error variance of each score yields the CSEM for that score. [An alternate calculation of the absolute error variance is shaded in Table 23b. Here the variance was calculated first and that value divided by the squared score to obtain the absolute error variance. This helps explain multiplying the absolute error variance by the squared item count to obtain the relative error variance for each score.]
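The scaling step can be checked with a few lines of arithmetic. This is only a sketch of the multiplication described above, using the 0.14382 value and the 21-item count from the text; the variable names are mine. (Excel's =VAR.P is a population variance, which corresponds to `statistics.pvariance` in Python.)

```python
N_ITEMS = 21
absolute_error_variance = 0.14382  # value quoted in the text (Equation 5.39)

# Absolute error variance * n^2 = relative error variance (Equation 6.3).
relative_error_variance = absolute_error_variance * N_ITEMS ** 2
print(round(relative_error_variance, 2))  # 63.42

# Dividing by the squared item count reverses the step.
recovered = relative_error_variance / N_ITEMS ** 2
print(round(recovered, 5))  # 0.14382
```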

The conditional frequency estimated test SEM was 1.68 (Table 23b). The conditional frequency CSEM values for each score were different for students with the same score. The CSEM values had to be averaged to get results comparable with the other analyses. These values generated an irregular curve, unlike the smooth curve for the other analyses (Chart 61). The conditional frequency CSEM analysis is sensitive to the number of items with the same difficulty (yellow bars alternate for each change in value, Table 23b). The other analyses are not sensitive to item difficulty (yellow bars, in Table 22, include all students with the same score).

Complete curves were generated from Equation 6.3 for n-1 and for GED n-1 (Table 24). The GED n-1 analysis includes a correction factor (cf) for the range of item difficulties on the test [cf = (1 - KR20)/(1 - KR21)]. This factor is equal to one if all items are of equal difficulty. For the Nursing124 data it was 1.59; the difficulties ranged from 45% to 95%, from the middle of the total possible distribution to one extreme.
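For readers who want to see the pieces of the correction factor, here is a minimal sketch that computes KR20, KR21, and cf = (1 - KR20)/(1 - KR21) from a table of right (1) and wrong (0) marks. It assumes the textbook forms of KR20 and KR21 with a population (n) score variance in both; the function names and the tiny mark tables are hypothetical and are not the Nursing124 data.

```python
from statistics import mean, pvariance

def kr20(marks):
    """KR20 from rows of 0/1 marks (one row per student, one column per item)."""
    k = len(marks[0])
    scores = [sum(row) for row in marks]
    var_x = pvariance(scores)  # population variance of student scores
    p = [mean(row[i] for row in marks) for i in range(k)]  # item difficulties
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - sum_pq / var_x)

def kr21(marks):
    """KR21 uses only the item count, the mean score, and the score variance."""
    k = len(marks[0])
    scores = [sum(row) for row in marks]
    var_x = pvariance(scores)
    m = mean(scores)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var_x))

def correction_factor(marks):
    """cf = (1 - KR20)/(1 - KR21), as given in the text."""
    return (1 - kr20(marks)) / (1 - kr21(marks))

# Two items of equal difficulty (both 50%): cf is one.
equal_items = [[1, 1], [0, 0], [1, 0], [0, 1]]
# Three items with difficulties of 75%, 50%, and 25%: cf moves away from one.
mixed_items = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]

print(correction_factor(equal_items))
print(correction_factor(mixed_items))
```

In this sketch the ratio comes out at or below one, because KR21 never exceeds KR20 when both use the same score variance. The 1.59 reported for the Nursing124 data therefore depends on details of the Table 24 calculation that this sketch does not reproduce.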

The CSEM values from the six analyses are listed in Table 24. Five are fairly close to one another. The GED n-1, with a correction for the range of item difficulties, is far different from the other five (Chart 61). Values could not be created for the full curve for conditional frequencies as you must actually have student marks to calculate conditional frequency CSEM values. The gray area shows the values calculated from an equation for which there were no actual data. Equations produce nice looking, “look right”, reports.

The CSEM improves the reportable precision on this test over using the test SEM. Good judgment (best practice) is to correct the CSEM values as done on the GED n-1 analysis.

[I did not transform the raw test score mean of 16.8 or 79.8% to a scale score of 50% as was done by Setzer, 2009, GED, p. 6 and Tables 2 and 3. The GED n-1 raw score cut point was 60%, which is comparable to most classroom tests. If 25% of the score is from luck on test day, that leaves, as a worst case, 35% for marks that reflect something the student knew or could do. If half of the lucky marks were also something the student knew or could do, the split would be about 10% for luck on test day and 50% for student ability.]

In Table 24, the GED n-1 analysis test SEM of 2.98 for the Nursing124 data is, as a range, 2.98/21 or 14.19%. For the uncorrected Equation 6.3 n-1 analysis, 1.79, the range is 1.79/21 or 8.52%. The n SEM was 1.75 or 7.95%. The n SEM range, 1.75, fits within the uncorrected n-1 test SEM value, 1.79. The corrected GED n-1 test SEM value, 2.98, exceeds it.

Student score CSEM values are even more sensitive than the test SEM values. The maximum range for the GED n-1 analysis is 3.73 or 3.73/21 or 17.76% and for the Equation 6.3 n-1 analysis 2.35 or 11.19%. Both are beyond the maximum n CSEM value of 2.29 or 10.41%. This low-quality set of data fails to qualify as a means of setting classroom grades or a standardized test cut score.
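The percentage ranges are simply each SEM or CSEM value divided by the item count. The short sketch below repeats that arithmetic for the four values whose division by 21 is spelled out in the text; the labels and the helper name are mine.

```python
N_ITEMS = 21

def percent_of_items(sem, n_items=N_ITEMS):
    """Express an SEM or CSEM in raw-score units as a percent of the item count."""
    return round(100 * sem / n_items, 2)

# Values quoted in the text for the two n-1 analyses.
values = {
    "GED n-1 test SEM": 2.98,
    "Equation 6.3 n-1 test SEM": 1.79,
    "GED n-1 maximum CSEM": 3.73,
    "Equation 6.3 n-1 maximum CSEM": 2.35,
}

for label, sem in values.items():
    print(f"{label}: {sem} / {N_ITEMS} = {percent_of_items(sem)}%")
```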

[However, the classroom rule of 75% for passing the course and the rule for grades set at 10 percentage points overrule these statistics. Here is a good example that test statistics have meaning only in relation to how they are used. If the process of data reduction and reporting is not transparent, the resulting statistics are suspect and can produce extended debates over a passing score in the classroom.]

The CSEM for each student score does improve test precision. It can be calculated in several ways with close agreement. But it cannot improve the quality of the student marks on the answer sheets made under traditional, forced-choice, multiple-choice rules. These tests only rank students by the number of right marks. They do not ask students, or allow students to report, what they really know or can do; that is, their judgment in using what they know or can do.

The CCSS movement is now promoting learning at higher levels of thinking (problem solving) with, from what I have learned, some de-emphasis on the lower levels of thinking that are the foundation for higher levels of thinking. A successful student cycles through all levels of thinking as needed. Yet half of the CCSS testing will be at the lowest levels of thinking, traditional multiple-choice scoring. The other half will be as much of an overkill as traditional multiple-choice is an underkill in assessing student knowledge, skills, and student development to learn and apply their abilities. Others share this concern that centralized politics (and dollars) will continue to overshadow the reality of the classroom.

There is a middle ground that makes every question function at higher levels of thinking, allows students to report what is meaningful, of value, and empowering, and has the speed, low cost, and precision of traditional multiple-choice. Knowledge and Judgment Scoring and partial credit Rasch model IRT are two examples. They both accommodate students functioning at all levels of thinking. Lower ability students do not have to guess their way through a test. With routine use, both can turn passive pupils into self-correcting highly successful achievers in the classroom. If you are really into mastery learning, you can also try something like Knowledge Factor.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):