Wednesday, September 10, 2014

Conditional Standard Error of Measurement - Precision

(Continued from prior post.)

Table 32a contains two estimates (red) of the test standard error of measurement (SEM) that are in full agreement. One estimate, 1.75, is the average of the conditional standard errors of measurement (CSEM, green) for each student raw score. The other, 1.74, is the traditional estimate based on the test reliability, KR20. No problem here.
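
As a rough sketch of how those two routes to the test SEM can be computed (the raw scores below are hypothetical stand-ins, not the Nurse124 data, and the KR20 value is back-calculated from the post's 1.74 and 2.07 rather than stated there):

import math

N_ITEMS = 21                                         # a 21-item test, as in the post
scores = [12, 14, 15, 16, 17, 18, 18, 19, 20, 20]    # hypothetical raw scores

def binomial_csem(x, n=N_ITEMS):
    # Conditional SEM for one raw score x on an n-item test (binomial form).
    return math.sqrt(x * (n - x) / (n - 1))

def traditional_sem(sd, kr20):
    # Traditional test SEM from the standard deviation and KR20 reliability.
    return sd * math.sqrt(1.0 - kr20)

# Route 1: average the conditional values over the observed raw scores.
avg_csem = sum(binomial_csem(x) for x in scores) / len(scores)

# Route 2: the traditional estimate (SD from the post; KR20 = 0.29 is
# back-calculated from SEM 1.74 and SD 2.07, not reported in the post).
trad_sem = traditional_sem(sd=2.07, kr20=0.29)

print(round(avg_csem, 2), round(trad_sem, 2))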

The third estimate of the test SEM, in Table 32c, is different. It is based on CSEM values expressed in logits (natural log odds, base e = 2.718) rather than on the normal scale. The values are also inverted in relation to the traditional values in Table 32 (Chart 74). There is a small but important difference: the IRT CSEM values are much more linear than the CTT CSEM values. Also, the center of this plot is the mean of the number of items (Chart 30, prior post), not the mean of the item difficulties or student scores. [Most of this chart was calculated directly, as most of these relationships do not require actual data to be charted. Only nine score levels came from the Nurse124 data.]

Chart 74 shows the binomial CSEM values for CTT (normal scale) and the IRT (logit) values obtained by inverting the CTT values: “SEM(Rasch Measure in logits) = 1/SEM(Raw Score)”, 2007. I then adjusted each of these so the corresponding curves, on the same scale, crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The extreme values for no right and all right were not included. CSEM values for extreme scores go to zero (raw scores) or to infinity (logit measures), with the following result:
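
A minimal sketch of how the two curves in Chart 74 can be generated, using the binomial CSEM and the quoted inversion rule only (the post's further adjustment to make the curves cross near 1.75 and 0.64 is not reproduced, and whether this exact binomial form matches the charted CTT values is an assumption):

import math

N_ITEMS = 21

def csem_raw(x, n=N_ITEMS):
    # Binomial CSEM on the raw-score (normal) scale.
    return math.sqrt(x * (n - x) / (n - 1))

def csem_logit(x, n=N_ITEMS):
    # Logit CSEM by the quoted inversion rule: SEM(logits) = 1/SEM(raw score).
    return 1.0 / csem_raw(x, n)

# Tabulate both curves, skipping the extreme scores (0 and 21), where the
# raw-score CSEM goes to zero and the logit CSEM goes to infinity.
for x in range(1, N_ITEMS):
    print(x, round(csem_raw(x), 2), round(csem_logit(x), 2))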

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm

Precision, then, is not constant across the range of student scores under either method of analysis. The test SEM of 0.64 logits is comparable to 1.74 counts on the normal scale.

The estimate of precision, the CSEM, serves three different purposes. For both CTT and IRT, it narrows the range in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values estimates the test SEM as 1.75 counts out of a range of 21 items; this is less than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and useful.
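
For purpose (1), a minimal sketch of how a CSEM narrows the expected range around one observed raw score (the score of 17 and the 95% multiplier of 1.96 are illustrative assumptions, not values from the post):

import math

N_ITEMS = 21
observed = 17                                   # hypothetical raw score

# Binomial CSEM for this one score on the 21-item test.
csem = math.sqrt(observed * (N_ITEMS - observed) / (N_ITEMS - 1))

# An approximate 95% band: the observed score plus or minus 1.96 CSEM.
low, high = observed - 1.96 * csem, observed + 1.96 * csem
print(f"Score {observed} is expected to fall between {low:.1f} and {high:.1f}")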

For IRT analysis, the CSEM indicates the degree to which the data fit the perfect Rasch model (3). A better fit also results in more believable and useful results.

“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation of an imagined error distribution representing the possible distribution of observed values around their “true” theoretical value. This precision is based on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the measures or estimates correspond to a reference standard outside the data, in this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm

Precision also has some very practical limitations when delivering tests by computer adaptive testing (CAT). Linacre (2006) has prepared two very neat tables showing the number of items that must be on a test to obtain a desired degree of precision, expressed in logits and in confidence limits. The closer the test “targets” an average score of 50%, the fewer items are needed for a desired precision.
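
Linacre's tables are not reproduced here, but the shape of that relationship can be sketched. Assuming a Rasch model in which every item is answered with the same probability of success p, each item contributes p(1 − p) to the test information, so the logit SEM is about 1/√(n·p(1 − p)); on-target items (p near 0.5) carry the most information, and fewer of them are needed:

import math

def items_needed(target_sem_logits, p=0.5):
    # Approximate test length for a target logit SEM, assuming every item
    # is answered with probability p of success (information p * (1 - p)).
    info_per_item = p * (1.0 - p)
    return math.ceil(1.0 / (info_per_item * target_sem_logits ** 2))

print(items_needed(0.64))           # well-targeted test: p = 0.5
print(items_needed(0.64, p=0.8))    # off-target items require a longer test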

The two top students, with the same score of 20, missed items with different difficulties, yet both yield the same CSEM. The CSEM ignores the pattern of marks and the difficulty of items; a CSEM value obtained in this manner is related only to the raw score. Absolute values for the CSEM are, however, sensitive to item difficulty (Tables 23a and 23b).
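
A quick check of that point (the two answer patterns below are hypothetical, not the actual Nurse124 answer sheets): the binomial CSEM is computed from the raw score alone, so any two patterns with the same count of right marks get the same value:

import math

N_ITEMS = 21

def binomial_csem(pattern, n=N_ITEMS):
    # The raw score is just the count of right marks; the pattern itself
    # and the difficulty of the missed items never enter the formula.
    x = sum(pattern)
    return math.sqrt(x * (n - x) / (n - 1))

student_a = [1] * 20 + [0]      # score 20, missed the last item
student_b = [0] + [1] * 20      # score 20, missed the first item
print(binomial_csem(student_a), binomial_csem(student_b))   # identical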

The precision of a cut score has received increasing attention during the NCLB era. In part, court actions have made the work of psychometricians more transparent. The technical report for a standardized test can now exceed 100 pages. There has been a shift of emphasis from test SEM, to individual score CSEM, to IRT information as an explanation of test precision.

“(Note that the test information function and the raw score error variance at a given level of proficiency [student score], are analogous for the Rasch model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The conditional standard error of measurement is the inverse of the information function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.
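
A minimal sketch of that relationship for the Rasch model (the item difficulties below are hypothetical, not the Nurse124 items; the relation is commonly written with a square root, CSEM(θ) = 1/√I(θ), where I(θ) is the test information):

import math

# Hypothetical Rasch item difficulties in logits.
difficulties = [-1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5]

def rasch_p(theta, b):
    # Probability of a right answer under the Rasch model.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta, items=difficulties):
    # Test information: the sum of p * (1 - p) over the items.
    return sum(rasch_p(theta, b) * (1.0 - rasch_p(theta, b)) for b in items)

def csem_logits(theta):
    # CSEM in logits at a given proficiency level.
    return 1.0 / math.sqrt(information(theta))

for theta in (-2.0, 0.0, 2.0):
    print(theta, round(csem_logits(theta), 2))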

I cannot end this without repeating that this discussion of precision is based on traditional multiple-choice (TMC), which only ranks students, a casino operation. Students are not given the opportunity to include their judgment of what they know or can do that is of value to themselves, and their teachers, in future learning and instruction, as is done with essays, problem solving, and projects. This is easily done with knowledge and judgment scoring (KJS) of multiple-choice tests.


(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.



Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring of multiple-choice). This is a quick way to update your multiple-choice tests to meet Common Core State Standards (promoting understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.