Multiple-Choice Reborn: Customizing Test Precision

(Continued from the prior two posts.)

The past two posts have established that there is little difference between classical test theory (CTT) and item response theory (IRT) with respect to test reliability and conditional standard error of measurement (CSEM) estimates (other than the change in scales). IRT is now the analysis of choice for standardized tests. The Rasch IRT model is the easiest to use and also works well with small data sets, including classroom tests. How two normal scales for student scores and item difficulties are combined onto one IRT logit scale is no longer a concern to me, provided the same method is used throughout the duration of an assessment program.

Table 33

What is new and different from CTT is an additional insight from the IRT data in Table 32c (information p*q values). I copied Table 32 into Table 33 with some editing. I colored the cells holding the maximum amount of information (0.25) yellow in Table 33c. This color was then carried back to Table 33a, Right and Wrong Marks. [Item Information is related to the marginal cells in Table 33a (as probabilities), and not to the central cell field (as mark counts).] The eleven item information functions (in columns) were re-tabled into Table 34 and graphed in Chart 75. [Adding the information in rows yields the student score CSEM in Table 33c.]

Table 34

Chart 75

The Nurse124 data yielded an average test score of 16.8 marks or 80%. This skewed the item information functions away from the 50% or zero logit difficulty point (Chart 75). The more difficult the item, the more information it developed: from 0.49 at a 95% right count up to a maximum of 1.87 at the 54% and 45% right counts. [No item on the test had a difficulty of 50%.]

Table 35

Chart 76

The sum of information (59.96) by item difficulty level and student score level is tabled in Table 35 and plotted as the test information function in Chart 76. This test does not do a precise job of assessing student ability. The test was most precise (19.32) at the 16 right count/76% right location. [Location can be designated by measure (logit), input raw score (red) or output expected score (Table 33b).]

The items at the 18 right count/92% right difficulty level (Table 35) did not contribute the most information individually, but as a group of three items they did (9.17). The three highest scoring, easiest items (counts of 19, 20, and 21) are just too easy for a standardized test but may be important survey items needed to verify knowledge and skills for this class of high performing students. None of these three items reached the information level maximum of 1/4. [It now becomes apparent how items can be selected to produce a desired test information function.]

More information is interpreted as greater precision, or less error (a smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q)) in Table 33c. p*q is at a maximum when p = q, that is, when right = wrong: (RT x WG)/(RT + WG)^2 or (3 x 3)/36 = 1/4.]
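The bracketed arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function names and the three sample probabilities are illustrative assumptions and are not taken from Table 33.

```python
import math

def item_information(p):
    """Information for one student-item pair: p * q, where q = 1 - p."""
    return p * (1.0 - p)

def csem(probabilities):
    """CSEM for one student: 1 / sqrt(sum of p*q across the items)."""
    total_information = sum(item_information(p) for p in probabilities)
    return 1.0 / math.sqrt(total_information)

# Right = wrong (3 right, 3 wrong): (3 x 3) / (3 + 3)^2 = 1/4, the maximum.
right, wrong = 3, 3
max_info = (right * wrong) / (right + wrong) ** 2

# Hypothetical probabilities of a right mark for one student on three items.
sample = [0.5, 0.8, 0.95]
print(max_info, item_information(0.5), csem(sample))
```

Probabilities close to 0.5 add the most to the sum, so they shrink the CSEM fastest; very easy items (p near 1) add little.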

Each item information function spans the range of student scores on the test (Chart 76). Each item information function measures student ability most precisely near the point that item difficulty and student ability match (50% right) along the IRT S-curve. [The more difficult an item, the more ability students must have to mark correctly 50% of the time. Student ability is the number correct on the S-curve. Item difficulty is the number wrong on the S-curve (see more at Rasch Model Audit).]
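A short Python sketch shows why the peak sits where ability and difficulty match. It assumes the standard dichotomous Rasch form, P = 1/(1 + exp(-(ability - difficulty))); the difficulty of 1.0 logit and the ability grid are made-up values for illustration, not Nurse124 estimates.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a right mark under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def rasch_item_information(ability, difficulty):
    """Item information at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)

difficulty = 1.0  # hypothetical item difficulty in logits
abilities = [x / 2 for x in range(-6, 9)]  # -3.0 to 4.0 in steps of 0.5

# Trace the item information function and find where it peaks.
curve = [(a, rasch_item_information(a, difficulty)) for a in abilities]
peak_ability, peak_information = max(curve, key=lambda pair: pair[1])
print(peak_ability, peak_information)
```

On this grid the largest value falls at the ability equal to the item difficulty, where the probability of a right mark is 50% and p*q reaches 1/4.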

Extracting item information functions from a data table provides a powerful tool (a test information function) for psychometricians to customize a test (page 127, Maryland 2010). A test can be adjusted for maximum precision (minimum CSEM) at a desired cut point.

The bright side of this is that the concept of “information” (not applicable to CTT), and the ability to put student ability and item difficulty on one scale, gives psychometricians powerful tools. The dark side is that the form in which the test data is obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the NCLB era, as psychometrics has made marked improvements, the student mark data it is being supplied has remained in the casino arena: Mark an answer to each question (even if you cannot read or understand the question), do not guess, and hope for good luck on test day.

The concepts of information, item discrimination and computerized adaptive testing (CAT) all demand values hovering about the 50% point for peak psychometric performance. Standardized testing has migrated away from letting students report what they know and can do to a lottery that compares their performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day.

The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the fewer the students who fall below that point on the score distribution but might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without even showing the actual test questions or the test blueprint.]

It does not have to be that way. Knowledge and Judgment Scoring (classroom friendly) and the partial credit Rasch model (which is included in the software states use) can both update traditional multiple-choice to the levels of thinking required by the Common Core State Standards (CCSS) movement. We need an accurate, honest and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT.

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, October 8, 2014

Customizing Test Precision - Information Functions
