I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information
functions (IIF). This post makes use of that observation to improve the
Nurse124 data set that generated the set of IIFs in Chart 75.
I observed that Tables 33 and 34, in the prior post,
contained no items with difficulties below 45%. The item information functions
(IIF) were also skewed (Chart 75). This is not the symmetrical display
associated with the Rasch IRT model. I reasoned that adding a balanced set of
items would increase the number of IIFs without changing the average item
difficulty.
Table 36a shows the addition of a balanced set of 22 items
to the Nurse124 data set of 21 items. As each lower-ranking item was added, one
or more higher-ranking items were added to keep the average test score near 80%.
In all, six lower-ranking items and 16 higher-ranking items were added, resulting
in an average score of 79% and 43 items total.
Table 36
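The balancing step is just weighted-average arithmetic. A minimal sketch with hypothetical difficulty values (the actual additions are in Table 36a, not reproduced here):

```python
# Hypothetical illustration of the balancing step (stand-in values, not the Table 36a items).
# Item difficulty is expressed as the percent of students marking the item right.
original   = [80] * 21                     # the 21 Nurse124 items, averaging about 80%
low_added  = [30, 35, 40, 42, 44, 45]      # six lower-ranking additions
high_added = [95] * 16                     # sixteen higher-ranking additions

supplemented = original + low_added + high_added
average = sum(supplemented) / len(supplemented)
print(len(supplemented), f"{average:.1f}%")  # 43 items, average still close to 80%
```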
The average item difficulty for the Nurse124 data set was
17.57 and for the expanded set 17.28. The targeted average test score of 80% came in at
79%. [I did not take the time to tweak the additions for a better fit.] Both item
difficulty and student score (ability) remained about the same.
The conditional standard error of measurement (CSEM) did
change with the addition of more items (Chart 79 below). The number of cells
containing information expanded from 99 to 204 (9 score ranks x 11 IIFs, and 12
score ranks x 17 IIFs). The average right-count student score increased from 17 to 34.
Table 36c shows the resulting item information functions
(IIF). The original set of 11 IIFs now expands to 17 IIFs (orange). The original set
of 9 different student scores now expands to 12 different scores; however, the
range of student scores is comparable between the two sets. This makes sense, as
the average test scores are similar and the student scores are also about the
same.
Table 37
Chart 77
Chart 77 (Table 37) shows the 17 IIFs as they spread across the
student ability range of 12 rankings (student score as right count/% right). The
trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near
the average test score of 79%. This was expected, as the maximum information
value within an IIF occurs when the item difficulty and the student ability match
on the logit scale: half the students ranked above this item, so the matching,
median-ability student earns about the average score of 79% on this easy test.
[The three bottom traces on Chart 77 (blue, red, and green) have been colored in
Table 37 as an aid in relating the table and chart (rotate Table 37
counter-clockwise 90 degrees).]
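The expected peak follows from the Rasch item information function, which is just p·(1 − p) and tops out at 0.25 where ability and difficulty match on the logit scale. A minimal sketch of that calculation (my own illustration, not the Chart 77 spreadsheet):

```python
import math

def rasch_p(theta, b):
    """Probability of a right mark for student ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information function: p * (1 - p), maximum 0.25 when theta equals b."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

b = 0.0  # an item sitting at the group's 50% difficulty point (zero logits)
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"ability {theta:+.1f} -> information {item_information(theta, b):.3f}")
# Information tops out at 0.250 where ability equals difficulty and falls off on either side.
```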
Even more important is the way the traces are increasingly
skewed the further the IIFs are from this maximum 11/50% trace (blue
square, Chart 77). The IIF with a difficulty of 18/82%, near the average
test score, also produced identical total information (1.41) in both the
Nurse124 and the supplemented data sets. These totals drifted apart between
the two data sets for IIFs of higher and lower difficulty.
Two IIFs near the 50% difficulty point delivered the maximum
information (2.17). Here again is evidence that prompts psychometricians to
work close to the 50% (zero logit) point to optimize their tools when
working on low-quality data: scoring limited to right counts, rather than also
offering students the option to assess their judgment, to report what is
actually meaningful and useful, and to assess their development toward being
successful, independent, high-quality achievers. [Students who only need some
guidance rather than endless "re-teaching"; who, for the most part, consider
right-count standardized tests a joke and a waste of time.]
Table 38
Chart 79
Chart 79 summarizes the relationships between the Nurse124
data, the supplemented data (adding a balanced set of items that keeps student
ability and item difficulty unchanged), and the CTT and IRT data reduction
methods. The IRT logit CSEM values (green) were plotted directly and inverted (1/CSEM)
for comparison. In general, CTT (blue) and inverted IRT (red) produced comparable CSEM values.
Adding 22 items increased the CTT Test SEM from 1.75 to
2.54. The standard deviation (SD) between student test scores increased from
2.07 to 4.46. The relative effect: 1.75/2.07 and 2.54/4.46, or 84% and 57% of an
SD, a difference of 27 percentage points, or an improvement in precision of
27/84, about 32%.
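Spelled out as a quick check (the small differences from the 84%, 57%, and 32% figures above are rounding):

```python
# Quick check of the precision comparison above (Test SEM relative to the SD of student scores).
sem_n124, sd_n124 = 1.75, 2.07   # Nurse124 set (21 items)
sem_sup,  sd_sup  = 2.54, 4.46   # supplemented set (43 items)

rel_n124 = sem_n124 / sd_n124                    # about 0.85 of an SD
rel_sup  = sem_sup / sd_sup                      # about 0.57 of an SD
improvement = (rel_n124 - rel_sup) / rel_n124    # about a third
print(f"{rel_n124:.2f}, {rel_sup:.2f}, improvement {improvement:.0%}")
```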
Chart 79 also makes it very obvious that the higher the
student test score, the lower the CTT CSEM: the more precise the student score
measurement, the less the error. That makes sense.
The above statement about the CTT CSEM must be related to a
second statement: the more item information, the greater the precision of
measurement by the item at that student score rank. The first statement
harvests variance from the central cell field, within rows of student (right)
marks (Table 36a) and within rows of probabilities of right marks (Table 36c).
The binomial variance CTT CSEM view is then comparable to the reciprocal or inverted
(1/CSEM) view of the test information function (Chart 79). CTT (blue,
CTT Nurse124, Chart 79) and inverted IRT (red, IRT N124 Inverted) produced
similar results even with an average test score of 79%, which is 29 percentage
points away from the 50%, zero logit, IRT optimum performance point.
The second statement harvests variance, the item information
functions, from columns of probabilities of right marks in Table 36c. Layering
one IIF on top of another across the student score distribution yields the test
information function (Chart 78). The Rasch IRT model harvests the variance from
rows and from columns of probabilities of a right answer that were generated
from the marginal student scores and item difficulties. CTT harvests the
variance of the marks students actually made. Yet, at the count-only right mark
level, they deliver very similar results, with the exception that IRT analysis
yields the IIFs, which CTT analysis does not (see the sketch below).
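To tie the row and column harvests together, here is a minimal sketch built on the standard Rasch formulas, with hypothetical abilities and difficulties rather than the VESEngine worksheets or the Nurse124 values. A student's row of p·q cells sums to a binomial-style variance whose square root behaves like the CTT CSEM view above (in count units), while each column of p·q cells is an item information function; stacked at a given ability they form the test information function, and the IRT CSEM in logits is 1/√(test information). That reciprocal relationship is why the inverted (1/CSEM) IRT curve can be laid alongside the CTT curve in Chart 79.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a right mark for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical stand-in values, not the Nurse124 / Table 36c numbers.
abilities    = [-1.0, 0.0, 1.0, 1.5]         # student ability ranks (logits)
difficulties = [-0.5, 0.0, 0.5, 1.0, 1.5]    # item difficulties (logits)

# Central cell field of probabilities: rows = students, columns = items.
P = [[rasch_p(theta, b) for b in difficulties] for theta in abilities]

for theta, row in zip(abilities, P):
    pq_row    = [p * (1.0 - p) for p in row]   # one cell of information (variance) per item
    test_info = sum(pq_row)                    # column harvest, stacked at this ability (TIF)
    ctt_csem  = math.sqrt(test_info)           # row harvest: binomial-style CSEM in count units
    irt_csem  = 1.0 / math.sqrt(test_info)     # IRT CSEM in logits
    print(f"ability {theta:+.1f}: CTT-style CSEM {ctt_csem:.2f}, "
          f"IRT CSEM {irt_csem:.2f}, inverted 1/CSEM {1.0 / irt_csem:.2f}")
```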