I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information
functions (IIF). This post makes use of that observation to improve the
Nurse124 data set that generated the set of IIFs in Chart 75.
I observed that Tables 33 and 34, in the prior post,
contained no items with difficulties below 45%. The item information functions
(IIF) were also skewed (Chart 75). This is not the symmetrical display
associated with the Rasch IRT model. I reasoned that adding a balanced set of
items would increase the number of IIFs without changing the average item
difficulty.
Table 36a shows the addition of a balanced set of 22 items
to the Nurse124 data set of 21 items. As each lower-ranking item was added, one
or more higher-ranking items were added to keep the average test score near 80%.
In all, six lower-ranking items and 16 higher-ranking items were added, resulting
in an average score of 79% and 43 items total.
Table 36
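The balancing step is just weighted-average arithmetic. A minimal sketch with hypothetical difficulty values (the actual additions are in Table 36a, not reproduced here):

```python
# Hypothetical illustration of the balancing step (stand-in values, not the Table 36a items).
# Item difficulty is expressed as the percent of students marking the item right.
original   = [80] * 21                     # the 21 Nurse124 items, averaging about 80%
low_added  = [30, 35, 40, 42, 44, 45]      # six lower-ranking additions
high_added = [95] * 16                     # sixteen higher-ranking additions

supplemented = original + low_added + high_added
average = sum(supplemented) / len(supplemented)
print(len(supplemented), f"{average:.1f}%")  # 43 items, average still close to 80%
```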
The average item difficulty for the Nurse124 data set was
17.57 and for the expanded set 17.28. The targeted average test score of 80% came in at
79%. [I did not take the time to tweak the additions for a better fit.] Both item
difficulty and student score (ability) remained about the same.
The conditional standard error of measurement (CSEM) did
change with the addition of more items (Chart 79 below). The number of cells
containing information expanded from 99 to 204 (9 score ranks x 11 IIFs, and 12
score ranks x 17 IIFs). The average right-count student score increased from 17 to 34.
Table 36c shows the resulting item information functions
(IIF). The original set of 11 IIFs now expands to 17 IIFs (orange). The original set
of 9 different student scores now expands to 12 different scores; however, the
range of student scores is comparable between the two sets. This makes sense, as
the average test scores are similar and the student scores are also about the
same.
Table 37
Chart 77
Chart 77 (Table 37) shows the 17 IIFs as they spread across the
student ability range of 12 rankings (student score as right count/% right). The
trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near
the average test score of 79%. This was expected, as the maximum information
value within an IIF occurs when the item difficulty and the student ability match
on the logit scale: half the students ranked above this item, so the matching,
median-ability student earns about the average score of 79% on this easy test.
[The three bottom traces on Chart 77 (blue, red, and green) have been colored in
Table 37 as an aid in relating the table and chart (rotate Table 37
counter-clockwise 90 degrees).]
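The expected peak follows from the Rasch item information function, which is just p·(1 − p) and tops out at 0.25 where ability and difficulty match on the logit scale. A minimal sketch of that calculation (my own illustration, not the Chart 77 spreadsheet):

```python
import math

def rasch_p(theta, b):
    """Probability of a right mark for student ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information function: p * (1 - p), maximum 0.25 when theta equals b."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)

b = 0.0  # an item sitting at the group's 50% difficulty point (zero logits)
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"ability {theta:+.1f} -> information {item_information(theta, b):.3f}")
# Information tops out at 0.250 where ability equals difficulty and falls off on either side.
```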
Even more important is the way the traces are increasingly
skewed the further the IIFs are from this maximum 11/50% trace (blue
square, Chart 77). The IIF with a difficulty of 18/82%, near the average
test score, also produced identical total information (1.41) in both the
Nurse124 and the supplemented data sets. These totals drifted apart between
the two data sets for IIFs of higher and lower difficulty.
Two IIFs near the 50% difficulty point delivered the maximum
information (2.17). Here again is evidence that prompts psychometricians to
work close to the 50% (zero logit) point to optimize their tools when
working on low-quality data: scoring limited to right counts, rather than also
offering students the option to assess their judgment, to report what is
actually meaningful and useful, and to assess their development toward being
successful, independent, high-quality achievers. [Students who only need some
guidance rather than endless "re-teaching"; who, for the most part, consider
right-count standardized tests a joke and a waste of time.]
Table 38
Chart 79
Chart 79 summarizes the relationships between the Nurse124
data, the supplemented data (adding a balanced set of items that keeps student
ability and item difficulty unchanged), and the CTT and IRT data reduction
methods. The IRT logit CSEM values (green) were plotted directly and inverted (1/CSEM)
for comparison. In general, CTT (blue) and inverted IRT (red) produced comparable CSEM values.
Adding 22 items increased the CTT Test SEM from 1.75 to
2.54. The standard deviation (SD) between student test scores increased from
2.07 to 4.46. The relative effect: 1.75/2.07 and 2.54/4.46, or 84% and 57% of an
SD, a difference of 27 percentage points, or an improvement in precision of
27/84, about 32%.
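Spelled out as a quick check (the small differences from the 84%, 57%, and 32% figures above are rounding):

```python
# Quick check of the precision comparison above (Test SEM relative to the SD of student scores).
sem_n124, sd_n124 = 1.75, 2.07   # Nurse124 set (21 items)
sem_sup,  sd_sup  = 2.54, 4.46   # supplemented set (43 items)

rel_n124 = sem_n124 / sd_n124                    # about 0.85 of an SD
rel_sup  = sem_sup / sd_sup                      # about 0.57 of an SD
improvement = (rel_n124 - rel_sup) / rel_n124    # about a third
print(f"{rel_n124:.2f}, {rel_sup:.2f}, improvement {improvement:.0%}")
```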
Chart 79 also makes it very obvious that the higher the
student test score, the lower the CTT CSEM: the more precise the student score
measurement, the less the error. That makes sense.
The above statement about the CTT CSEM must be related to a
second statement: the more item information, the greater the precision of
measurement by the item at that student score rank. The first statement
harvests variance from the central cell field, within rows of student (right)
marks (Table 36a) and within rows of probabilities of right marks (Table 36c).
The binomial variance CTT CSEM view is then comparable to the reciprocal or inverted
(1/CSEM) view of the test information function (Chart 79). CTT (blue,
CTT Nurse124, Chart 79) and inverted IRT (red, IRT N124 Inverted) produced
similar results even with an average test score of 79%, which is 29 percentage
points away from the 50%, zero logit, IRT optimum performance point.
The second statement harvests variance, the item information
functions, from columns of probabilities of right marks in Table 36c. Layering
one IIF on top of another across the student score distribution yields the test
information function (Chart 78). The Rasch IRT model harvests the variance from
rows and from columns of probabilities of a right answer that were generated
from the marginal student scores and item difficulties. CTT harvests the
variance of the marks students actually made. Yet, at the count-only right mark
level, they deliver very similar results, with the exception that IRT analysis
yields the IIFs, which CTT analysis does not (see the sketch below).
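To tie the row and column harvests together, here is a minimal sketch built on the standard Rasch formulas, with hypothetical abilities and difficulties rather than the VESEngine worksheets or the Nurse124 values. A student's row of p·q cells sums to a binomial-style variance whose square root behaves like the CTT CSEM view above (in count units), while each column of p·q cells is an item information function; stacked at a given ability they form the test information function, and the IRT CSEM in logits is 1/√(test information). That reciprocal relationship is why the inverted (1/CSEM) IRT curve can be laid alongside the CTT curve in Chart 79.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a right mark for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical stand-in values, not the Nurse124 / Table 36c numbers.
abilities    = [-1.0, 0.0, 1.0, 1.5]         # student ability ranks (logits)
difficulties = [-0.5, 0.0, 0.5, 1.0, 1.5]    # item difficulties (logits)

# Central cell field of probabilities: rows = students, columns = items.
P = [[rasch_p(theta, b) for b in difficulties] for theta in abilities]

for theta, row in zip(abilities, P):
    pq_row    = [p * (1.0 - p) for p in row]   # one cell of information (variance) per item
    test_info = sum(pq_row)                    # column harvest, stacked at this ability (TIF)
    ctt_csem  = math.sqrt(test_info)           # row harvest: binomial-style CSEM in count units
    irt_csem  = 1.0 / math.sqrt(test_info)     # IRT CSEM in logits
    print(f"ability {theta:+.1f}: CTT-style CSEM {ctt_csem:.2f}, "
          f"IRT CSEM {irt_csem:.2f}, inverted 1/CSEM {1.0 / irt_csem:.2f}")
```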