I used Table 8 (test reliability) as the foundation for the
test reliability engine (Table 9).
The whole point of doing so was to provide a means of seeing the
interactions when marks (Item scores of 1 and 0) are changed in a row or a
column.
I removed the six leftmost columns from Table 8, as they are
not needed after verifying the ANOVA table data in the previous post. The ANOVA
Between Row and Count values (yellow) are converted from the normal Between Row
and Count values.
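The conversion from normal values to ANOVA values is simply centering on the Grand Mean, which leaves the between-rows sum of squares unchanged. A minimal sketch of that equivalence, using a random stand-in matrix (the actual Table 8/9 data are not reproduced here):

```python
import numpy as np

# Hypothetical 22-student by 21-item mark matrix (1 = right, 0 = wrong);
# a random stand-in for the real Table 8/9 data.
rng = np.random.default_rng(0)
marks = (rng.random((22, 21)) < 0.8).astype(float)

grand_mean = marks.mean()          # plays the role of Table 9's 0.799
anova_values = marks - grand_mean  # normal values converted to ANOVA values

# Between-rows SS is the same either way, because centering on the
# grand mean shifts every cell by the same constant.
n_items = marks.shape[1]
ss_rows_normal = n_items * ((marks.mean(axis=1) - grand_mean) ** 2).sum()
ss_rows_anova = n_items * (anova_values.mean(axis=1) ** 2).sum()
print(abs(ss_rows_normal - ss_rows_anova))  # effectively zero
```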
The first thing I
noticed was that rounding errors are no longer a problem with everything on one
Excel worksheet. The results on Table 9 have been edited into prior posts.
Table 9 consists of the mark scores (1’s and 0’s) in a central cell field (22 students by 21 items). With the exception of the conversion from normal values to ANOVA values based on the Grand Mean (0.799), all other values are the same as on Table 8.
Test reliability is calculated with the KR20 and Cronbach’s
alpha (0.29) as shown on Table 6. Table 9 contains an explained ANOVA table for
between rows (student scores).
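The KR20/alpha calculation the worksheet performs can be sketched as follows (the function and variable names are mine, not the spreadsheet's; the tiny matrix is illustrative, not the Table 9 data):

```python
import numpy as np

def cronbach_alpha(marks: np.ndarray) -> float:
    """Cronbach's alpha; identical to KR-20 when every item is scored 0/1."""
    k = marks.shape[1]                         # number of items
    item_var = marks.var(axis=0, ddof=1)       # variance of each item column
    score_var = marks.sum(axis=1).var(ddof=1)  # variance of student total scores
    return (k / (k - 1)) * (1 - item_var.sum() / score_var)

# Tiny illustrative matrix (3 students by 2 items)
marks = np.array([[1, 1], [1, 0], [0, 0]], dtype=float)
print(round(cronbach_alpha(marks), 4))  # → 0.6667
```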
The second thing
I learned was that sorting 1’s and 0’s in item columns so that all 1’s were at
the top of the column and all 0’s were at the bottom produced a marked change
in test reliability. This did not change item difficulty.
Any item with all 1’s in one group and all 0’s in another is
set for maximum discrimination. Increasing discrimination increases test
reliability because increasing discrimination increases the variation within
student scores.
This makes sense. A test that accurately groups those who
know and those who do not know is more reliable than one in which the marks
scored 1 and 0 are mixed in a Guttman table.
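The sorting effect can be demonstrated directly: sorting each item column independently leaves every column mean (item difficulty) untouched, but piles the 1's onto the same students, inflating the variance of student scores that drives KR20/alpha. A sketch with a random stand-in matrix:

```python
import numpy as np

# Hypothetical mark matrix; the real Table 9 data are not reproduced here.
rng = np.random.default_rng(1)
marks = (rng.random((22, 21)) < 0.8).astype(float)

# Sort each item column independently so all 1's sit at the top.
sorted_marks = -np.sort(-marks, axis=0)

# Item difficulty (each column's mean) is unchanged by the sort...
assert np.allclose(marks.mean(axis=0), sorted_marks.mean(axis=0))

# ...but student total scores now pile up at the extremes, so the variance
# of student scores (the driver of KR20/alpha) grows.
print(marks.sum(axis=1).var(ddof=1), sorted_marks.sum(axis=1).var(ddof=1))
```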
Download TREngine for Mac and PC: TREngine.xls or TREngine.xlsm and
save it, or run it in your browser. (When it does not work, the operating
system frequently offers some helpful information.)
Deleting an item and replacing it to find which items
contribute the most, or the least, to test reliability has been automated.
Select the item number (ITEM #) in the bottom row of Table 9. Then click the
Toggle button for your results. Click the Toggle button again to restore the
item before selecting another item.
A scatter chart from all 21 single item deletions indicates
that difficulty is not the primary factor in test reliability. Deleting the two
most negative discriminating items increased test reliability the most.
Deleting the most discriminating item decreased test reliability the most. The
Spearman-Brown prediction formula predicted a test reliability of 0.28 when the
number of items is reduced from 21 to 20, as happens with each single-item
deletion. The test reliability for all 21 items was 0.29.
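The Spearman-Brown prediction is a one-line formula; a sketch of the 21-to-20-item case (function name is mine):

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when test length is multiplied by the factor k."""
    return k * r / (1 + (k - 1) * r)

# Dropping 1 of 21 items is a length factor of 20/21
print(round(spearman_brown(0.29, 20 / 21), 2))  # → 0.28
```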
The third thing I
learned was that a 22 by 21 matrix is very unstable. I could only detect this
with all four of the discussed statistics on one active Excel sheet. Changing a
single mark from right to wrong, or wrong to right, in more than 25 different
cells changed the test reliability of 0.29 to values ranging from a low of 0.21
to a high of 0.36. Cells around the edge of the cell field seemed to be the most
sensitive. This range in sensitivity suggests there is more information in this
matrix than just the variation harvested with the Mean SS or Variance. Winsteps harvests
unexpectedness from the matrix.
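The single-mark sensitivity sweep can be sketched as an exhaustive toggle over all 462 cells, recomputing alpha after each flip. The matrix here is a random stand-in, so the exact range will differ from the 0.21–0.36 observed with the real Table 9 data:

```python
import numpy as np

def cronbach_alpha(m: np.ndarray) -> float:
    """Cronbach's alpha (KR-20 for 0/1 marks)."""
    k = m.shape[1]
    return (k / (k - 1)) * (1 - m.var(axis=0, ddof=1).sum()
                            / m.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(2)
marks = (rng.random((22, 21)) < 0.8).astype(float)
base = cronbach_alpha(marks)

alphas = []
for i in range(marks.shape[0]):
    for j in range(marks.shape[1]):
        flipped = marks.copy()
        flipped[i, j] = 1 - flipped[i, j]  # toggle one mark
        alphas.append(cronbach_alpha(flipped))

# Baseline alpha, and the low/high reached by a single-mark change
print(base, min(alphas), max(alphas))
```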
Table 9 combines four education statistics (count, average,
standard deviation, and test reliability). It clearly shows that the more items
on the test (the more Variance summed) and the more discriminating the items,
the higher the test reliability. Table 9 also provides an easy way to explore
ALL of the effects of changing an item or even a single mark. I could not have
finished the last post without using it. Understanding is having relationships
in mind. Table 9 dynamically relates facts that, in the traditional case, are
usually presented in isolation.
[To use the Test Reliability Engine for combinations other
than a 22 by 21 table, adjust the central cell field and the values of N for
students and items. To enlarge the cell field, drag active cells over any new
similar cells. You may need to do additional tweaking. The
percent Student Score Mean and Item Difficulty Mean must be identical.
To reduce the cell field, use “Clear Contents” on the excess
columns and rows on the right and lower sides of the cell field. Include the six
cells that calculate SS below the items and to the right of the student
scores. Then manually reset the number of students and items. You may need
additional tweaking. The percent Student Score Mean and Item Difficulty Mean
must be identical.]
A password is used to prevent unwanted changes. The
password is “PUP522”.
- - - - - - - - - - - - - - - - - - - -
Free software to help you and your students
experience and understand how to break out from traditional multiple choice (TMC) to Knowledge
and Judgment Scoring (KJS) (tricycle to bicycle):