Wednesday, April 17, 2013

Visual Education Statistics - Test Reliability Engine


I used Table 8 (test reliability) as the foundation for the test reliability engine (Table 9).  The whole point of doing so was to provide a means of seeing the interactions when marks (Item scores of 1 and 0) are changed in a row or a column.

I removed the six leftmost columns from Table 8, as they are no longer needed after the ANOVA table data were verified in the previous post. The ANOVA Between Row and Count values (yellow) are converted from the normal Between Row and Count values.

The first thing I noticed was that rounding errors are no longer a problem with everything on one Excel worksheet. The results on Table 9 have been edited into prior posts.

Table 9 consists of the mark scores (1’s and 0’s) in a central cell field (22 students by 21 items). With the exception of the conversion from normal values to ANOVA values based on the Grand Mean (0.799), all other values are the same as on Table 8.
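The conversion from normal marks to ANOVA values, and the between-rows sum of squares built from them, can be sketched in a few lines, assuming the standard one-way ANOVA decomposition. The matrix below is a hypothetical 5-student by 4-item example, not the post's 22 by 21 data:

```python
import numpy as np

# Hypothetical 5-student by 4-item matrix of 1/0 marks (not the post's data).
marks = np.array([[1, 1, 0, 1],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [1, 1, 1, 0]], dtype=float)

grand_mean = marks.mean()            # the engine's value is 0.799
anova_values = marks - grand_mean    # normal marks converted to ANOVA values

n_students, n_items = marks.shape
row_means = marks.mean(axis=1)       # student score means
ss_between_rows = n_items * ((row_means - grand_mean) ** 2).sum()
ss_total = (anova_values ** 2).sum()

print(grand_mean, ss_between_rows, ss_total)
```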

Test reliability is calculated with both the KR20 and Cronbach’s alpha (0.29), as shown on Table 6. Table 9 contains an explained ANOVA table for between rows (student scores).
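For reference, KR-20 (which is identical to Cronbach’s alpha for 1/0 marks) can be computed directly from the mark matrix. This is a minimal sketch using a hypothetical 4-student by 3-item example, not the post’s 22 by 21 data:

```python
import numpy as np

def kr20(marks):
    """KR-20 reliability for a students-by-items matrix of 1/0 marks.
    For dichotomous items this equals Cronbach's alpha."""
    marks = np.asarray(marks, dtype=float)
    n_items = marks.shape[1]
    p = marks.mean(axis=0)                # item difficulties
    sum_pq = (p * (1 - p)).sum()          # sum of item variances (p times q)
    score_var = marks.sum(axis=1).var()   # variance of student scores
    return (n_items / (n_items - 1)) * (1 - sum_pq / score_var)

# Hypothetical 4-student, 3-item Guttman-ordered example:
marks = [[1, 1, 1],
         [1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
print(round(kr20(marks), 2))  # → 0.75
```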

The second thing I learned was that sorting 1’s and 0’s in item columns so that all 1’s were at the top of the column and all 0’s were at the bottom produced a marked change in test reliability. This did not change item difficulty.

Any item with all 1’s in one group and all 0’s in the other is set for maximum discrimination. Increasing discrimination increases test reliability because it increases the variation among student scores.

This makes sense. A test that accurately groups those who know and those who do not know is more reliable than one in which the marks scored 1 and 0 are mixed in a Guttman table.
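The sorting effect can be reproduced with a small KR-20 helper: independently sorting each item column so all the 1’s sit at the top leaves every item difficulty (column mean) unchanged, but raises the variance of the student scores, and with it the test reliability. The matrix below is hypothetical:

```python
import numpy as np

def kr20(marks):
    """KR-20 reliability (Cronbach's alpha for 1/0 marks)."""
    marks = np.asarray(marks, dtype=float)
    n_items = marks.shape[1]
    p = marks.mean(axis=0)
    sum_pq = (p * (1 - p)).sum()
    score_var = marks.sum(axis=1).var()
    return (n_items / (n_items - 1)) * (1 - sum_pq / score_var)

marks = np.array([[1, 1, 0, 1],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [1, 1, 1, 0]], dtype=float)

sorted_marks = -np.sort(-marks, axis=0)   # each column: all 1's at the top

print(marks.mean(axis=0), sorted_marks.mean(axis=0))  # difficulties identical
print(kr20(marks), kr20(sorted_marks))                # reliability jumps
```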

Download TREngine for Mac and PC: TREngine.xls or TREngine.xlsm and save, or run it in your browser. (If it does not work, the operating system frequently offers helpful information.)

Deleting an item and replacing it to find which items contribute the most, or the least, to test reliability has been automated. Select the item number (ITEM #) in the bottom row of Table 9. Then click the Toggle button for your results. Click the Toggle button again to restore the item before selecting another item.

A scatter chart from all 21 single-item deletions indicates that difficulty is not the primary factor in test reliability. Deleting the two most negatively discriminating items increased test reliability the most. Deleting the most discriminating item decreased test reliability the most. The Spearman-Brown prediction formula estimated that reducing the test from 21 items to 20 would, by itself, yield a test reliability of 0.28; the test reliability for all 21 items was 0.29.
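The Spearman-Brown step is simple enough to check by hand. A sketch of the standard prediction formula:

```python
def spearman_brown(r, n_old, n_new):
    """Predicted reliability when a test of n_old items is changed to n_new items."""
    k = n_new / n_old
    return k * r / (1 + (k - 1) * r)

# Dropping one item from the 21-item test with reliability 0.29:
print(round(spearman_brown(0.29, 21, 20), 2))  # → 0.28
```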

The third thing I learned was that a 22 by 21 matrix is very unstable. I could only detect this with all four of the discussed statistics on one active Excel sheet. Changing a single mark from right to wrong, or wrong to right, in over 25 different cells moved the test reliability from 0.29 to values ranging from a low of 0.21 to a high of 0.36. Cells around the edge of the cell field seemed to be the most sensitive. This range in sensitivity suggests there is more information in this matrix than just harvesting variation with the Mean SS or Variance. Winsteps harvests unexpectedness from the matrix.
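A single-mark sensitivity sweep like the one described can be sketched by toggling every cell in turn and recomputing KR-20 (again with a hypothetical 5 by 4 matrix; the post’s 0.21 to 0.36 range came from its own 22 by 21 data):

```python
import numpy as np

def kr20(marks):
    """KR-20 reliability (Cronbach's alpha for 1/0 marks)."""
    marks = np.asarray(marks, dtype=float)
    n_items = marks.shape[1]
    p = marks.mean(axis=0)
    sum_pq = (p * (1 - p)).sum()
    score_var = marks.sum(axis=1).var()
    return (n_items / (n_items - 1)) * (1 - sum_pq / score_var)

marks = np.array([[1, 1, 0, 1],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [1, 1, 1, 0]], dtype=float)

base = kr20(marks)
alphas = []
for i in range(marks.shape[0]):
    for j in range(marks.shape[1]):
        flipped = marks.copy()
        flipped[i, j] = 1 - flipped[i, j]   # toggle one mark
        alphas.append(kr20(flipped))

print(base, min(alphas), max(alphas))       # reliability swings cell by cell
```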

Table 9 combines four education statistics (count, average, standard deviation, and test reliability). It clearly shows that the more items on the test (the more Variance summed) and the more discriminating the items, the higher the test reliability. Table 9 also provides an easy way to explore ALL of the effects of changing an item or even a single mark. I could not have finished the last post without using it. Understanding is having relationships in mind. Table 9 dynamically relates facts which, in the traditional case, are usually presented in isolation.

[To use the Test Reliability Engine for combinations other than a 22 by 21 table requires adjusting the central cell field and the values of N for students and items. To enlarge the cell field, drag active cells over any new similar cells. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.

To reduce the cell field, use “Clear Contents” on the excess columns and rows on the right and lower sides of the cell field. Include the six cells that calculate SS below the items and to the right of the student scores. Then manually reset the number of students and items. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.]

A password is used to prevent unwanted changes from occurring. The password is “PUP522”.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out from traditional multiple choice (TMC) to Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):