Wednesday, May 15, 2013

Visual Education Statistics - Visual Education Statistics Engine


The Visual Education Statistics Engine (VESEngine) contains all six of the commonly used education statistics (Table 15).

The relationship between the first five seems clear. Item discrimination, the sixth statistic in the series, needs a bit more work.

The six visual education statistics in the VESEngine (Table 15):
The Visual Education Statistics Engine

1. Count: The number of right marks for each student is listed under RT; the number of right marks for each item under RIGHT.
2. Average: The average student score is listed under SCORE MEAN; the average of right marks for each item under MEAN.
3. Standard Deviation: The standard deviation (SD) for student scores is listed under BETWEEN ROW OR STUDENT as N SD and N – 1 SD for large and small samples.
4. Test Reliability: The N – 1 test reliability is listed for KR20 and Cronbach's alpha. The N sources for the calculation are color coded. Select an ITEM # and then click the TR Toggle button to view the effect of removing an item from the test.
5. Standard Error of Measurement (SEM): The SEM calculation is listed with the N – 1 sources color coded. This ends the sequence of calculations dependent upon the previous statistic.
6. Item Discrimination: Click the Pr Toggle button to view the UNCORRECT and CORRECT N – 1 item discrimination values.

The VESEngine is now ready to explore a number of things and relationships. The goal is to make traditional multiple-choice measurements more meaningful and useful. You can start by changing single marks or pairs of marks. The engine will do the work of recalculating the entire table except for item discrimination; that requires clicking the Pr Toggle button.

I have been concerned with how the calculations were made as much as why they were being made. This series needs to end with consideration of what meaning is assigned to the calculations.  The six statistics present three different views:
Numbers You Can Count (Descriptive)
- COUNT and AVERAGE

A Combination of Count and Prediction
- STANDARD DEVIATION OF THE MEAN
- STANDARD ERROR OF MEASUREMENT

Predictive Ratios without Dimensions
- TEST RELIABILITY and ITEM DISCRIMINATION

I loaded a perfect Guttman table into VESEngine and renamed it VESEngineG (Table 16).
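A perfect Guttman table can be generated mechanically: the student with score s gets exactly the s easiest items right. A minimal sketch in Python (the 21 × 20 dimensions are those of Table 16; the function name is my own):

```python
import numpy as np

def guttman_table(n_students, n_items):
    """Perfect Guttman 0/1 table: student i marks the i easiest items right,
    so with n_students = n_items + 1 the scores run 0, 1, ..., n_items."""
    return (np.arange(n_items)[None, :] < np.arange(n_students)[:, None]).astype(int)

g = guttman_table(21, 20)
scores = g.sum(axis=1)
print(scores.mean())                  # 10.0 -> the 50% average score
print(round(scores.std(ddof=1), 2))   # 6.2  -> the 31% N - 1 SD
```

These printed values match the Guttman column of the comparison that follows.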

I compared the item analysis results from Nursing124 and a perfect Guttman table to get an idea of what the VESEngine could do.
Statistic                         Nursing124 (22x21)   Guttman Table (21x20)
Student Scores                    16.77 (80%)          10 (50%)
Test Reliability                  0.29                 0.95
Item Discrimination, Corrected    0.09                 0.52
Standard Deviation, N – 1         2.07 (9.86%)         6.20 (31.00%)
Standard Error of Measurement     1.74 (8.31%)         1.35 (6.77%)

The data sets represent two different types of classes. The Nursing124 data are from a class preparing for state licensure exams (80% average class score). Mastery is the only level of learning that matters. The Guttman table is both theoretical and near to the design used on standardized tests (50% average score). These average scores are descriptive statistics.

The values of the two predictive statistics, test reliability and item discrimination, are markedly different for the two tests. The Guttman table yielded a test reliability of 0.95, which puts it into the standardized-test range. It did this with an average item discrimination ability of only 0.52. The Nursing124 data resulted in an item discrimination ability of only 0.09. Both of these are corrected values. The value of 0.09 is just below the limit for detecting item discrimination (0.10) and is confirmed by the ANOVA F test as just below the limit for being different from (the many classroom and testing aspects of) chance. This makes sense.

[Power Up Plus (PUP) printed out a value of 0.26 for the average item discrimination. This is the uncorrected value for the Nursing124 data. This is the only error I found in PUP: the average item discrimination was not updated when the routine for correcting the item discrimination was added.]

The Nursing124 standard deviation (2.07, or 9.86%) is much smaller than the SD (6.20, or 31.00%) for the Guttman table. This makes sense: the mastery data have a much smaller range than the Guttman table data. What is most interesting is that, in spite of the larger SD for the Guttman table data, it resulted in a smaller SEM (1.35, or 6.77%) than the Nursing124 mastery data (1.74, or 8.31%).

Even though the Guttman table data have an SD three times that of the Nursing124 data, by having an item discrimination over five times that of the Nursing124 data they produced a standard error of measurement a bit less than the Nursing124 data's. This interaction makes more sense when visualized (Chart 26). The similarity of the SEMs indicates that widely differing tests can yield comparable results.
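The interaction can be checked by hand: SEM = SD × √(1 − reliability), so a high reliability can offset a large SD. A quick check in Python, using the values from the comparison above (the 0.95 Guttman reliability is rounded; the unrounded values, roughly 6.2048 and 0.9524, reproduce the table's 1.35):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from the SD and test reliability."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(2.07, 0.29), 2))      # Nursing124: 1.74
print(round(sem(6.2048, 0.9524), 2))  # Guttman table: 1.35
```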

Item discrimination has been improved over the years. With paper and pencil, the Pearson r was difficult enough to calculate. Computers enable calculations that remove the right mark on the item in hand from the related student score before calculating each item's discrimination ability, so no after-the-fact correction is needed. The difference between the uncorrected (past) and corrected (current) results is striking (Chart 27). Also see the previous post on item discrimination.
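That correction (removing the item's own right mark from the student score before correlating) can be sketched as a generic corrected point-biserial in Python; this is the standard technique, not the VESEngine's exact macro:

```python
import numpy as np

def item_discrimination(marks, corrected=True):
    """Pearson r between each item's 0/1 marks and the student scores.
    With corrected=True, the item's own mark is removed from each score
    before correlating, so no after-the-fact correction is needed."""
    marks = np.asarray(marks, dtype=float)
    scores = marks.sum(axis=1)
    r = []
    for j in range(marks.shape[1]):
        total = scores - marks[:, j] if corrected else scores
        r.append(np.corrcoef(marks[:, j], total)[0, 1])
    return np.array(r)
```

Including an item's own mark in the score correlates the item with itself and inflates the value, which is why uncorrected averages (such as PUP's 0.26) run higher than corrected ones (0.09).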

The literature often mentions that the best standardized test is one with many items near the cut score in difficulty and with a few widely scattered in difficulty. At this time I can see that the widely scattered items are needed to produce the desired range of scores. Many items near the cut score produce a lower SD and a lower SEM. You can use the VESEngine to explore different distributions of item difficulty and student ability.

Is there an optimum relationship in an imperfect world? Or will the safe way to proceed with standardized tests remain: 1. administer the test; 2. view the preliminary results; and 3. adjust to the desired final result? IMHO, this method in no way reduces the importance of highly skilled test makers working from predictions based on field tests or trial items included in operational tests.

[The VESEngine has two control buttons that function independently. The Pearson r Button refreshes item discrimination. The test reliability button (TR Toggle) removes a selected item from the test and then restores it on the second click.

Set up a smaller matrix by removing excess cells with Remove Contents, as shown on the perfect Guttman table (Table 16), where the rightmost column and the bottom row have been cleared of contents. The counts (blue) used in the student score mean and the item difficulty mean were then reset from 22 and 21 to 21 and 20.

Create a larger matrix by inserting rows within the table (not at the top or bottom). Insert columns at column S or 19. Then drag the adjacent active cells to complete the marginal cells. Finally, edit the TableX and TableY values in the two button macros (Macro1 and Macro2) to match the overall size of your table.

Please check your first results with care as I have found it very easy to confound results with typos and with unexpected changes in selected ranges, especially when copying and enlarging the VESEngine.]

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):