Wednesday, May 22, 2013

Visual Education Statistics - Basic Relationships


                                                              11
The first ten posts in this series developed a visual education statistics (VES) engine that relates six statistics on one Excel spreadsheet. This post explores their relationships by switching right and wrong marks (1 and 0) in matched pairs and in unmatched single switches at increasing distances from the diagonal equator.

A Guttman table is an extreme distribution with each student receiving a different score. Each item also has a different difficulty. Item discrimination is set at the maximum. There is only one possible distribution for this 21 student by 20 item test (Table 17). (The Excel .xlsm or .xls version is available from Table17@nine-patch.com.)

The squared student score deviations are at zero at the test score mean and at a maximum (100) at the extremes. The opposite is the case for item sums of squares (SS) with a maximum of 5.24 at the mean of 10.5 and a minimum of 0.95 at the extremes. This makes sense as there is greater variation between student score extremes and less within item difficulty extremes (Table 17).
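The figures in the paragraph above can be checked outside Excel. The following is a minimal sketch in Python, not the VES engine itself. It assumes the perfect Guttman table gives the 21 students the scores 0 through 20, with each item answered right by every student whose score is at least the item's rank. The names `table`, `score_sq_dev` and `item_ss` are illustrative only.

```python
# Sketch: squared deviations in a perfect 21 student by 20 item Guttman table.
# Assumption: student s (score 0..20) marks item j (1..20) right when s >= j.
N_STUDENTS, N_ITEMS = 21, 20

table = [[1 if s >= j else 0 for j in range(1, N_ITEMS + 1)]
         for s in range(N_STUDENTS)]

# Student scores and their squared deviations from the score mean
scores = [sum(row) for row in table]
score_mean = sum(scores) / N_STUDENTS
score_sq_dev = [(x - score_mean) ** 2 for x in scores]


def item_ss(column):
    """Sum of squared deviations of the 1/0 marks in one item column."""
    mean = sum(column) / len(column)
    return sum((mark - mean) ** 2 for mark in column)


item_sums_of_squares = [item_ss(col) for col in zip(*table)]

print("student squared deviations:", min(score_sq_dev), "to", max(score_sq_dev))
print("item SS:", round(min(item_sums_of_squares), 2),
      "to", round(max(item_sums_of_squares), 2))
```

The student squared deviations run from 0 at the mean to 100 at the extremes, while the item SS runs the other way, from 0.95 for the easiest and hardest items to 5.24 for the items nearest the middle.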

The standard deviation (SD) of student scores decreased (6.205 to 6.050) as matched pair switching progressed from the mean to the extreme in a linear manner (Chart 28). This makes sense as the student score deviations normally increase at the extremes. Switching marks reduced these extremes.

Test reliability also fell as matched pair switching progressed from the mean to the extreme in a linear manner (Chart 29). This makes sense as the student score N MEAN SS decreased as the switching progressed from the mean to the extreme (36.381 to 34.857 or 1.524) and as the item N MEAN SS only decreased (-3.492 to -3.574 or 0.082).

The standard error of measurement (SEM) increased linearly (1.354 to 1.423) as the switching progressed from the mean to the extreme (Chart 30). This too makes sense as a decrease in test reliability is related to an increase in the SEM.
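The starting values for these scans (before any marks are switched) can be reproduced with a short Python sketch. It is not the spreadsheet's own code. It assumes the same perfect Guttman layout as Table 17 (scores 0 through 20), uses the N – 1 SD for student scores, and computes KR20 from N variances. The function names are hypothetical.

```python
import math


def guttman_table(n_students=21, n_items=20):
    """Perfect Guttman table: student s marks item j right when s >= j."""
    return [[1 if s >= j else 0 for j in range(1, n_items + 1)]
            for s in range(n_students)]


def score_sd(marks):
    """N - 1 standard deviation of the student scores."""
    scores = [sum(row) for row in marks]
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(ss / (len(scores) - 1))


def kr20(marks):
    """KR20 test reliability from the score variance and the item p*q sum."""
    n, k = len(marks), len(marks[0])
    scores = [sum(row) for row in marks]
    mean = sum(scores) / n
    score_var = sum((x - mean) ** 2 for x in scores) / n  # N variance
    sum_pq = 0.0
    for col in zip(*marks):
        p = sum(col) / n          # item difficulty (portion right)
        sum_pq += p * (1 - p)     # item variance
    return (k / (k - 1)) * (1 - sum_pq / score_var)


def sem(marks):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return score_sd(marks) * math.sqrt(1 - kr20(marks))


table = guttman_table()
print(round(score_sd(table), 3), round(kr20(table), 3), round(sem(table), 3))
```

Changing any single mark in `table` and rerunning the three functions gives a rough stand-in for one step of the switching scans described above.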

Item discrimination (KR20 and Pearson r) decreased in a non-linear manner (Chart 31) as the switching progressed from the mean to the extreme (from 0.676 to 0.637). This also makes sense as the greater the change from a perfect Guttman table, the lower the item discrimination. Switched marks that are the farthest from the diagonal equator are the most unexpected marks.

A second scan of the Guttman table with an unbalanced single switch of right and wrong produced the same relationships as the balanced switch scan. The spreadsheet (Table 16) needed to be set to three decimal places to capture the detail with a minimum of rounding errors (Table 17).

The VES engine is showing three linear relationships (SD, test reliability, and SEM) and one nonlinear relationship (item discrimination). Just one switch of 1 to 0 or 0 to 1 can be detected in all four statistics. I find it interesting that such detail can be captured from a 21 x 20 table.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Wednesday, May 15, 2013

Visual Education Statistics - Visual Education Statistics Engine


                                                                    10

The Visual Education Statistics Engine (VESEngine) contains all six of the commonly used education statistics (Table 15).


The relationship between the first five seems clear. Item discrimination, the sixth statistic in the series, needs a bit more work.

The six visual education statistics in the VESEngine (Table 15):
1.     Count
The number of right marks for each student is listed under RT; the number of right marks for each item by RIGHT.
2.     Average
The average student score is listed under SCORE MEAN; the average of right marks for each item by MEAN.
3.     Standard Deviation
The standard deviation (SD) for student scores is listed under BETWEEN ROW OR STUDENT as N SD and N – 1 SD for large and small samples.
4.     Test Reliability
The N – 1 test reliability is listed for KR20 and Cronbach’s alpha.  The N sources for the calculation are color coded. Select an ITEM # and then click the TR Toggle button to view the effect of removing an item from the test.
5.     Standard Error of Measurement (SEM)
The SEM calculation is listed with the N – 1 sources color coded. This ends the sequence of calculations dependent upon the previous statistic.
6.     Item Discrimination
Click the Pr Toggle button to view the UNCORRECT and CORRECT N – 1 item discrimination values.

The VESEngine is now ready to explore a number of things and relationships. The goal is to make traditional multiple-choice measurements more meaningful and useful. You can start by changing single marks or pairs of marks. The engine will do the work of recalculating the entire table except for item discrimination; that requires clicking the Pr Toggle button.
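For readers who prefer code to a spreadsheet, the six statistics can be sketched in Python as below. This is an assumption-laden outline, not a port of the VESEngine. It takes a table of 1/0 marks (rows are students, columns are items), treats the corrected item discrimination as the Pearson r between an item and the student score with that item's mark removed, and assumes the student scores are not all identical. The names `pearson_r` and `ves_statistics` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson r from deviation cross products and sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    if ssx == 0 or ssy == 0:
        return None  # undefined when one of the two columns has no variation
    return sxy / math.sqrt(ssx * ssy)


def ves_statistics(marks):
    """Six statistics from a table of 1/0 marks (rows = students)."""
    n, k = len(marks), len(marks[0])
    columns = [list(col) for col in zip(*marks)]
    scores = [sum(row) for row in marks]          # 1. count per student
    rights = [sum(col) for col in columns]        # 1. count per item
    score_mean = sum(scores) / n                  # 2. average
    ss = sum((s - score_mean) ** 2 for s in scores)
    sd_n = math.sqrt(ss / n)                      # 3. SD, large sample
    sd_n1 = math.sqrt(ss / (n - 1))               # 3. SD, small sample
    sum_pq = sum((r / n) * (1 - r / n) for r in rights)
    kr20 = (k / (k - 1)) * (1 - sum_pq / (ss / n))      # 4. test reliability
    sem = sd_n1 * math.sqrt(max(0.0, 1 - kr20))         # 5. SEM
    uncorrected = [pearson_r(col, scores) for col in columns]   # 6.
    corrected = [pearson_r(col, [s - m for s, m in zip(scores, col)])
                 for col in columns]              # 6. item mark removed first
    return {"scores": scores, "rights": rights, "score_mean": score_mean,
            "sd_n": sd_n, "sd_n1": sd_n1, "kr20": kr20, "sem": sem,
            "uncorrected": uncorrected, "corrected": corrected}


# A tiny 4 student by 3 item Guttman table as a usage example
example = ves_statistics([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
print(example["kr20"], example["corrected"])
```

Editing one mark in the input table and calling `ves_statistics` again mimics changing a cell in the engine, with the difference that item discrimination is refreshed on every call.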

I have been concerned with how the calculations were made as much as why they were being made. This series needs to end with consideration of what meaning is assigned to the calculations.  The six statistics present three different views:
Numbers You can Count (Descriptive):
COUNT and AVERAGE

A Combination of Count and Prediction:
STANDARD DEVIATION OF THE MEAN and STANDARD ERROR OF MEASUREMENT

Predictive Ratios without Dimensions:
TEST RELIABILITY and ITEM DISCRIMINATION

I loaded a perfect Guttman table into VESEngine and renamed it VESEngineG (Table 16).


Download free from http://www.nine-patch.com/download/VESEngine.xlsm or .xls (Table 15).
Download free from http://www.nine-patch.com/download/VESEngineG.xlsm or .xls (Table 16).


I compared the item analysis results from Nursing124 and a perfect Guttman table to get an idea of what the VESEngine could do.
Statistic                        Nursing124 (22x21)    Guttman Table (21x20)
Student Scores                   16.77 (80%)           10 (50%)
Test Reliability                 0.29                  0.95
Item Discrimination Corrected    0.09                  0.52
Standard Deviation, N – 1        2.07 (9.86%)          6.20 (31.00%)
Standard Error of Measurement    1.74 (8.31%)          1.35 (6.77%)

The data sets represent two different types of classes. The Nursing124 data are from a class preparing for state licensure exams (80% average class score). Mastery is the only level of learning that matters. The Guttman table is both theoretical and near to the design used on standardized tests (50% average score). These average scores are descriptive statistics.

The values of the two predictive statistics, test reliability and item discrimination, are markedly different for the two tests. The Guttman table yielded a test reliability of 0.95 that puts it into a standardized test ranking. It did this with an average item discrimination ability of only 0.52. The Nursing124 data resulted in an item discrimination ability of only 0.09. Both of these values are corrected values. The value of 0.09 is just below the limit for detecting item discrimination (0.10) and is confirmed by the ANOVA F test as just below the limit for being different from (the many classroom and testing aspects of) chance. This makes sense.

[Power Up Plus (PUP) printed out a value of 0.26 for the average item discrimination. This is the uncorrected value for the Nursing124 data. This is the only error I found in PUP: The average item discrimination was not updated when the routine for correcting the item discrimination was added.]

The Nursing124 data Standard Deviation (2.07 or 9.86%) is much smaller than the SD (6.20 or 31.00%) for the Guttman table. This makes sense. The mastery data have a much smaller range than the Guttman table data. What is most interesting is that in spite of the larger SD range for the Guttman table data, it resulted in a smaller SEM (1.35 or 6.77%) than the Nursing124 mastery data (1.74 or 8.31%).

Even though the Guttman table data have a SD 3 times that of the Nursing124 data, by having an item discrimination over 5 times the Nursing124 data, they produced a Standard Error of Measurement a bit less than the Nursing124 data. This interaction makes more sense when visualized (Chart 26). The similarity of the SEMs indicates that widely differing tests can yield comparable results. 

Item discrimination has been improved over the years. With paper and pencil, the Pearson r was difficult enough. Computers enable calculations that remove the right mark on the item in hand from the related student score before calculating each item’s discrimination ability. No correction is needed. The difference in uncorrected past and corrected current results is striking (Chart 27). Also see the previous post on item discrimination.

The literature often mentions that the best standardized test is one with many items near the cut score in difficulty and with a few widely scattered in difficulty. At this time I can see that the widely scattered items are needed to produce the desired range of scores. Many items near the cut score produce a lower SD and a lower SEM. You can use the VESEngine to explore different distributions of item difficulty and student ability.

Is there an optimum relationship in an imperfect world? Or will the safe way to proceed with standardized tests remain: 1. Administer the test; 2. View the preliminary results; and 3. Adjust to the desired final result? IMHO, this method in no way reduces the importance of highly skilled test makers working from predictions based on field tests or trial items included in operational tests.


[The VESEngine has two control buttons that function independently. The Pearson r Button refreshes item discrimination. The test reliability button (TR Toggle) removes a selected item from the test and then restores it on the second click.

Set a smaller matrix by removing excess cells with Remove Contents, as shown on the perfect Guttman table (Table 16) where the rightmost column and lowest row have been cleared of contents. The student score mean and item difficulty mean (blue) were then reset from 22 and 21 to 21 and 20.

Create a larger matrix by inserting rows within the table (not at the top or bottom). Insert columns at column S or 19. Then drag the adjacent active cells to complete the marginal cells. Finally edit the two button TableX and TableY values in Macro1 and Macro2 to match the overall size of your table.

Please check your first results with care as I have found it very easy to confound results with typos and with unexpected changes in selected ranges, especially when copying and enlarging the VESEngine.]


Wednesday, May 8, 2013

Visual Education Statistics - Item Discrimination Engine


                                                                     9
Statistic Six: Item discrimination, the last statistic in this series of posts, captures the ability of an item to group students by what they know (and by what they have yet to learn, with Knowledge and Judgment Scoring or partial credit Rasch model scoring). Previous posts have indicated that this ability may be primary in selecting items for standardized tests. It is also important in the classroom. Discriminating items produce the spread of scores needed for setting grades in schools designed for failure.

I left this statistic to last as it is a bit different from the others. It is more complex and difficult to calculate. However, the standard error of measurement (SEM) engine, post 8, only needed one more step to have the numbers in hand to calculate the Pearson r estimate of item discrimination.

Pearson worked out his item discrimination in a manner that follows the previous posts. He did this by 1895, long before we had personal computers. As a consequence we now have two versions, called the original uncorrected estimate (Excel Pearson function) and the corrected estimate. There is also a shortcut for traditional multiple-choice (TMC) tests: the point biserial r (PBR), which I consider at the end of this post.

A visual presentation of the Pearson item discrimination calculation follows (see Table 11 for the calculations).

First, the right marks in the Item 4 column on the Guttman table (Table 12) are counted (10 out of 22), the average obtained (0.45), and the deviations from the mean obtained (Chart 20).

The same process is carried out on the student score column (RT of 369 and SCORE MEAN of 16.77 for the 22 students, see Chart 21).

When each of these two charts is summed, it adds to zero. This time the individual values are not squared to make them all positive as in Charts 22 (scores) and 23 (items). Instead the related item and score deviations are multiplied to produce positive and negative values (Chart 24 and Table 11) that sum to 13.27.


The item discrimination is then a ratio between two sums of squares (SS). This operation is carried out for each item on the test:


Multiplying the two SSs in the denominator (after taking their square roots) changes negative values to positive values and yields a grand SS (2.34 x 9.49 = 22.21). The resulting ratio is the discrimination ability of the item. It can range from a minus one to a positive one. Values above 0.9 are characteristic of standardized tests. Values for classroom tests will be discussed later.
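The ratio described above (the sum of the item and score deviation cross products, divided by the product of the square roots of the two sums of squares) can be sketched in Python. The function name `item_discrimination` is illustrative. The last lines simply plug in the Item 4 figures quoted in this post (13.27, 2.34 and 9.49); they are not recomputed from the Nursing124 marks, which are not reproduced here.

```python
import math


def item_discrimination(item_marks, student_scores):
    """Uncorrected Pearson r between one item column and the student scores."""
    n = len(item_marks)
    item_mean = sum(item_marks) / n
    score_mean = sum(student_scores) / n
    item_dev = [m - item_mean for m in item_marks]        # sums to zero
    score_dev = [s - score_mean for s in student_scores]  # sums to zero
    # Cross products can be positive or negative, unlike squared deviations
    cross = sum(i * s for i, s in zip(item_dev, score_dev))
    ss_item = sum(i * i for i in item_dev)
    ss_score = sum(s * s for s in score_dev)
    return cross / (math.sqrt(ss_item) * math.sqrt(ss_score))


# Item 4, using the rounded values quoted in the post
r_item4 = 13.27 / (2.34 * 9.49)
print(round(r_item4, 3))
```

With the quoted values the ratio is close to 0.6, in line with the point biserial r worked out later in this post.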

Table 12 contains an Item Discrimination Engine you can use to explore the discrimination ability of individual items. [Download free from http://www.nine-patch.com/download/IDEngine.xlsm or .xls]

The point biserial r (PBR) provides an additional glimpse into what is taking place (Table 13).  The difference between the average student scores for right marks and wrong marks (18.1 – 15.67 = 2.43) is standardized by dividing by the standard deviation (2.43/2.07 = 1.176). Multiplying the difference between right and wrong mark means in standard units (1.176) by the square root of the product of the proportions (p and q) of right and wrong marks, Sqrt(0.45 x 0.55) = Sqrt(0.2475) = 0.4975, yields the PBR item discrimination of 0.59.
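The PBR arithmetic above is short enough to sketch in Python. The function name `point_biserial` is hypothetical, and the inputs are the rounded values quoted in this post, so the result lands near 0.58 to 0.59 depending on how the intermediate values are rounded.

```python
import math


def point_biserial(mean_right, mean_wrong, sd, p):
    """Point biserial r: standardized mean difference times sqrt(p * q)."""
    q = 1 - p
    return (mean_right - mean_wrong) / sd * math.sqrt(p * q)


# Item 4: mean score of students marking right and wrong, score SD, portion right
pbr_item4 = point_biserial(18.1, 15.67, 2.07, 0.45)
print(round(pbr_item4, 3))
```

The square root step matters: using 0.2475 (p x q) instead of its square root would cut the result roughly in half.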









The real value or meaning of an item discrimination rank seems to be a matter of tradition and advances in computing power. PUP 5.20 prints out corrected item discrimination values that I gave the following rankings for my classroom tests:



[The PBR only works for traditional multiple-choice, that only ranks students. PUP contains the Pearson r that is required for Knowledge and Judgment Scoring, an actual assessment of what students know and can do, that is meaningful and useful in future assignments.]

Item discrimination weights each right and wrong mark with the related student score. Different column mark patterns produce different results. Unlike test reliability, when calculating item discrimination the order, or pattern, of marks is important. Items of the same difficulty can have very different discrimination ability, for example, items 11, 14, 15, 16 and 18 with a difficulty of 91% and a range of item discrimination of -0.02 to 0.58 (Chart 25).


Selecting difficult items is not sufficient to maximize test reliability. The primary need is to write discriminating items. The Nursing124 data delivered discriminating items at all levels of difficulty from 45% to 91% (Chart 25).

The item discrimination results seemed to me to be as unpredictable as test reliability results. IMHO only a visual education statistics engine that combines all six statistics can readily display the interactions.

The standard error of student score measurement (SEM), the test reliability (KR20, and alpha), and the item discrimination (Pearson and PBR) have unpredictable interactions. The Test Performance Profile from PUP 5.20 brings these together in one table for easy use in the classroom by students and teachers (and other interested persons) but lacks the flexibility of a single sheet spreadsheet engine.

[PUP 5.20 only prints the PBR ranks as an efficient aid for teachers. An additional aid is provided by sorting the discriminating items on PUP 5.20, sheet 3a. Student Counseling Mark Matrix with Mastery/Easy, Unfinished, and Discriminating (MUD) Analysis.]


Wednesday, May 1, 2013

Visual Education Statistics - Teacher Effectiveness


                                                                 8
The information that needed to be related in post 7 became too long for one post. Post 7 contains the SEMEngine: all five of the related statistics on one spreadsheet.  This post relates a collection of stuff that gives those statistics additional meaning, a bit of understanding needed to use them properly.

The SEMEngine, in the previous post, can produce the unpredictable statistics relevant to classroom tests and standardized tests. But a full understanding of these statistics requires a discussion of a second standard error and the two methods of scoring multiple-choice (traditional multiple-choice, TMC, and Knowledge and Judgment Scoring, KJS): partial and full disclosure of what a student knows and can do.

The standard deviation (SD) of the group test score and the standard error of measurement (SEM) of the average student test score provide guidance in constructing standardized tests as predictive inputs. These statistics are also helpful in describing classroom test results. The first refers to test results from the class or group taking the test, the average group score; the second, to the average student score in the class. They are two different perspectives of the same average score. They have different uses.

There is a second standard error, the standard error of the mean (SE), that permits comparison between group test scores. [I am belaboring this topic as the two standard errors (of the mean and of measurement), the abbreviation (SEM) and even the SD can get confused (Standard error and Standard error vs. Standard error of measurement).]

Chart 18 shows how the SEM of the average student score is reduced as more equivalent items are added to the Cantrell data of 14 items. A 50 item test is expected to yield a SEM of 5.15%. This is less than 1/3 the range of the SD. But even this would require an improvement of 3 x 5.15 = 15.45% for a significant increase in performance from one year or one test to the next. That is 1.5 times a traditional letter grade. To my knowledge, very few standardized tests use 50 items in any topic or skill area.

Chart 19 shows how the SE of the classroom or group test score is reduced as more equivalent items are added to the Cantrell data.  The SE has a finer resolution than the SEM. An improvement in class performance on a 50 item test, 3 x 2.57 = 7.71% would require only about a 3/4 letter grade to show a significant difference in the two test scores from two different classes or one class at two different times. This shows that it is easier to show a significant difference between the average scores from two tests than it is between two scores from the same student.
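The two thresholds above are simple arithmetic, sketched here in Python. The multiplier of 3 and the standard errors for a 50 item test (5.15% and 2.57%) come from this post; the 10% per letter grade is the traditional range mentioned in the note that follows. The names are illustrative only.

```python
LETTER_GRADE_PCT = 10.0  # traditional score range per letter grade


def significant_gain(standard_error_pct, multiplier=3):
    """Improvement (in percent) needed before a change counts as significant."""
    return multiplier * standard_error_pct


student_gain = significant_gain(5.15)  # from the SEM of a 50 item test
class_gain = significant_gain(2.57)    # from the SE of a 50 item test

print(round(student_gain, 2), round(student_gain / LETTER_GRADE_PCT, 2))
print(round(class_gain, 2), round(class_gain / LETTER_GRADE_PCT, 2))
```

The student-level threshold works out to about one and a half letter grades, the class-level threshold to about three quarters of one.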

[The above can be generalized to support the traditional score range of 10% per letter grade.]

I retitled this post as “Teacher Effectiveness” after looking at the above two charts (18 and 19). These statistics provide a means of measuring teacher effectiveness; or at least ranking teacher effectiveness. To measure teacher effectiveness, the portion of students electing TMC or KJS on the test would also have to be included. 

[A class selecting mostly TMC is in a lower level of thinking classroom environment populated with passive pupils conditioned to mark an answer to every item. A class selecting mostly KJS is in a higher (all) level of thinking classroom environment populated with self-motivated, self-correcting high quality achievers who are mature enough to distinguish between what they have yet to learn and what they know and can do that can serve as the basis for further learning and instruction.]

Student development is as important as knowledge or skill. The CCSS movement promotes this idea too but without the simplicity of multiple-choice (in time and money).

These visualized statistical models of the real world have been found to have practical value in making predictions (a most expected mid-point on a range of possibilities). However, what we feed into these statistics determines the validity and usefulness of the results. The concrete reality that you got a score of 50% on a classroom test becomes transformed into an abstract prediction that, ± 1 SD, that score (and your next score on an equivalent test) just might have been anywhere between 30% and 70% on an equivalent standardized test. And further, using the SEM, the range may be reduced to between 45% and 55% (generalized from Table 18).

Test scores (and these first five reviewed statistics) are easily manipulated by the selection of questions on the test and how the test is scored. The traditional multiple-choice test (forced-choice test) is a game with a built-in handicap of over 20%. This manipulation of scores is so traditional (so hardened to change) that little thought is given to it with the exception of when elementary school students take their first multiple-choice tests.

Learning to lie is difficult for serious students; they know a best guess is not a reflection of their abilities. It is just sugar coating and a distraction from the ugly truth. Students with equal abilities, but receiving lower test scores, rightly feel cheated by their poor luck on test day. In time, these students just mark, finish the test, and then get back to their world where they do have some control.  Since there is no way of knowing if a right mark is a right answer or a lucky answer, there is no need to take the test seriously except for where their score falls in the class distribution (their rank).

[This practice is institutionalized when their class rank is provided in college admission documents.]

The traditional multiple-choice test (TMC) is fast, cheap, and marketed way beyond its valid ability to rank students IMHO. It is, as my students put it, Dumb testing. The statistics are not an accurate, honest and fair reflection of their individual abilities.

TMC IMHO drives students away from developing into self-motivated, self-correcting, high quality achievers.  Statistics will not change the outcome. There is a better (alternative) method of multiple-choice assessment, KJS, at no additional cost that will guide their development. An effective teacher motivates students to be ready to learn and to want to learn. 

A multiple-choice test can be used to permit students to report what they actually know, understand, and find useful as the basis for further learning and instruction. All that is required is an extraction of student judgment (something that is considered an essential part of almost all alternative and authentic assessments and soon the elaborate CCSS assessments). Please check out Smart testing: Knowledge and Judgment Scoring, partial credit Rasch model, and Confidence Based Assessment, for example. All three promote student development that yields high test scores, long term, and with a minimum of review.


Wednesday, April 24, 2013

Visual Education Statistics - Standard Error of Measurement Engine


                                                                 7
We have now gone from the real world of counting and test scores through three stages (average, standard deviation and test reliability) of calculating and relating averages. The next step is to return as best as these abstract statistics can to the real world. The problem is that they only see, represent, the real world as portions of the normal curve of error (Standard error and Standard error vs. Standard error of measurement). They no longer see individual test scores.

If a student could take the same test several times, the scores would form a distribution with a mean and standard deviation (SD). That mean would be a best estimate of the student’s true score on that test. The SD would indicate the range of expected measurements. Some 2/3 of the time the next test score is expected to fall within one SD of the mean.

[This makes the same sense as a person going to a baseball game to watch a batter who has averaged a hit 50% of the time in his last 20 games. That is a descriptive use of the statistic (0.500). If the person bets the batter will do the same in the current game; that is a predictive use of the same statistic (0.500). “Past performance is no guarantee of future performance.” The SD gives us a “ballpark” idea of what may happen. The following statistic promises a better idea.]

The standard error of measurement (SEM), statistic five in this series, estimates an on average student score range centered on the class mean. It is not the specific location of the next expected test score. It is not a range tailored to each student. It is tailored to the average student in the class.

The accuracy and precision of this estimate is important. The range of the SEM sets the limit on how much of an increase in test score is needed, from one year to the next, to be significant. The smaller the range, the finer the resolution.

Students, of course, do not retake the same test many times to generate the needed scores for averaging. The best the psychometricians can do is to estimate the SEM using the SD of student scores (2.07) and the test reliability (0.29) in Table 10.

SEM = SQRT(MSrow) * SQRT(1 – KR20)
SEM = SQRT(4.28)*SQRT(1 – 0.29)
SEM = 2.07 * SQRT(0.71)
SEM = 2.07 * 0.84 = 1.75 or 8.31%

A portion of the average, N – 1, student score Variance (MSrow in the far right column on Table 10 of 4.28 or the SD of 2.07) is used to estimate the SEM.  The portion is determined by the test reliability, KR20 (0.29). The SEM for the Nurse124 data (1.75) can also be expressed as 8.31% (Table 10).
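The same calculation can be written as a small Python function. This is a sketch using the rounded values quoted above (SD 2.07, KR20 0.29, 21 items); the function name is an assumption, not something taken from the SEMEngine.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - test reliability)."""
    return sd * math.sqrt(1 - reliability)


N_ITEMS = 21  # items on the test in Table 10

sem = standard_error_of_measurement(2.07, 0.29)
sem_percent = 100 * sem / N_ITEMS

print(round(sem, 2), round(sem_percent, 2))
```

With these rounded inputs the SEM comes out at about 1.74 counts, or 8.31% of the 21 items; the 1.75 above reflects rounding the square root to 0.84 before multiplying.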

The SD for the student score mean (0.799 * 100 = 79.87%) was 9.85% (2.07/21). With a test reliability of only 0.29, the SEM is little better (smaller) than the score SD (SD 9.85% and SEM 8.31%).

Charts 15 and 16 show what the above actually looks like in the normal world.  Chart 15 is for a set of data (Nursing124) that is just below the boundary of the 5% level of significance by the ANOVA F test. The F test was 1.31 and the critical value was 1.62. The SEM curve (1.75 or 8.31%) is close to the normal SD of the average test score (2.07 or 9.85%). The SEM is only a 15.63% reduction from the SD.

Chart 16 is for a set of data (Cantrell) that is well above the boundary of the 5% level of significance by the ANOVA F test. The F test was 3.28 with a critical value of 2.04. The SEM curve (1.21 or 8.68%) is much narrower than the normal SD of the average test score (2.55 or 18.19%). The SEM is a 52.55% reduction from the SD.
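The two reductions quoted for Charts 15 and 16 follow from one line of arithmetic, sketched here in Python. The function name is illustrative, and the inputs are the SD and SEM values quoted in this post.

```python
def sem_reduction_pct(sd, sem):
    """How much narrower the SEM is than the SD, as a percent of the SD."""
    return 100 * (1 - sem / sd)


nursing_reduction = sem_reduction_pct(9.85, 8.31)   # Nursing124, in percent units
cantrell_reduction = sem_reduction_pct(2.55, 1.21)  # Cantrell, in counts

print(round(nursing_reduction, 2), round(cantrell_reduction, 2))
```

Because the SEM is the SD multiplied by SQRT(1 – KR20), the reduction depends only on the test reliability, so the same function works whether the inputs are counts or percents.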

This makes sense. It follows that the higher the test reliability, the lower (the shorter the range of) the SEM on a normal scale. Do these statistics really mean this? Most psychometricians believe they do.

 [Descriptive statistics are used in the classroom on each test. Rarely are specific predictions made. Standardized tests are marketed by their test reliability and SEM. This is the same change IMHO as changing from amateur to professional in sports. It is no longer how you play the game and having fun but winning. Every possible observation is subject to examination.]

I again made use of the process of deleting and restoring one item at a time to take a peek at how these statistics interact. [SEMEngine, Table 10, is hosted free at http://www.nine-patch.com/download/SEMEngine.xlsm and .xls.]

The SEM (red) is a much more stable statistic than the test reliability (blue dot) across a range of student scores from 55% to 95% (Chart 17). Two scales are involved: a ratio scale of 0 to 1 and a normal scale of right counts. The lowest trace (a ratio) on Chart 17 is inverted (second trace) and then multiplied by the top trace (SDrow in counts) to yield the SEM in counts.

Even more striking than the stability of the SEM (red) are the parallel traces of the student score standard deviation (SDrow, green dot) and the test reliability (KR20, blue dot). This makes sense. When the student scores spread out, the Variance (MSrow) also increases, which increases the test reliability (KR20). I was surprised to see the two so tightly related.

Chart 17 also includes the SQRT(1-KR20) (blue triangle). This inverts the KR20 (blue dot). The stable SEM (red) then results from multiplying this inverted value by the student score SDrow (green dot). This makes sense. Multiplying a number by its reciprocal yields one; but in this case, a two-step process includes two closely related numbers.
 
[In designing the forerunners of PUP, I discarded stable statistics as IMHO they seemed to be of little descriptive value in the classroom. That is not true for standardized tests where the goal is to use the shortest test possible composed of discriminating items (no mastery or unfinished items).]

The SEM engine now contains the first five of the six statistics commonly used in education. In the next post I will explore the relationship between the SEM of individual student scores and the SE of the mean of the class score, the average test score. These have little meaning in the classroom but are IMHO very important in understanding standardized testing.

[To use the Test Reliability and Standard Error of Measurement Engine for other combinations than a 22 by 21 table requires adjusting the central cell field and the values of N for student and item. Then drag active cells over any new similar cells when you enlarge the cell field. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.

To reduce the cell field, use “Clear Contents” on the excess columns and rows on the right and lower sides of the cell field. Include the six cells that calculate SS that are below items and to the right of student scores. Then manually reset the number of students and items. You may need additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.]

A password can be used to prevent unwanted changes in the SEMEngine.


Wednesday, April 17, 2013

Visual Education Statistics - Test Reliability Engine


                                                                      6

I used Table 8 (test reliability) as the foundation for the test reliability engine (Table 9).  The whole point of doing so was to provide a means of seeing the interactions when marks (Item scores of 1 and 0) are changed in a row or a column.

I removed the six leftmost columns from Table 8 as they are not needed after verifying the ANOVA table data in the previous post. The ANOVA Between Row and Count values (yellow) are converted from the normal Between Row and Count values.

The first thing I noticed was that rounding errors are no longer a problem with everything on one Excel worksheet. The results on Table 9 have been edited into prior posts.

Table 9 consists of the mark scores (1’s and 0’s) in a central cell field (22 students by 21 items). With the exception of the conversion from normal values to ANOVA values based on the Grand Mean (0.799), all other values are the same as on Table 8.

Test reliability is calculated with the KR20 and Cronbach’s alpha (0.29) as shown on Table 6. Table 9 contains an explained ANOVA table for between rows (student scores).

The second thing I learned was that sorting 1’s and 0’s in item columns so that all 1’s were at the top of the column and all 0’s were at the bottom produced a marked change in test reliability. This did not change item difficulty.

Any item with all 1’s in one group and all 0’s in another is set for maximum discrimination. Increasing discrimination increases test reliability because increasing discrimination increases the variation within student scores.
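The sorting effect can also be tried outside Excel. The following is a minimal Python sketch, not the TREngine itself: the 22 by 21 marks are randomly generated stand-ins rather than the Table 9 data, and `kr20` is a hypothetical helper that uses divide-by-N variances.

```python
import numpy as np

def kr20(marks):
    """KR20 for a students-by-items array of 0/1 marks (divide-by-N variances)."""
    marks = np.asarray(marks, dtype=float)
    n_items = marks.shape[1]
    p = marks.mean(axis=0)                 # item difficulty: proportion marked right
    ms_within_items = np.sum(p * (1 - p))  # sum of the item column variances
    ms_scores = marks.sum(axis=1).var()    # variance of the student scores
    return n_items / (n_items - 1) * (1 - ms_within_items / ms_scores)

# Hypothetical marks: 22 students by 21 items, about 80% right (not Table 9 data).
rng = np.random.default_rng(0)
marks = (rng.random((22, 21)) < 0.8).astype(int)

# Put all 1's at the top and all 0's at the bottom of every item column.
sorted_marks = -np.sort(-marks, axis=0)

difficulty_before = marks.mean(axis=0)
difficulty_after = sorted_marks.mean(axis=0)
reliability_before = kr20(marks)
reliability_after = kr20(sorted_marks)

print("difficulty unchanged:", np.allclose(difficulty_before, difficulty_after))
print("reliability before and after sorting:", reliability_before, reliability_after)
```

Sorting leaves every column total alone, so item difficulty cannot change. Only the spread of the student scores, and with it the reliability estimate, moves.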

This makes sense. A test that accurately groups those who know and those who do not know is more reliable than one in which the marks scored 1 and 0 are mixed in a Guttman table.

Download TREngine for MAC and PC: TREngine.xls or TREngine.xlsm and save, or run in your browser. (When it does not work, some helpful information is frequently offered by the operating system.)

Deleting an item and replacing it to find which items contribute the most, or the least, to test reliability has been automated. Select the item number (ITEM #) in the bottom row of Table 9. Then click the Toggle button for your results. Click the Toggle button again to restore the item before selecting another item.

A scatter chart from all 21 single item deletions indicates that difficulty is not the primary factor in test reliability. Deleting the two most negative discriminating items increased test reliability the most. Deleting the most discriminating item decreased test reliability the most. The Spearman-Brown prediction formula estimated that a test reliability of 0.28 would be expected, after decreasing the number of items from 21 to 20, when doing the deletions.  The test reliability for all 21 items was 0.29.
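The Spearman-Brown figure quoted above can be checked with a few lines of Python. This is a sketch of the standard prediction formula; the function name is hypothetical, and it is not a copy of the routine in PUP.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when the test length is multiplied by length_ratio."""
    return length_ratio * reliability / (1 + (length_ratio - 1) * reliability)

# Shortening the test from 21 items to 20 items.
predicted = spearman_brown(0.29, 20 / 21)
print(f"{predicted:.2f}")  # 0.28
```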

The third thing I learned was that a 22 by 21 matrix is very unstable. I could only detect this with all four of the discussed statistics on one active Excel sheet. Changing a single mark from right to wrong, or wrong to right, in each of more than 25 cells moved test reliability from 0.29 to as low as 0.21 and as high as 0.36. Cells around the edge of the cell field seemed to be the most sensitive. This range in sensitivity suggests there is more information in this matrix than is harvested as variation by the Mean SS or Variance. Winsteps harvests unexpectedness from the matrix.
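A rough way to explore this instability without Excel is to flip every mark in turn and record the reliability each time. The sketch below assumes randomly generated marks (not the Table 9 data), so the numbers it prints will differ from the 0.21 to 0.36 range reported here. The names `kr20` and `single_flip_range` are hypothetical.

```python
import numpy as np

def kr20(marks):
    """KR20 for a students-by-items array of 0/1 marks (divide-by-N variances)."""
    marks = np.asarray(marks, dtype=float)
    n_items = marks.shape[1]
    p = marks.mean(axis=0)
    return n_items / (n_items - 1) * (1 - np.sum(p * (1 - p)) / marks.sum(axis=1).var())

def single_flip_range(marks):
    """Lowest and highest KR20 seen when each mark is flipped, one at a time."""
    work = np.array(marks, dtype=int)      # work on a copy of the marks
    results = []
    for row in range(work.shape[0]):
        for col in range(work.shape[1]):
            work[row, col] ^= 1            # flip right <-> wrong
            results.append(kr20(work))
            work[row, col] ^= 1            # restore the mark
    return min(results), max(results)

# Hypothetical marks: 22 students by 21 items, about 80% right.
rng = np.random.default_rng(0)
marks = (rng.random((22, 21)) < 0.8).astype(int)

baseline = kr20(marks)
low, high = single_flip_range(marks)
print("baseline:", baseline, "lowest:", low, "highest:", high)
```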

Table 9 combines four education statistics (count, average, standard deviation, and test reliability). It clearly shows that the more items on the test (the more Variance summed) and the more discriminating the items, the higher the test reliability. Table 9 also provides an easy way to explore ALL of the effects of changing an item or even a single mark. I could not have finished the last post without using it. Understanding is having relationships in mind. Table 9 dynamically relates facts, which in the traditional case, are usually presented in isolation.

[Using the Test Reliability Engine for combinations other than a 22 by 21 table requires adjusting the central cell field and the values of N for students and items. When you enlarge the cell field, drag the active cells over any new similar cells. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.

To reduce the cell field, use “Clear Contents” on the excess columns and rows on the right and lower sides of the cell field. Include the six cells that calculate SS that are below items and to the right of student scores. Then manually reset the number of students and items. You may need additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.]

A password is used to prevent unwanted changes. The password is “PUP522”.

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, April 10, 2013

Visual Education Statistics - Test Reliability


                                                               5

Statistic Four: An estimate of test reliability or reproducibility helps tune a test for a desired standard by replacing, adding, or removing items. This is helpful in the classroom. It is critical in the marketing of standardized tests. No one wants to buy an unreliable test. Time and money require the shortest test possible to meet desired standards.

There is no true standard available on which to base test reliability. The best that we can do is to use the current test score and its standard deviation (SD). The test score is as real as any athletic event score, or weather or stock market report. “Past performance is no guarantee of future performance.” The SD captures the score distribution in the form of the normal curve of error, as described in previous posts.

A Guttman table (Table 6) shows two ways to calculate the Mean Sum of Squares (MS) or Variance within item columns (2.96). The first uses the Mean SS as discussed in prior posts. [Mean Sum of Squares = Mean SS = Mean Square = MS = Variance] The second uses probabilities based on the difficulty of each item. The results are identical, 2.96 for large data sets (N) and 3.10 for classroom sized data sets (N – 1).
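The agreement between the two methods can be illustrated with a short Python sketch. The marks are randomly generated for illustration (an assumption, not the Table 6 data); only the 2.96 value and the 22 students come from the post.

```python
import numpy as np

# Hypothetical marks: 22 students by 21 items, about 80% right.
rng = np.random.default_rng(1)
marks = (rng.random((22, 21)) < 0.8).astype(int)
n_students = marks.shape[0]

# Method 1: mean sum of squares (Mean SS) within each item column.
deviations = marks - marks.mean(axis=0)
ms_from_ss = np.sum(deviations ** 2, axis=0) / n_students

# Method 2: probabilities, p * q, from the difficulty of each item.
p = marks.mean(axis=0)
ms_from_pq = p * (1 - p)

methods_agree = np.allclose(ms_from_ss, ms_from_pq)
print("methods agree:", methods_agree)

# Classroom-sized correction: rescale the N-based value by N / (N - 1).
ms_n = 2.96                      # value reported for Table 6
ms_n_minus_1 = ms_n * 22 / 21
print(f"{ms_n_minus_1:.2f}")     # 3.10
```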

The KR20 and Cronbach’s alpha are then calculated using the ratio of the within item columns MS (2.96) to the student score row MS (4.08). [(21/20)*(1-(2.96/4.08)) = 0.29] A test reliability of only 0.29 is very low.
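The bracketed calculation can be reproduced directly. The variable names below are hypothetical; the three numbers are the ones given above.

```python
n_items = 21
ms_within_items = 2.96   # MS within item columns
ms_scores = 4.08         # MS of the student score row

kr20 = n_items / (n_items - 1) * (1 - ms_within_items / ms_scores)
print(f"{kr20:.2f}")     # 0.29
```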

The mean square within item columns MS (MSwic) must be low relative to the student score row MS (MSrow) to obtain a high test reliability estimate.

But the more difficult an item is, the larger the contribution to the MSwic. The easiest item at 95% yields a Variance of 0.05. The most difficult item at 45% yields 0.25. To increase test reliability, the MSrow must increase, and the MSwic must decrease, in relation to one another.
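The two item variances quoted above follow from the p times (1 - p) rule for a right/wrong item. The sketch below uses a hypothetical helper name and adds a 50% item for comparison, since that is where this product reaches its largest value (0.25); below 50% the contribution shrinks again.

```python
def item_variance(proportion_right):
    """Variance of a right/wrong item: p * (1 - p)."""
    return proportion_right * (1 - proportion_right)

for difficulty in (0.95, 0.45, 0.50):
    print(f"{difficulty:.2f} -> {item_variance(difficulty):.4f}")
```

The 95% item gives 0.0475 (about 0.05) and the 45% item gives 0.2475 (about 0.25), matching the values in the text.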

The Unfinished and Discriminating items (Table 7) have similar difficulties: 73% and 71%. The test reliability increased from 0.29 to 0.47 when I deleted the eight (yellow) Unfinished items: 3, 7, 8, 9, 13, 17, 19, and 20 on Table 6. The MSwic fell 50% but MSrow fell only 36% to produce the increase in test reliability from 0.29 to 0.47. Getting rid of non-discriminating items helped.

A number of factors affect test reliability. Easy items (10, 12 and 21 in Table 6) yielded little to the Variance. We need easy items in the classroom to survey what students have mastered. Easy items are a waste of time and money on standardized tests designed only to rank students. Easy items do not spread out student scores. Easy items do little to support the student score MSrows.

This test only has 21 questions (Table 7). If the test had been 50 items long the estimated reliability would be 0.49, and with 100 items it would be 0.66. The test was too short using the current items. Doubling the length of this test (21 items to 42 items) by including a duplicate set of mark data increased the estimated test reliability from 0.29 to 0.65. MSwic doubled (twice as many items) but MSrow increased four times (the doubling of the score deviation was squared).

[There seems to be a discrepancy between the Spearman-Brown prediction formula in PUP 5.22 and the actual doubling of the length of this test with identical mark data on an Excel spreadsheet: going from 21 to 50 items raises the estimate from 0.29 to 0.49, while going from 21 to 42 items raises it from 0.29 to 0.65. That is, the smaller increase in items (21 rather than 29) produced the larger change in results (0.65 rather than 0.49).]

This test had five discriminating items (Table 7) yielding an estimated test reliability of 0.50, almost twice that for the entire test of 21 items. If a test of 50 such items were used, the estimated test reliability would be expected to be 0.91. This qualifies for a standardized test! (A dash is shown where calculations yield meaningless results in Table 7.)
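The projections in the last three paragraphs can be reproduced with the standard Spearman-Brown formula and the KR20 ratio. This is a sketch with hypothetical function names, not the PUP 5.22 code. One possible reading of the gap between 0.45 and 0.65 (an assumption, not something established in this post) is that the prediction formula treats added items as similar to the existing ones on average, whereas an exact duplicate agrees perfectly with its twin.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when the test length is multiplied by length_ratio."""
    return length_ratio * reliability / (1 + (length_ratio - 1) * reliability)

def kr20(n_items, ms_within_items, ms_scores):
    """KR20 from the within-item MS and the student score MS."""
    return n_items / (n_items - 1) * (1 - ms_within_items / ms_scores)

# Projections for the full 21-item test (reliability 0.29).
to_50_items = spearman_brown(0.29, 50 / 21)
to_100_items = spearman_brown(0.29, 100 / 21)
doubled_predicted = spearman_brown(0.29, 42 / 21)

# Doubling with duplicate marks: MSwic doubles, MSrow becomes four times larger.
doubled_duplicate = kr20(42, 2 * 2.96, 4 * 4.08)

# Five discriminating items (reliability 0.50) projected to 50 such items.
discriminating_50 = spearman_brown(0.50, 50 / 5)

for label, value in [
    ("50 items, predicted", to_50_items),               # 0.49
    ("100 items, predicted", to_100_items),             # 0.66
    ("42 items, predicted", doubled_predicted),         # 0.45
    ("42 items, duplicate marks", doubled_duplicate),   # 0.65
    ("50 discriminating items", discriminating_50),     # 0.91
]:
    print(f"{label}: {value:.2f}")
```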

Test reliability then increases with test length and with difficult items that are also discriminating. Marking a difficult item correctly has the same weight as marking an easy item correctly in determining test reliability (same MSrow, 4.08). An item has the same difficulty whether marked right by an able student or by a less than able student (same MScolumn, 9.58).

The forerunner of Power Up Plus (PUP) was originally compared to other test scoring software to verify that it was producing correct results. PUP also produces the same test reliability estimate as Winsteps: 0.29.
- - - - - - - - - - - - - - - - - - - - - 


I have included the following discussion of the analysis of variance (ANOVA) while I have test reliability in mind again. You can skip to the next post unless you have an interest in the details of test reliability that show some basic relationships between sums of squares (SS). Put another way, if I can solve the same problem in more than one way, I just might be right in interpreting the paper by Li and Wainer, 1998, Toward a Coherent View of Reliability in Test Theory.

The ANOVA (Hoyt, 1941) and Cronbach’s alpha (1951) produce identical test reliability results. The ANOVA, however, makes clear that an assumption must be made for this to happen (Li and Wainer, 1998). This assumption provides a view into the depths of psychometrics that I have little intention to explore. It seems that the KR20 (Kuder & Richardson, 1937) and alpha test reliability is not a point but a region. They underestimate test reliability. Their estimates fall at the lower boundary of the region. The MSwic of 2.96 may be an overestimate of error, resulting in a lower test reliability estimate (0.29).
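The equality of the two estimates can be checked numerically. The sketch below uses the textbook two-way form of Hoyt's ANOVA, in which the error term is the residual left after removing both the student and the item sums of squares. That layout is an assumption for illustration and is not necessarily how Table 8 is arranged. The marks are randomly generated, and the function names are hypothetical.

```python
import numpy as np

def cronbach_alpha(marks):
    """Cronbach's alpha (equals KR20 for 0/1 marks), divide-by-N variances."""
    n_items = marks.shape[1]
    item_vars = marks.var(axis=0)
    score_var = marks.sum(axis=1).var()
    return n_items / (n_items - 1) * (1 - item_vars.sum() / score_var)

def hoyt_reliability(marks):
    """Hoyt's ANOVA reliability: 1 - MS residual / MS between students."""
    n_students, n_items = marks.shape
    grand_mean = marks.mean()
    ss_total = np.sum((marks - grand_mean) ** 2)
    ss_students = n_items * np.sum((marks.mean(axis=1) - grand_mean) ** 2)
    ss_items = n_students * np.sum((marks.mean(axis=0) - grand_mean) ** 2)
    ss_residual = ss_total - ss_students - ss_items
    ms_students = ss_students / (n_students - 1)
    ms_residual = ss_residual / ((n_students - 1) * (n_items - 1))
    return 1 - ms_residual / ms_students

# Hypothetical marks: 22 students by 21 items, about 80% right.
rng = np.random.default_rng(2)
marks = (rng.random((22, 21)) < 0.8).astype(float)

alpha = cronbach_alpha(marks)
hoyt = hoyt_reliability(marks)
print(alpha, hoyt, np.isclose(alpha, hoyt))
```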

How much difference this really makes will have to wait until I get further into this study or until a more informed person can help out. If the difference is similar to that produced by the correction for small samples in the MSwic (2.96 to 3.10, 1/22, or about 5%) on Table 6, then it may have a practical effect and should not be ignored. This may become very important when we get to the next statistic, statistic five: Standard Error of Measurement. The SSwic is also labeled interaction, error, unexplained, rows within columns, scores by difficulties, and scores within difficulties.

The MSwic (Interactions) is assumed to be the error term in the ANOVA. This is a customary means of solving difficult statistical, engineering, and political problems: simplifying the problem by ignoring a variable that may have little effect. The ANOVA tables in Table 8 reflect my understanding from Li and Wainer, 1998. Some help would be appreciated here too.

I used the “ANOVA Calculation Using a Correction Factor” on the right side of Table 8 to verify the total SS, score SS, and error SS (74.28 = 4.28 + 70.00). The required SS error term for the KR20 (SSwic of 65.14) is then found at the bottom of Table 4 and at the bottom of Table 8 (Scores by Difficulties: 74.28 – 9.14 = 65.14).  The item column SScolumns is 9.14. The value 65.14 is then the common factor in the two methods that results in the same test reliability estimate.
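The arithmetic in this paragraph can be laid out as a quick check. The sketch assumes the SSwic is the summed item column variance multiplied by the 22 students, which is why dividing by 22 should return the MSwic of 2.96. The variable names are hypothetical.

```python
ss_total = 74.28
ss_scores = 4.28          # between student score rows
ss_within_rows = 70.00    # error SS in the correction factor calculation
ss_columns = 9.14         # between item columns
n_students = 22

ss_within_columns = ss_total - ss_columns             # SSwic
ms_within_columns = ss_within_columns / n_students    # MSwic used by the KR20

print(f"{ss_scores + ss_within_rows:.2f}")   # 74.28
print(f"{ss_within_columns:.2f}")            # 65.14
print(f"{ms_within_columns:.2f}")            # 2.96
```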

The SSs and MSs in yellow are based on a scale of 0 to 1 with a mean of the Grand Mean: 0.799. The SSs and MSs in white are based on a normal item count scale. The note indicates how to convert from one scale to the other. This makes a handy check on the correctness of setting up the Excel spreadsheet if you resize the central data field from 22 students by 21 items (also see the next post, Test Reliability Engine).

The F test is improved from 1.28 in the “Unexplained Student Score ANOVA Table” to 1.31 in the “Explained Student Score ANOVA Table.” Neither exceeds the critical value of 1.62. These answer mark data may result from luck on test day from many sources (student preparedness, selection of test items, testing environment, attitude, error in marking, chance, etc.). The ANOVA table confirms a test reliability of 0.29 is low. The descriptive statistics are valid for this test, but no predictions can be made.
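Both F ratios can be approximated from the sums of squares quoted earlier. The degrees of freedom below are assumptions chosen for illustration (21 for student scores, 440 within rows, 420 for scores by difficulties); they are not taken from Table 8, although they do reproduce the two reported values.

```python
n_students, n_items = 22, 21
ss_scores = 4.28            # between student score rows
ss_within_rows = 70.00      # error term, unexplained table
ss_within_columns = 65.14   # error term, explained table (SSwic)

# Assumed degrees of freedom (not confirmed against Table 8).
df_scores = n_students - 1                              # 21
df_within_rows = n_students * (n_items - 1)             # 440
df_within_columns = (n_students - 1) * (n_items - 1)    # 420

ms_scores = ss_scores / df_scores
f_unexplained = ms_scores / (ss_within_rows / df_within_rows)
f_explained = ms_scores / (ss_within_columns / df_within_columns)

print(f"{f_unexplained:.2f}")   # 1.28
print(f"{f_explained:.2f}")     # 1.31
```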

The SSwic Interactions (65.14) sums the variation in marks within each column [(=VAR.P(B5:B26) from B5 to V5) x 22 students]. The SSwir Interactions (70.00) sums the variation in marks within each row [(=VAR.P(B5:V5) from B5 to B26) x 21 items]. The cell Interactions, the total SS (74.28), sums the variation in the item scores (0 and 1) within the full Guttman table [=VAR.P(B5:V26) x 462 marks].
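The three VAR.P sums translate directly into NumPy, where `var` divides by N by default, as VAR.P does. The sketch below uses randomly generated marks (an assumption, not the Table 8 data) and checks that the total SS splits into a between part and a within part for both columns and rows.

```python
import numpy as np

# Hypothetical marks: 22 students by 21 items, about 80% right.
rng = np.random.default_rng(3)
marks = (rng.random((22, 21)) < 0.8).astype(float)
n_students, n_items = marks.shape

# VAR.P of each item column, summed, times the number of students.
ss_within_columns = marks.var(axis=0).sum() * n_students
# VAR.P of each student row, summed, times the number of items.
ss_within_rows = marks.var(axis=1).sum() * n_items
# VAR.P of the whole cell field times the number of marks.
ss_total = marks.var() * marks.size

# Between-column and between-row SS, measured from the Grand Mean.
grand_mean = marks.mean()
ss_columns = n_students * np.sum((marks.mean(axis=0) - grand_mean) ** 2)
ss_rows = n_items * np.sum((marks.mean(axis=1) - grand_mean) ** 2)

columns_add_up = np.isclose(ss_total, ss_columns + ss_within_columns)
rows_add_up = np.isclose(ss_total, ss_rows + ss_within_rows)
print(columns_add_up, rows_add_up)
```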

- - - - - - - - - - - - - - - - - - - - - 
