
**Statistic Four:** An estimate of **test reliability** or reproducibility helps tune a test for a desired standard by replacing, adding, or removing items. This is helpful in the classroom. It is critical in the marketing of standardized tests. No one wants to buy an unreliable test. Time and money require the shortest test possible that meets the desired standards.

There is no true standard available on which to base test reliability. The best that we can do is to use the current test score and its standard deviation (SD). The test score is as real as any athletic event score, weather report, or stock market report. “Past performance is no guarantee of future performance.” The SD captures the score distribution in the form of the normal curve of error, as described in previous posts.

A Guttman table (Table 6) shows two ways to calculate the Mean Sum of Squares (MS), or Variance, **within** item columns (2.96). The first uses the Mean SS as discussed in prior posts. [Mean Sum of Squares = Mean SS = Mean Square = MS = Variance] The second uses probabilities based on the difficulty of each item. The results are identical: 2.96 for large data sets (N) and 3.10 for classroom-sized data sets (N − 1).
The KR20 and Cronbach’s alpha are then calculated using the ratio of the within item columns MS (2.96) to the student score row MS (4.08). [(21/20) × (1 − (2.96/4.08)) = 0.29] A test reliability of only 0.29 is very low.
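As a check on the arithmetic, the KR20/alpha ratio can be sketched in a few lines of Python. This is only a minimal illustration of the formula, not the PUP implementation:

```python
import numpy as np

def kr20(marks):
    """KR20 / Cronbach's alpha for a students-by-items matrix of 0/1 marks."""
    marks = np.asarray(marks, dtype=float)
    k = marks.shape[1]                    # number of items
    item_var = marks.var(axis=0).sum()    # within item columns (2.96 in Table 6)
    score_var = marks.sum(axis=1).var()   # student score variance (4.08 in Table 6)
    return (k / (k - 1)) * (1 - item_var / score_var)
```

With the Table 6 values this reduces to (21/20) × (1 − 2.96/4.08) ≈ 0.29.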

The mean square within item columns (MSwic) must be low relative to the student score row MS (MSrow) to obtain a high test reliability estimate.

But the more difficult an item is, the larger its contribution to the MSwic. The easiest item, at 95%, yields a Variance of 0.05. The most difficult item, at 45%, yields 0.25. To increase test reliability, the MSrow must increase relative to the MSwic.
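For a binary item, this contribution is just p(1 − p), where p is the fraction marking the item correctly; a quick check of the two values cited:

```python
def item_variance(p):
    """Population variance of a 0/1 item answered correctly by fraction p."""
    return p * (1 - p)

print(round(item_variance(0.95), 4))  # 0.0475, rounds to the 0.05 cited
print(round(item_variance(0.45), 4))  # 0.2475, rounds to the 0.25 cited
```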

The Unfinished and Discriminating items (Table 7) have similar difficulties: 73% and 71%. The test reliability increased from 0.29 to 0.47 when I deleted the eight (yellow) Unfinished items: 3, 7, 8, 9, 13, 17, 19, and 20 on Table 6. The MSwic fell 50% but the MSrow fell only 36%, which produced the increase. Getting rid of non-discriminating items helped.
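Taking the quoted drops at face value, the KR20 ratio reproduces both reliability figures; a quick arithmetic check in Python:

```python
# Before: 21 items; after deleting the 8 unfinished items: 13 items
k1, ms_wic1, ms_row1 = 21, 2.96, 4.08
k2, ms_wic2, ms_row2 = 13, 2.96 * 0.50, 4.08 * (1 - 0.36)  # 50% and 36% drops

r1 = (k1 / (k1 - 1)) * (1 - ms_wic1 / ms_row1)
r2 = (k2 / (k2 - 1)) * (1 - ms_wic2 / ms_row2)
print(round(r1, 2), round(r2, 2))  # 0.29 0.47
```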

A number of factors affect test reliability. Easy items (10, 12, and 21 in Table 6) contributed little to the Variance. We need easy items in the classroom to survey what students have mastered, but they are a waste of time and money on standardized tests designed only to rank students. Easy items do not spread out student scores, so they do little to support the student score MSrow.

This test only has 21 questions (Table 7). If the test had been 50 items long, the estimated reliability would be 0.49; with 100 items it would be 0.66. The test was too short using the current items. Doubling the length of this test (21 items to 42 items) by including a duplicate set of mark data increased the estimated test reliability from 0.29 to 0.65. The MSwic doubled (twice as many items) but the MSrow increased four times (the doubled score deviations are squared).

[There seems to be a discrepancy between the Spearman-Brown prediction formula in PUP 5.22 and the actual doubling of the length of this test with identical mark data on an Excel spreadsheet (21 to 50 items yields 0.29 to 0.49, compared with 21 to 42 items yielding 0.29 to 0.65). That is, a smaller increase in items (21 versus 29) produced a larger change in results (0.65 versus 0.49).]

This test had five discriminating items (Table 7) yielding
an estimated test reliability of 0.50, almost twice that for the entire test of
21 items. If a test of 50 such items were used, the estimated test reliability
would be expected to be 0.91. This qualifies for a standardized test! (A dash
is shown where calculations yield meaningless results in Table 7.)
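These length-based estimates follow the Spearman-Brown prediction formula, r_k = k·r / (1 + (k − 1)·r), where k is the ratio of new test length to old; a minimal sketch reproducing the figures above:

```python
def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * r / (1 + (k - 1) * r)

print(round(spearman_brown(0.29, 50 / 21), 2))   # 0.49: 21 items -> 50
print(round(spearman_brown(0.29, 100 / 21), 2))  # 0.66: 21 items -> 100
print(round(spearman_brown(0.50, 50 / 5), 2))    # 0.91: 5 discriminating items -> 50
```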

Test reliability then increases with test length and with difficult items that are also discriminating. Marking a **difficult** item correctly has the same weight as marking an **easy** item correctly in determining test reliability (same MSrow, 4.08). An item has the same difficulty whether marked right by an able student or by a less able student (same MScolumn, 9.58).

The forerunner of Power Up Plus (PUP) was originally compared to other test
scoring software to verify that it was producing correct results. PUP also
produces the same test reliability estimate as Winsteps: 0.29.

- - - - -
- - - - - - - - - - - - - - - -

Free software to
help you and your students experience and understand the change from TMC to KJS
(tricycle to bicycle):

- - - - - - - -
- - - - - - - - - - - - -

I have included the following discussion of the analysis of variance (ANOVA) while I have test reliability in mind again. You can skip to the next post unless you are interested in the details of test reliability that show some basic relationships between sums of squares (SS). Put another way: if I can solve the same problem in more than one way, I just might be right in interpreting Li and Wainer, 1998, “Toward a Coherent View of Reliability in Test Theory.”

The ANOVA (Hoyt, 1941) and Cronbach’s alpha (1951) produce identical test reliability results. The ANOVA, however, makes clear that an assumption must be made for this to happen (Li and Wainer, 1998). This assumption provides a view into the depths of psychometrics that I have little intention of exploring. It seems that the KR20 (Kuder & Richardson, 1937) and alpha test reliability estimates are not a point but a region. They underestimate test reliability; their estimates fall at the lower boundary of the region. The MSwic of 2.96 may be an overestimate of error, resulting in a lower test reliability estimate (0.29).

How much difference this really makes will have to wait until I get further into this study or until a more informed person can help out. If the difference is similar to that produced by the correction for small samples in the MSwic (2.96 to 3.10, a factor of 22/21, or about 5%) in Table 6, then it may have a practical effect and should not be ignored. This may become very important when we get to the next statistic, Statistic Five: Standard Error of Measurement. The SSwic is also labeled interaction, error, unexplained, rows within columns, scores by difficulties, and scores within difficulties.

The MSwic (Interactions) is assumed to be the error term in the ANOVA. This uses a customary means of solving difficult statistical, engineering, and political problems: simplifying the problem by ignoring a variable that may have little effect. The ANOVA tables in Table 8 reflect my understanding of Li and Wainer, 1998. Some help would be appreciated here too.

I used the “ANOVA Calculation Using a Correction Factor” on
the right side of Table 8 to verify the total SS, score SS, and error SS (74.28
= 4.28 + 70.00). The required SS error term for the KR20 (SSwic of 65.14) is
then found at the bottom of Table 4 and at the bottom of Table 8 (Scores by
Difficulties: 74.28 – 9.14 = 65.14).
The item column SScolumns is 9.14. The value 65.14 is then the common
factor in the two methods that results in the same test reliability estimate.
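The Table 8 bookkeeping can be verified with simple arithmetic (values as quoted above):

```python
ss_total, ss_scores, ss_error = 74.28, 4.28, 70.00  # total = scores + error
ss_columns = 9.14                                   # between item columns

assert round(ss_scores + ss_error, 2) == ss_total

# The KR20 error term: the scores-by-difficulties interaction
ss_wic = round(ss_total - ss_columns, 2)
print(ss_wic)  # 65.14
```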

The SSs and MSs in yellow are based on a scale of 0 to 1 with a grand mean of 0.799. The SSs and MSs in white are based on a normal item count scale. The note indicates how to convert from one scale to the other. This makes a handy check on the correct setup of the Excel spreadsheet if you resize the central data field from 22 students by 21 items (also see the next post, Test Reliability Engine).

The F test improves from 1.28 in the “Unexplained Student Score ANOVA Table” to 1.31 in the “Explained Student Score ANOVA Table.” Neither exceeds the critical value of 1.62. These answer mark data may reflect luck on test day from many sources (student preparedness, selection of test items, testing environment, attitude, marking error, chance, etc.). The ANOVA table confirms that a test reliability of 0.29 is low. The descriptive statistics are valid for this test, but no predictions can be made.

The SSwic Interactions (65.14) sums the variation in marks within each item column [=VAR.P(B5:B26) x 22 students, applied to each column from B to V]. The SSwir Interactions (70.00) sums the variation in marks within each student row [=VAR.P(B5:V5) x 21 items, applied to each row from 5 to 26]. The cell Interactions, the total SS (74.28), sums the variation in the item marks (0 and 1) within the full Guttman table [=VAR.P(B5:V26) x 462 marks].

