Wednesday, July 9, 2014

Small Sample Math Model - SEMs

                                                                      #8
The test standard error of measurement (SEM) can be calculated in two ways. The traditional way relates the variance between student scores to the variance within item difficulties; that is, it relates an external marginal column to the internal cell columns.

The second way harvests the variance conditioned on each student score and then averages the CSEM (SQRT(conditional student score error variance)) values over the test. The first method links two properties: student ability and item difficulty. The second uses only one property: student ability.

I set up a model with 12 students and 11 items (see previous post and Table26.xlsm below). Extreme values of zero and 100% were excluded. Four samples with average test scores of 5, 6, 7 (Table 29), and 8 were created with the standard deviation (1.83) and the variance within item difficulties (1.83) held constant. This allowed the SEM to vary between methods.

The calculation of the test SEM (1.36) by way of reliability (KR20) is reviewed on the top level of Chart 73. The test SEM remained the same for all four tests.

My first calculation of the test SEM by way of the conditional standard error of measurement (CSEM) began with the deviation of each mark from the student score (Table 29, center). I squared the deviations and summed them to get the conditional error variance for each score. The individual student CSEM is the square root of that conditional variance. The test SEM (1.48) is then the average of the student CSEM values.

[My second calculation was based on the binomial standard error of measurement given in Crocker, Linda, and James Algina, 1986, Introduction to Classical & Modern Test Theory, Wadsworth Group, pages 124-127.

By including the “correction for obtaining unbiased estimates of population variance”, n/(n – 1), the SEM value increased from 1.48 to 1.55 (Table 29). This is a perfect match to the binomial SEM.]

The two SEMs are then based on different sample sizes and different assumptions. The traditional SEM (1.36) is based on the raggedly distributed small sample size in hand. The binomial SEM (1.55) assumes a perfectly normally distributed large theoretical population.

[Variance calculations (variance is additive):

  • Test variance: Score deviations from the test mean (as counts), squared, and summed = a sum of squares (SS). SS/N = MSS or variance: 3.33. {Test SD = SQRT(Var) = 1.83. Test SEM = 1.36.}

  • Conditional error variance: Deviations from the student score (as a percent), squared, and summed = the conditional error variance (CVar) for that student score. {Test SEM = Average SQRT(CVar) = 1.48 (n) and 1.55 (n-1)}

  • Conditional error variance: Variance within the score row (Excel VAR.P or VAR.S) x n = the CVar for that student score. {Test SEM = 1.48 using VAR.P and 1.55 using VAR.S.}] Both routes are sketched in code below.
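Here is a minimal sketch of both routes for one score row, in Python rather than Excel (the 0/1 mark pattern is illustrative, not taken from Table 29):

    # Conditional error variance for one student score row, two ways.
    marks = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]            # a score of 7 right out of 11
    n = len(marks)
    p = sum(marks) / n                                    # the score as a proportion

    # Route 1: deviations of each mark from the score, squared and summed.
    cvar_n = sum((m - p) ** 2 for m in marks)             # n version
    cvar_n1 = cvar_n * n / (n - 1)                        # n - 1 (unbiased) version

    # Route 2: Excel-style row variance times the number of items.
    var_p = sum((m - p) ** 2 for m in marks) / n          # VAR.P of the row
    var_s = sum((m - p) ** 2 for m in marks) / (n - 1)    # VAR.S of the row
    assert abs(var_p * n - cvar_n) < 1e-9
    assert abs(var_s * n - cvar_n1) < 1e-9

    # The student CSEM is the square root of the conditional error variance.
    # Averaging these over all 12 students gives the 1.48 and 1.55 test SEMs in Table 29.
    print(round(cvar_n ** 0.5, 2), round(cvar_n1 ** 0.5, 2))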
Squaring values produces curved distributions (Chart 73). The curves represent the possible values. They do not represent the number of items or student scores having those values.

The True MSS = Total MSS – Error MSS = 3.33 – 1.83 = 1.50. The subtraction removes a concave distribution centered on the maximum value of 0.25 (not on the average item difficulty) from a convex distribution centered on the average test score.

The student score MSS is at a maximum when the item error SS is at a minimum. The error MSS is at a maximum (0.25) when the student score MSS is at a minimum (0.00). This makes sense. Such an item is perfectly aligned with the student score distribution at the point where a score does not differ from the average test score.

The KR20 is then the ratio of the True MSS to the Total MSS, adjusted by n/(n – 1): (1.50/3.33) x 11/10 = 0.50. [KR20 ranges from 0 to 1, from not reproducible to fully reproducible.] The test SEM is then a portion, SQRT(1 – KR20), of the SD of student scores [also 1.83 in this example, SQRT(3.33)], which works out to the 1.36 reported above.

I was able to hold the test SEM estimated by way of the KR20 at 1.36 for all four tests by setting the SD of student scores and the item error MSS to constant values, switching a 0 and 1 pair in a student mark pattern to make each adjustment. [The SD and the item error MSS do not have to be the same value.]

All possible individual student score binomial CSEM values for a test with 11 items are listed in Table 30. The CSEM is given as the SQRT(conditional variance). The conditional variance is (X * (n – X))/(n – 1), or n * p * q * (n/(n – 1)). There is then no need to administer a test to calculate a student score binomial CSEM value. There is a need to administer a test to find the test SEM. The test SEM (Table 29) is the average of these values, 1.55.
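A short sketch of the Table 30 calculation; only the number of items is needed, so the scores 0 through 11 require no test administration:

    # Binomial CSEM for every possible raw score X on an 11-item test (Table 30 style).
    n = 11
    for x in range(n + 1):                       # raw score X = 0 .. 11
        cvar = x * (n - x) / (n - 1)             # conditional error variance
        csem = cvar ** 0.5                       # conditional standard error of measurement
        print(x, round(csem, 2))
    # Note the symmetry: X and n - X give the same CSEM (for example, scores of 5 and 6).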

The student CSEM values, and thus the test SEM, are derived only from the student mark patterns (their scores). They differ from the test SEM values derived from the KR20 (Table 31). With the KR20-derived values held constant, the binomial CSEM-derived SEM decreased with higher test scores. This makes sense. There is less room for chance events. Precision increases with higher test scores.

Given a choice, a testing company would select the KR20 method using CTT analysis to report test SEM results.

[The tests with average scores of 5 right and 6 right produced the same SEM because scores of 5 right and 6 right fall an equal distance on either side of the middle score, 5.5. The binomial conditional variance, X * (n – X), is the same for both, since 5 and 6 (or 6 and 5) both add up to 11.]

I positioned the green curve on Chart 73 using the above information.

A CSEM value is independent of the average test score and the item difficulties. (Swapping paired 0s and 1s in student mark patterns to adjust the item error variance made no difference in the CSEM value.) The average of the CSEM values, the test SEM, does depend on the number of student scores at each value. If all scores are the same, the CSEMs and the SEM will be the same (Tables 30 and 31).

I hope at this stage to have a visual mathematical model that is robust enough to make meaningful comparisons with the Rasch IRT model. I would like to return to this model and do two things (or have someone volunteer to do it):

  1. Combine all the features that have been teased out, in Chart 72 and Chart 73, into one model.
  2. Animate the model in a meaningful way with change gages and history graphs.
Now to return to the Nursing data that represent the real classroom, filled with successful instruction, learning, and assessment.

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request. (Files hosted at nine-patch.com are also being relocated now that Nine-Patch Multiple-Choice, Inc has been dissolved.)

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.


Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, June 18, 2014

Small Sample Math Model - Item Discrimination

                                                                   #7 
The ability of an item to place students into two distinct groups is not a part of the mathematical model developed in the past few posts. Discrimination ability, however, provides insight into how the model works. A practical standardized test must have student scores spread out enough to assign desired rankings. Discriminating items produce this spread of student scores.

Current CCSS multiple-choice standardized test scoring only ranks; it does not tell us what a student actually knows that is useful and meaningful to the student as the basis for further learning and effective instruction. This can be done with Knowledge and Judgment Scoring and the partial credit Rasch IRT model using the very same tests. This post uses traditional scoring, as it simplifies the analysis (and the model) to just right and wrong; no judgment or higher levels of thinking are required of students.

I created a simple data set of 12 students and 11 items (Table 26) with an average score of 5. I then modified this set to produce average scores of 6, 7, and 8 (Table 27). [This can also be considered as the same test given to students in grades 5, 6, 7, and 8.]

The item error mean sum of squares (MSS), variance, for a test with an average score of 8 was 1.83. I then adjusted the MSS for the other three grades to match this value. A right and a wrong mark were exchanged in a student mark pattern (row) to make an adjustment (Table 27). I stopped with 1.85, 1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This forced the KR20 = 0.495 and SEM = 1.36 to remain the same for all four sets.)

The average item difficulty (Table 27) varied, as expected, with the average test score. The average item discrimination (Pearson r and PBR) (Table 28) was stable. In general, with a few outliers in this small data set, the most discriminating items had the same difficulty as the average test score. [This tendency for item discrimination to be maximized at the average test score is a basic component of the Rasch IRT model, which, by its design limits, must use the 50% point.]

The scatter chart, Chart 71, has sufficient detail to show that items tend to be most discriminating when they have a difficulty near the average test score (not just near 50%).

The question is often asked, “Do tests have to be designed for an average score of 50%?”  If the SD remains the same, I found no difference in the KR20 or SEM. [The observed SD is ignored by the Rasch IRT model used by many states for test analysis.]

The maximum item discrimination value of 0.64 was always associated with an item mark pattern in which all right marks and all wrong marks were in two groups with no mixing of right and wrong marks. I loaded a perfect Guttman mark pattern and found that 0.64 was the maximum corrected value for this size of data set. (The corrected values are better estimates than the uncorrected values in a small data set.)

Items of equal difficulty can have very different discrimination values. In Table 26, three items have a difficulty of 7 right marks. Their corrected discrimination values ranged from 0.34 to 0.58.

Psychometricians have solved the problem this creates in estimating test reliability by deleting an item and recalculating the test reliability to find the effect of any item in a test. The VESEngine (free download below) includes this feature: Test Reliability (TR) toggle button. Test reliability (KR20) and item discrimination (PBR) are interdependent on student and item performance. A change in one usually results in a change in one or more of the other factors. [Student ability and item difficulty are considered independent using the Rasch model IRT analysis.] {I have yet to determine if comparing CTT to IRT is a case of comparing apples to apples, apples to oranges or apples to cider.}

Two additions to the model (Chart 72) are the two distributions of the error MSS (black curve) and the portion of right and wrong marks (red curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at each end. Both are insensitive to the position of right marks in an item mark pattern. The average score for right and for wrong marks is sensitive to the mark pattern as the difference between these two values determines part of the item discrimination value; PBR = (Proportion * Difference in Average Scores)/SD.
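For reference, here is a sketch of the standard, uncorrected point-biserial calculation (the corrected values discussed above also remove the item being examined from the total score, which is omitted here for brevity; the small mark table is illustrative, not Table 26):

    # Uncorrected point-biserial (PBR) for one item against the total test scores.
    marks = [                                       # rows are students, columns are items
        [1, 1, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [1, 0, 1, 1, 1],
        [0, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
    ]
    scores = [sum(row) for row in marks]
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5   # population SD

    item = 0                                        # column index of the item to examine
    right = [s for row, s in zip(marks, scores) if row[item] == 1]
    wrong = [s for row, s in zip(marks, scores) if row[item] == 0]
    p = len(right) / len(scores)                    # item difficulty (proportion right)
    q = 1 - p
    diff = sum(right) / len(right) - sum(wrong) / len(wrong)   # difference in average scores
    pbr = diff * (p * q) ** 0.5 / sd                # proportion term x difference / SD
    print(round(pbr, 2))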

Traditional, classical test theory (CTT), test analysis can use a range of average test scores. In this example there was no difference in the analysis with average test scores of 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis transforms counts on the normal scale into logits that have only one reference point, 50% (zero logits), when student ability and item difficulty are positioned on one common scale. This point is then extended in either direction by values that represent equal student ability and item difficulty (50% right) from zero to 100% (-50% to +50%). This scale ignores the observed item discrimination.
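A minimal sketch of the logit transform itself (the percentages are illustrative):

    import math

    # Convert a percent-right value into logits and back.
    def to_logit(p):
        return math.log(p / (1 - p))        # 0.50 -> 0 logits; 0.80 -> about +1.39

    def to_percent(logit):
        return 1 / (1 + math.exp(-logit))   # inverse transform

    for pct in (0.25, 0.50, 0.75, 0.80):
        print(pct, round(to_logit(pct), 2))
    # In the Rasch model the probability of a right mark depends only on the difference
    # between the ability logit and the difficulty logit; a difference of zero means 50% right.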

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.


Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, May 7, 2014

Test Scoring Math Model - Precision

                                                               #6
The precision of the average test score can be obtained from the math model in two ways: directly from the mean sum of squares (MSS) or variance, and traditionally, by way of the test reliability (KR20).

I obtained the precision of each individual student test score from the math model by taking the square root of the sum of squared deviations (SS) within each score mark pattern (green, Table 25). The value is called the conditional standard error of measurement (CSEM) as it sums deviations for one student score (one condition), not for the total test.

I multiplied the mean sum of squares (MSS) by the number of items averaged (21) to yield the SS (0.154 x 21 = 3.24 for a 17-right-mark score), or I could have just added up the squared deviations. The SQRT(3.24) = 1.80 right marks for the CSEM. Some 2/3 of the time a re-tested score of 17 right marks can be expected to fall between 15.20 and 18.80 (15 and 19) right marks (Chart 70).

The test Standard Error of Measurement (SEM) is then the average of the 22 individual CSEM values (1.75 right marks or 8.31%).
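A sketch of this CSEM-then-average route, using an illustrative score list rather than the full set of 22 Nursing124 scores:

    # CSEM for each student score and the test SEM as the average of the CSEM values.
    n = 21                                            # items on the test
    scores = [17, 18, 15, 20, 16, 19, 17, 14, 18, 21, 16, 17]   # illustrative scores

    def csem(x, n):
        p = x / n                                     # score as a proportion
        mss = p * (1 - p)                             # variance within the score row
        return (mss * n) ** 0.5                       # SQRT(MSS x n) = SQRT(SS)

    csems = [csem(x, n) for x in scores]
    sem = sum(csems) / len(csems)                     # test SEM = average CSEM
    print(round(csem(17, n), 2), round(sem, 2))       # the CSEM for 17 right is about 1.80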

The traditional derivation of the test SEM (the error in the average test score) combines the test reliability (KR20) and the SD (spread) of the average test score.

The SD (2.07) is the square root of the n – 1 variance between student scores (4.28); the n version is SQRT(4.08) = 2.02. The test reliability (0.29) is the ratio of the true variance (MSS, 1.12) to the total variance (MSS, 4.08) between student scores (see previous post).

The expectation is that the greater the reliability of a test, the smaller the error in estimating the average test score. An equation is now needed to transform variance values on the top level of the math model to apply to the lower linear level.

SEM = SQRT(1 – KR20) * SD = SQRT(1 – 0.29) * 2.07 = SQRT(0.71) * 2.07 = 0.84 * 2.07 = 1.75 right marks.

The operation “1 – KR20” yields 0.71, the portion of the score variance that is error; its square root extracts the portion of the SD that represents the SEM. If the test reliability goes up, the error in estimating the average test score (SEM) goes down.

Chart 70 shows the variance (MSS), the SS, and the CSEM based on 21 items, for each student score. It also shows the distribution of the CSEM values that I averaged for the test SEM.

The individual CSEM is highest (largest error, poorer precision) when the student score is 50% (Charts 65 and 70). Higher student scores yield lower CSEM values (better precision). This makes sense.

The test SEM (the average of the CSEM values) is related to the distribution of student test scores (purple dash, Chart 70). Adding easy items (easy in the sense that the students were well prepared) decreases error, improves precision, reduces the SEM.

- - - - - - - - - - - - - - - - - - - - - 


The Best of the Blog - FREE
  • The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.
  • This blog started seven years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns is on a second level.
  • Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls. Quick Start

Wednesday, April 23, 2014

Test Scoring Math Model - Reliability

                                                                #5
An estimate of the reliability or reproducibility of a test can be extracted from the variation within the tabled right marks (Table 25). The variance from within the item columns is related to the variance from within the student score column.

The error variance within items (2.96) and the total variance (MSS) between student scores (4.08) are both obtained from columns in Table 25b (blue, Chart 68). The true variance is then 4.08 – 2.96 = 1.12.

The ratio of true variance to the total variance between scores (1.12/4.08) becomes an indicator of test reliability (0.28). This makes sense.

A test with perfect reliability (4.08/4.08 = 1.0) would have no variation, error variance = 0, within the item columns in Table 25. A test with no reliability (0.0/4.08) would show equal values (4.08) for within item columns, and between test scores.

The KR20 formula then adjusts the above value by n/(n – 1): 0.28 x 21/20 = 0.29 [the correction for obtaining an unbiased estimate of the population variance]. The KR20 ratio has no unit labels (“var/var” = “”). All of the above takes place on the upper (variance) level of the math model.

Doubling the number of students taking the test (Chart 69) has no effect on reliability. Doubling the number of items doubles the error variance but quadruples the total variance between scores. The test reliability increases from 0.29 to 0.64.
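A sketch of the KR20 arithmetic and of the doubling effect, using the rounded summary values quoted above and assuming the items are doubled by duplicating each column (each score deviation then doubles, so the score variance quadruples); with these rounded inputs the result lands at about 0.65 rather than the spreadsheet's 0.64:

    # KR20 from the summary variances, then the effect of doubling the items.
    def kr20(n_items, score_var, error_var):
        return (n_items / (n_items - 1)) * (score_var - error_var) / score_var

    n_items, score_var, error_var = 21, 4.08, 2.96
    print(round(kr20(n_items, score_var, error_var), 2))              # about 0.29

    # Duplicating every item doubles the summed error variance and
    # quadruples the variance between student scores.
    print(round(kr20(2 * n_items, 4 * score_var, 2 * error_var), 2))  # about 0.65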

The square root of the total variance between scores yields the standard deviation (SD) for the score distribution [2.02 for n, SQRT(4.08); 2.07 for n – 1, SQRT(4.28)] on the lower floor of the math model.

- - - - - - - - - - - - - - - - - - - - - 

The Best of the Blog - FREE
  • The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.
  • This blog started seven years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns is on a second level.
  • Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls. Quick Start

Wednesday, March 5, 2014

Test Scoring Math Model - Variance

                                                             #4
The first thing I noticed when inspecting the top of the test scoring math model (Table 25) was that the variation within the central cell field has a different reference point (external to the data) than the variation between scores in the marginal cell column (internal to the data). Also the variation within the central cell field (the variance) is harvested in two ways: within rows (scores) and within columns (items).

The mean sum of squared deviations (MSS), or variance, within a column or a row has a fixed range (Chart 64 and Chart 65). The maximum occurs when the marks are 1/2 right and 1/2 wrong (1/2 x 1/2 = 1/4 or 25%). [The variance also equals p * q; the sum of squares equals (Right * Wrong)/(Right + Wrong).] The contribution each mark makes to the variance is distributed along this gentle curve. The variable data are fit to a rigid model.
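A tiny sketch of that rigid curve (variance = p * q for any 0/1 mark pattern):

    # Variance of a 0/1 mark pattern as a function of the proportion of right marks.
    for right in range(0, 11):                  # 0 to 10 right marks out of 10
        p = right / 10
        q = 1 - p
        print(right, round(p * q, 3))           # peaks at 0.25 when p = q = 0.5
    # The sum of squares for the pattern is n * p * q, i.e. (Right * Wrong)/(Right + Wrong).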


I obtained the overall shape of these two variances by folding Chart 64 and Chart 65 into Photo 64-65.  The result is a dome or a depression above or below the upper floor of the model.

The peak of the dome (maximum variance) is reached when a student functioning at 50% marks an item with 50% difficulty. Standardized test makers try to maximize this feature of the model. The larger the mismatch between item difficulty and student ability, the lower the position of the variance on the dome. CAT (computer adaptive testing) attempts to adjust item difficulty to match student preparedness.

Chart 66 is a direct overhead view of the dome. Elevation lines have been added at 5% intervals from zero to 25%. I then fitted the data from Nursing124 to the roof of the model. The data only spread over one quadrant of the model. The data could completely cover the dome in an ideal situation in which every combination of score and difficulty occurred.

The total test variance within items is then the sum of the variance within all items (individual values from 0.04 to 0.25, summing to 2.96). The total test variance within scores is the sum of the variance of all scores (0.05 to 0.24, summing to 3.33). See Table 8.

The math model adjusts to fit the data in the marginal cell student score column (variance between scores). The reference point is not a static feature of the model but the average test score (16.77 or 80%). The plot of the variance between scores can be attached to the right side of the math model (Chart 67).

The variance within columns and rows spreads across the static frame of the model. The model then adjusts to fit the variance between scores to the spread of the active variance within the rows.

I can see another interpretation of the model variance if the dome is inverted as a depression. Read as a flight instrument on a blimp, with pitch, roll, and yaw set by the three variances (within item, 2.96; within score, 3.31; and between scores, 4.10), the blimp would have its nose up, be rolled to the side, and have the rudder hard over.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, February 19, 2014

Test Scoring Math Model - Input

The mathematical model (Table 25) in the previous post relates all the parts of a traditional item analysis including the observed score distribution, test reproducibility, and the precision of a score. Factors that influence test scores can be detected and measured by the variation between and within selected columns and rows.

The model is only aware of variation within and between mark patterns (deviations from the mean). The variance (the sum of squared deviations from the mean divided by the number summed or the mean sum of squares or MSS) is the property of the data that relates the mark patterns to the normal distribution. This permits generating useful descriptive and predictive insights.

The deviation of each mark from the mean is obtained by subtracting the mean from the value of the mark (Table 25a). The squared deviation value is then elevated to the upper floor of the model (Step 1, Table 25b). [Un-squared deviations from the mean would add up to zero.]
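A one-step illustration of why the deviations must be squared before they are elevated (the mark pattern is illustrative):

    # Deviations from the mean sum to zero; squared deviations do not.
    marks = [1, 1, 1, 0, 1, 0, 1, 1]             # an illustrative item or score mark pattern
    mean = sum(marks) / len(marks)
    deviations = [m - mean for m in marks]
    print(round(sum(deviations), 10))            # 0.0
    print(sum(d ** 2 for d in deviations))       # the sum of squares fed to the upper floor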

[IF YOU ARE ONLY USING MULTIPLE-CHOICE TO RANK STUDENTS, YOU MAY WANT TO SKIP THE FOLLOWING DISCUSSION ON THE MEANING OF TEST SCORES WHEN USED TO GUIDE INSTRUCTION AND STUDENT DEVELOPMENT.]

The model’s operation gains meaning by relating the score and item mark distributions to a normal distribution. It compares observed data to what is expected from chance alone or as I like to call it, the know-nothing mean.

The expected know-nothing mean based on 0-wrong and 1-right with 4-option items (popular on standardized tests) is centered on 25%, 6 right out of 24 questions (Chart 62). This is from luck on test day alone (students only need to mark each item; they do not need to read the test) on a traditional multiple-choice test (TMC). The mean moves to 50% if student ability and item difficulty have equal value. It moves to 80% if students are functioning near the mastery level as seen in the Nursing124 data. The math model will adjust to fit these data.

The know-nothing mean, with Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM), is at 50% for a high quality student or 25% for a low quality student (same as TMC). Scoring is 0-wrong, 1-have yet to learn, and 2-right.  A high quality student accurately, honestly, and fairly reports what is trusted to be useful in further instruction and learning. There are few, if any, wrong marks. A low quality student performs the same on both methods of scoring by marking an answer on all items. Students adjust the test to fit their preparation.

The know-nothing mean for Knowledge Factor (KF) is above 75% (near the mastery level in the Nursing124 data, violet). KF weights knowledge and judgment as 1:3, rather than 1:1 (KJS) or 1:0 (TMC). High-risk examinees do not guess. Test takers are given the same opportunity as teachers and test makers to produce accurate, honest, and fair test scores.

The distribution of scores about the know-nothing mean is the same for TMC (green, Chart 63) and KJS (red, Chart 63). An unprepared student can expect, on average, a score of 25% on a TMC test with 4-option items. Some 2/3 of the time the score will fall within +/- 1 standard deviation of 25%. As a rule of thumb, the standard deviation (SD) on a classroom test tends to be about 10%. The best an unprepared student can hope for is a score over 35% (25 + 10) about 1/6 of the time ((1 - 2/3)/2).
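A quick check of those guessing figures, assuming pure chance on a 24-item, 4-option TMC test:

    # Expected score and spread for an unprepared student guessing on every item.
    n_items, p_chance = 24, 0.25
    mean_marks = n_items * p_chance                        # 6 right = 25%
    sd_marks = (n_items * p_chance * (1 - p_chance)) ** 0.5
    print(mean_marks, round(sd_marks, 2))                  # 6.0 and about 2.12 marks
    print(round(100 * sd_marks / n_items, 1))              # about 8.8% SD, near the 10% rule of thumb
    # About 1/6 of the time the guessed score lands more than one SD above the mean (over about 35%).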

The know-nothing mean (50%) for KJS and the PCRM is very different from TMC (25%) for low quality students. The observed operational mean at the mastery level (above 80%, violet) is nearly the same for high quality students electing either method of scoring. High quality students have the option of selecting items they can trust they can answer correctly. There are few to no wrong marks. [Totally unprepared high quality students could elect to not mark any item for a score of 50%.]

The mark patterns on the lower floor of the mathematical model have different meanings based on the scoring method. TMC delivers a score that only ranks the student’s performance on the test. KJS and the PCRM deliver an assessment of what a student knows or can do that can be trusted as the basis for further learning and instruction. Quantity (number right) and quality (portion marked that are right) are not linked. Any score below 50% indicates the student has not developed a sense of judgment needed to learn and report at higher levels of thinking.

The score and item mark patterns are fed into the upper floor of the mathematical model as the squared deviation from the mean (d^2). [A positive deviation of 3 and a negative deviation of 3 both yield a squared deviation of 9.] The next step is to make sense of (to visualize, to relate) the distributions of the variance (MSS) from columns and rows.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):




Wednesday, February 5, 2014

Test Scoring Mathematical Model

The seven statistics reviewed in previous posts need to be related to the underlying mathematics. Traditional multiple-choice (TMC) data analysis has been expressed entirely with charts and the Excel spreadsheet VESEngine. I will need a TMC math model to compare TMC with the Rasch model IRT that is the dominant method of data analysis for standardized tests.

A mathematical model contains the relationships and variables listed in the charts and tables. This post applies the advice in learning discussed in the previous post. It starts with the observed variables. The mathematical model then summarizes the relationships in the seven statistics.



The model contains two levels (Table 25). The first floor level contains the observed mark patterns. The second floor level contains the squared deviations from the score and item means; the variation in the mark patterns. The squared values are then averaged to produce the variance. [Variance = Mean sum of squares = MSS]

1. Count

The right marks are counted for each student and each item (question). TMC: 0-wrong, 1-right captures quantity only. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) capture quantity and quality: 0-wrong, 1-have yet to learn this, 2-right.
Hall JR Count = SUM(right marks) = 20   
Item 12 Count = SUM(right marks) = 21  

2. Mean (Average)

The sum is divided by the number of counts. (N students, 22 and n items, 21)
The SUM of scores / N = 16.77; 16.77/n = 0.80 = 80%
The SUM of items / n = 17.57; 17.57/N = 0.80 = 80%

3. Variance

The variation within any column or row is harvested as the deviation between the marks in a student (row) or item (column) mark pattern, or between student scores, with respect to the mean value. The squared deviations are summed and averaged as the variance on the top level of the mathematical model (Table 25).
Variance = SUM(Deviations^2)/(N or n) = SUM of Squares/(N or n) = Mean SS = MSS

4. Standard Deviation

The variation within a score, item, or probability distribution expressed as a value on the normal scale; the mean +/- 1 standard deviation (1 SD) includes about 2/3 of a normal, bell-shaped distribution.

SD = Square Root of Variance or MSS = SQRT(MSS) = SQRT(4.08) = 2.02

For small classroom tests the (N-1) SD = SQRT(4.28) = 2.07 marks

The variation in student scores and the distribution of student scores are now expressed on the same normal scale.

5. Test Reliability

The ratio of the true variance to the score variance estimates the test reliability: the Kuder-Richardson 20 (KR20). The score (marginal column) variance – the error (summed from within Item columns) variance = the true variance.

KR20 = ((score variance – error variance)/score variance) x (n/(n – 1))
KR20 = ((4.08 – 2.96)/4.08) x 21/20 = 0.29

This ratio is returned to the first floor of the model. An acceptable classroom test has a KR20 > 0.7. An acceptable standardized test has a KR20 >0.9.

6. Traditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall is the standard error of measurement (SEM). The traditional SEM is based on the average performance of your class: 16.77 +/- 1SD (+/- 2.07 marks).

SEM = SQRT(1-KR20) * SD = SQRT(1- 0.29) * 2.07 = +/-1.75 marks

On a test that is totally reliable (KR20 = 1), the SEM is zero. You can expect to get the same score on a retest.

7. Conditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall based on the rank of your test score alone (conditional on one score rank) is the conditional standard error of measurement (CSEM). The estimate is based (conditional) on your test score rather than on the average class test score.

CSEM = SQRT((Variance within your Score) * n, the number of questions) = SQRT(MSS * n) = SQRT(SS)
CSEM = SQRT(0.154 * 21) = SQRT(3.24) = 1.80 marks

The average of the CSEM values for all of your class (light green), 1.75, also yields the test SEM. This confirms the calculation above in 6. Traditional Standard Error of Measurement for the test.
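For anyone who wants to verify the whole two-level calculation outside of Excel, here is a compact sketch that computes the seven statistics from any small 0/1 mark table (the table below is illustrative, not the Table 25 data; the VESEngine does the same arithmetic in a spreadsheet):

    # The seven statistics from a students-by-items table of 0/1 marks.
    marks = [
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 1, 0],
        [1, 0, 1, 0, 0, 0, 0, 0],
    ]
    N, n = len(marks), len(marks[0])                     # students, items

    scores = [sum(row) for row in marks]                 # 1. count (by student)
    mean_score = sum(scores) / N                         # 2. mean

    score_var = sum((s - mean_score) ** 2 for s in scores) / N    # 3. variance (MSS)
    sd = (score_var * N / (N - 1)) ** 0.5                # 4. SD (N - 1 version)

    item_p = [sum(row[j] for row in marks) / N for j in range(n)]
    error_var = sum(p * (1 - p) for p in item_p)         # error variance summed within items
    kr20 = (n / (n - 1)) * (score_var - error_var) / score_var    # 5. test reliability
    sem = (1 - kr20) ** 0.5 * sd                         # 6. traditional SEM

    def csem(x):                                         # 7. conditional SEM for one score
        p = x / n
        return (p * (1 - p) * n) ** 0.5

    print(round(mean_score, 2), round(sd, 2), round(kr20, 2), round(sem, 2))
    print([round(csem(x), 2) for x in scores])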

This mathematical model (Table 25) separates the flat display in the VESEngine into two distinct levels. The lower floor is on a normal scale. The upper floor isolates the variation within the marking patterns on the lower floor. The resulting variance provides insight into the extent that the marking patterns could have occurred by luck on test day and into the performance of teachers, students, questions, and the test makers. Limited predictions can also be made.

Predictions are limited using traditional multiple-choice (TMC) as students have only two options: 0-wrong and 1-right. Quantity and quality are linked into a single ranking. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) separate quantity and quality: 0-wrong, 1-have yet to learn, and 2-right. Students are free to report what they know and can do accurately, honestly, and fairly.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):