Wednesday, October 8, 2014

Customizing Test Precision - Information Functions

                                                               11

(Continued from the prior two posts.)

The past two posts have established that there is little difference between classical test theory (CTT) and item response theory (IRT) with respect to test reliability and conditional standard error of measurement (CSEM) estimates (other than the change in scales). IRT is now the analysis of choice for standardized tests. The Rasch model IRT is the easiest to use and also works well with small data sets, including classroom tests. How two normal scales for student scores and item difficulties are combined onto one IRT logit scale is no longer a concern to me, other than that the same method must be used throughout the duration of an assessment program.

Table 33
What is new and different from CTT is an additional insight from the IRT data in Table 32c (information p*q values). I copied Table 32 into Table 33 with some editing. I colored the cells holding the maximum amount of information (0.25) yellow in Table 33c. This color was then carried back to Table 33a, Right and Wrong Marks. [Item Information is related to the marginal cells in Table 33a (as probabilities), and not to the central cell field (as mark counts).] The eleven item information functions (in columns) were re-tabled into Table 34 and graphed in Chart 75. [Adding the information in rows yields the student score CSEM in Table 33c.]

Table 34
Chart 75
The Nurse124 data yielded an average test score of 16.8 marks or 80%. This skewed the item information functions away from the 50% or zero logit difficulty point (Chart 75). The more difficult the item, the more information developed, from 0.49 to 1.87 for 95% right count to a maximum at 54% and 45% right count. [No item on the test had a difficulty of 50%.]

Table 35
Chart 76
The sum of information (59.96) by item difficulty level and student score level is tabled in Table 35 and plotted as the test information function in Chart 76. This test does not do a precise job of assessing student ability. The test was most precise (19.32) at the 16 right count/76% right location. [Location can be designated by measure (logit), input raw score (red) or output expected score (Table 33b).]

The item with an 18 right count/92% right difficulty (Table 35) did not contribute the most information individually but did as a group of three items (9.17).  The three highest scoring, easiest, items (counts of 19, 20, and 21) are just too easy for a standardized test but may be important survey items needed to verify knowledge and skills for this class of high performing students. None of these three items reached an information level maximum of 1/4. [It now becomes apparent how items can be selected to produce a desired test information function.]

The more information available is interpreted as greater precision or less error (smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q)) on Table 33c. p*q is at a maximum when p = q; when right = wrong: (RT x WG)/(RT + WG)^2 or (3 x 3)/36 = 1/4].
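A minimal sketch of the bracketed arithmetic above. The expected scores in `p_row` are hypothetical values chosen for illustration, not cells from Table 33b, and the function name is made up for this example.

```python
import math

def csem_from_information(p_values):
    """CSEM for one student score level: 1 / SQRT(SUM(p*q)) over the items."""
    information = sum(p * (1.0 - p) for p in p_values)
    return 1.0 / math.sqrt(information)

# Hypothetical expected scores (p) for one student on five items.
p_row = [0.5, 0.6, 0.7, 0.8, 0.9]
row_information = sum(p * (1.0 - p) for p in p_row)
print("sum of information:", round(row_information, 2))
print("CSEM:", round(csem_from_information(p_row), 2))

# p*q is at its maximum of 1/4 when right = wrong, e.g. 3 right and 3 wrong.
rt, wg = 3, 3
print("p*q at right = wrong:", (rt * wg) / (rt + wg) ** 2)
```

More information in the row gives a smaller CSEM, which is the sense in which more information means greater precision.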

Each item information function spans the range of student scores on the test (Chart 76). Each item information function measures student ability most precisely near the point that item difficulty and student ability match (50% right) along the IRT S-curve. [The more difficult an item, the more ability students must have to mark correctly 50% of the time. Student ability is the number correct on the S-curve. Item difficulty is the number wrong on the S-curve (see more at Rasch Model Audit).]   
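The peak described above can be sketched with the standard Rasch probability curve. The item difficulty of 1.0 logit and the list of abilities are assumptions for illustration only; they are not Nurse124 values.

```python
import math

def rasch_p(ability, difficulty):
    """Rasch probability of a right mark, both arguments in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Item information p*q at a given student ability."""
    p = rasch_p(ability, difficulty)
    return p * (1.0 - p)

difficulty = 1.0  # hypothetical item difficulty in logits
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(ability, round(item_information(ability, difficulty), 3))
```

The information climbs toward 0.25 as ability approaches the item difficulty and falls away symmetrically on either side.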

Extracting item information functions from a data table provides a powerful tool (a test information function) for psychometricians to customize a test (page 127, Maryland 2010). A test can be adjusted for maximum precision (minimum CSEM) at a desired cut point.

The bright side of this is that the concept of “information” (not applicable to CTT), and the ability to put student ability and item difficulty on one scale, gives psychometricians powerful tools. The dark side is that the form in which the test data is obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the NCLB era, as psychometrics has made marked improvements, the student mark data it is being supplied has remained in the casino arena: Mark an answer to each question (even if you cannot read or understand the question), do not guess, and hope for good luck on test day.

The concepts of information, item discrimination and CAT all demand values hovering about the 50% point for peak psychometric performance. Standardized testing has migrated away from letting students report what they know and can do to a lottery that compares their performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day. 

The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the lower the number of students who fall below that point on the score distribution but might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without even showing the actual test questions or the test blueprint.]

It does not have to be that way. Knowledge and Judgment Scoring (classroom friendly) and the partial credit Rasch model (that is included in the software states use) can both update traditional multiple-choice to the levels of thinking required by the common core state standards (CCSS) movement. We need an accurate, honest and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT. 


- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.



Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, September 10, 2014

Conditional Standard Error of Measurement - Precision

                                                              10    
(Continued from prior post.)

Table 32a contains two estimates (red) of the test standard error of measurement (SEM) that are in close agreement. One estimate, 1.75, is the average of the conditional standard errors of measurement (CSEM, green) for each student raw score. The traditional estimate, 1.74, uses the traditional test reliability, KR20. No problem here.

The third estimate of the test SEM in Table 32c is different. It is based on CSEM values expressed in logits (natural log, base e = 2.718) rather than on the normal scale. The values are also inverted in relation to the traditional values in Table 32 (Chart 74). There is a small but important difference. The IRT CSEM values are much more linear than the CTT CSEM values. Also, the center of this plot is the mean of the number of items (Chart 30, prior post), not the mean of the item difficulties or student scores. [Also, most of this chart was calculated, as most of these relationships do not require actual data to be charted. Only nine score levels came from the Nurse124 data.]

Chart 74 shows the binomial CSEM values for CTT (normal) and IRT (logit) values obtained by inverting the CTT values: “SEM(Rasch Measure in logits) = 1/(SEM(Raw Score)”, 2007. I then adjusted each of these so the corresponding curves, on the same scale, crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The extreme values for no right and all right were not included. CSEM values for extreme values go to zero or to infinity with the following result:

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm

Precision is then not a constant across the range of student scores for both methods of analysis. The test SEM of 0.64 logits is comparable to 1.74 counts on the normal scale.
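The inversion and the "apparent paradox" quoted above can be sketched as follows. The sketch assumes the raw-score error variance is taken as the sum of p*q over the items (the Rasch model variance), and the two rows of expected scores are hypothetical.

```python
import math

def raw_score_sem(p_values):
    """Raw-score standard error in counts: SQRT(SUM(p*q))."""
    return math.sqrt(sum(p * (1.0 - p) for p in p_values))

def logit_sem(p_values):
    """Standard error of the measure in logits: 1 / SEM(raw score)."""
    return 1.0 / raw_score_sem(p_values)

central_row = [0.5] * 5    # hypothetical student near 50% right
extreme_row = [0.99] * 5   # hypothetical student near 100% right

for name, row in (("central", central_row), ("extreme", extreme_row)):
    print(name, round(raw_score_sem(row), 2), round(logit_sem(row), 2))
```

Toward the extremes the raw-score error shrinks toward zero while the logit error grows without limit, which is why the two sets of curves in a chart of this kind bend in opposite directions.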

The estimate of precision, CSEM, serves three different purposes. For CTT and IRT it narrows down the range in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values estimates the test SEM as 1.75 counts out of a range of 21 items. This is less than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and useful.

For IRT analysis, the CSEM indicates the degree that the data fit the perfect Rasch model (3). A better fit also results in more believable and useful results.

“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation of an imagined error distribution representing the possible distribution of observed values around their “true” theoretical value. This precision is based on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the measures or estimates correspond to a reference standard outside the data, in this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm

Precision also has some very practical limitations when delivering tests by computer adaptive testing (CAT). Linacre, 2006, has prepared two very neat tables showing the number of items that must be on a test to obtain a desired degree of precision expressed in logits and in confidence limits. The closer the test “targets” an average score of 50%, the fewer items needed for a desired precision.
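The targeting effect can be sketched from the CSEM formula alone. This does not reproduce Linacre's tables; it simply assumes every item is answered with the same probability p of a right mark, so that SE = 1/SQRT(N * p * q) and N = 1/(SE^2 * p * q). The function name and the target of 0.5 logits are illustrative assumptions.

```python
import math

def items_needed(target_se_logits, p_right):
    """Items needed for a target SE (logits) if every item has success rate p."""
    pq = p_right * (1.0 - p_right)
    exact = 1.0 / (target_se_logits ** 2 * pq)
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round(exact, 9))

for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(p, items_needed(0.5, p))
```

Under these assumptions a test targeted at 50% needs the fewest items, and the count rises as the expected success rate moves away from 50%.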

The two top students, with the same score of 20, missed items with different difficulties. They both yield the same CSEM. The CSEM ignores the pattern of marks and the difficulty of items. A CSEM value obtained in this manner is related only to the raw score. Absolute values for the CSEM are sensitive to item difficulty (Table 23a and 23b).

The precision of a cut score has received increasing attention during the NCLB era. In part, court actions have made the work of psychometricians more transparent. The technical report for a standardized test can now exceed 100 pages. There has been a shift of emphasis from test SEM, to individual score CSEM, to IRT information as an explanation of test precision.

 “(Note that the test information function and the raw score error variance at a given level of proficiency [student  score], are analogous for the Rasch model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The conditional standard error of measurement is the inverse of the information function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.

I cannot end this without repeating that this discussion of precision is based on traditional multiple-choice (TMC) that only ranks students, a casino operation. Students are not given the opportunity to include their judgment of what they know or can do that is of value to themselves, and their teachers, in future learning and instruction, as is done with essays, problem solving, and projects. This is easily done with knowledge and judgment scoring (KJS) of multiple-choice tests.


(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm, is now available free by request.


Wednesday, August 13, 2014

Test Score Reliability - TMC and IRT

                                                                9
The main purpose of this post is to investigate the similarities between traditional multiple-choice (TMC), or classical test theory (CTT), and item response theory (IRT). The discussion is based on TMC and IRT as the math is simpler than when using knowledge and judgment scoring (KJS) and the IRT partial credit model (PCM). The difference is that TMC and IRT input marks at the lowest levels of thinking, resulting in a traditional ranking. KJS and PCM input the same marks at all levels of thinking, resulting in a ranking plus a quality indication of what a student actually knows and understands that is of value to that student (and teacher) in further instruction and learning.

I applied the instructions in the Winsteps Manual, page 576, for checking out the Winsteps reliability estimate computation, to the Nurse124 data used in the past several posts (22 students and 21 items). Table 32 is a busy table that is discussed in the next several posts. The two estimates for test reliability (0.29 and 0.28, orange), based on TMC and IRT, are identical within rounding error.

Table 32a shows the TMC test reliability estimated from the ratio of true variance to total variance. The total variance between scores, 4.08, minus the error variance within items, 2.95, yields the true variance, 1.13. The KR20 then completes the reliability calculation to yield 0.29 using normal values.

For an IRT estimate of test reliability, the values on a normal scale are converted to the logit scale (ln ratio w/r). In this case, the mean of the item difficulty logits, ln ratio w/r, was -1.62 (Table 32b). This value is subtracted from each item difficulty logit value to shift the mean of the item distribution to the zero logit point (Rasch Adjust, Table 32b). Winsteps then optimizes the fit of the data (blue) to the perfect Rasch Model. Now comparable student ability and item difficulty values are in register at the same locations on a single logit scale. The 50% point on the normal scale is now at the zero location for both student ability and item difficulty.
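A minimal sketch of the first step only: converting right and wrong counts to logits and centering the item distribution on zero by subtracting the mean logit. The right counts below are hypothetical (22 students assumed), and the iterative fitting that Winsteps performs afterward is not attempted here.

```python
import math

def difficulty_logit(right, wrong):
    """Item difficulty as ln(wrong / right); harder items get larger values."""
    return math.log(wrong / right)

n_students = 22
right_counts = [21, 20, 18, 16, 12, 10]   # hypothetical item right counts

logits = [difficulty_logit(r, n_students - r) for r in right_counts]
mean_logit = sum(logits) / len(logits)
centered = [x - mean_logit for x in logits]   # mean now sits at zero logits

for r, raw, adj in zip(right_counts, logits, centered):
    print(r, round(raw, 2), round(adj, 2))
```

An item with equal right and wrong counts sits at zero logits before centering; after centering, zero marks the average item difficulty.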

The probability for each right mark (expected score) in the central cells is the product of the respective marginal cells (blue) for item difficulty (Winsteps Table 13.1) and student ability (Winsteps Table 17.1). The sum of these probabilities (Table 32b, pink) is identical to the normal Score Mean (Table 32a, pink).

The “information” in each central cell, in Table 32c, was obtained by p*q or p * (1 - p) from Table 32b. Adding up the internal cells for each score yields the sum of information for that score.  

The next column shows the square root of the sum of information. This value inverted yields the conditional standard error of measurement (CSEM). The conditional variance (CVar) within each student ability measure is then obtained by reversing the equation for normal values in Table 32a: The CVar is obtained as the square of the CSEM instead of the CSEM being obtained as the square root of the CVar. The average of these values is the test model error variance (EV) in measures: 0.43.

The observed variance (OV) between measures is estimated in the exact same way as is done for normal scores: the variance between measures from Excel =VAR.P (0.61) or the square of the SD: 0.78 squared = 0.61.

The test reliability in measures, (OV – EV)/OV = (0.61 – 0.45)/0.61 = 0.28, is then obtained from the same equation used for normal values, (total variance – error variance)/total variance = (4.08 – 2.96)/4.08 = 0.29, in Table 32a. Normal and measure dimensions for the same value differ, but ratios do not, as a ratio has no dimension. TMC and IRT produced the same values for test reliability, as will KJS and the PCM.
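The two reliability ratios can be sketched as below. The inputs are the rounded figures quoted in this post, so the printed results may differ in the second decimal from the spreadsheet values. One detail worth noting: the plain ratio from the rounded normal values is about 0.27, and the standard KR20 formula multiplies it by k/(k – 1), here 21/20, which brings it to about 0.29.

```python
def reliability(observed_variance, error_variance):
    """True variance over observed variance: (OV - EV) / OV."""
    return (observed_variance - error_variance) / observed_variance

def kr20(n_items, total_variance, item_error_variance):
    """Standard KR20: k/(k - 1) * (1 - sum of item p*q / total score variance)."""
    return n_items / (n_items - 1) * (1.0 - item_error_variance / total_variance)

# Normal scale, rounded values from the text (21 items).
print("plain ratio:", round(reliability(4.08, 2.96), 2))
print("KR20:", round(kr20(21, 4.08, 2.96), 2))

# Logit scale, using the two rounded error variances that appear in the text.
print("measures, EV 0.43:", round(reliability(0.61, 0.43), 2))
print("measures, EV 0.45:", round(reliability(0.61, 0.45), 2))
```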

(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm, is now available free by request.


Wednesday, July 9, 2014

Small Sample Math Model - SEMs

                                                                      8
The test standard error of measurement (SEM) can be calculated in two ways: The traditional way is by relating the variance between student scores and within item difficulties; between an external column and the internal cell columns.

The second way harvests the variance conditioned on each student score and then sums the CSEM (SQRT(conditional student score error variance)) for the test. The first method links two properties: student ability and item difficulty. The second only uses one property: student ability.

I set up a model with 12 students and 11 items (see previous post and Table26.xlsm below). Extreme values of zero and 100% were excluded. Four samples with average test scores of 5, 6, 7 (Table 29), and 8 were created with the standard deviation (1.83) and the variance within item difficulties (1.83) held constant. This allowed the SEM to vary between methods.

The calculation of the test SEM (1.36) by way of reliability (KR20) is reviewed on the top level of Chart 73. The test SEM remained the same for all four tests.

My first calculation of the test SEM by way of conditional standard error of measurement (CSEM) began with the deviation of each mark from the student score (Table 29 center). I squared the deviations and summed them to get the conditional variance for each score. The individual student CSEM is the square root of the conditional variance (the conditional SD). The test SEM (1.48) is then the average of the student CSEM values.

[My second calculation was based on the binomial standard error of measurement given in Crocker, Linda, and James Algina, 1986, Introduction to Classical & Modern Test Theory, Wadsworth Group, pages 124-127.

By including the “correction for obtaining unbiased estimates of population variance”, n/(n – 1), the SEM value increased from 1.48 to 1.55 (Table 29). This is a perfect match to the binomial SEM.]

The two SEMs are then based on different sample sizes and different assumptions. The traditional SEM (1.36) is based on the raggedly distributed small sample size in hand. The binomial SEM (1.55) assumes a perfectly normally distributed large theoretical population.

[Variance calculations (variance is additive):

  • Test variance: Score deviations from the test mean (as counts), squared, and summed = a sum of squares (SS). SS/N = MSS or variance: 3.33. {Test SD = SQRT(Var) = 1.83. Test SEM = 1.36.}

  • Conditional error variance: Deviations from the student score (as a percent), squared, and summed = the conditional error variance (CVar) for that student score. {Test SEM = Average SQRT(CVar) = 1.48 (n) and 1.55 (n-1)}

  • Conditional error variance: Variance Within the Score row (Excel, VAR) x (n or n - 1) = the CVar for that student score. {Test SEM VAR.P = 1.48 and VAR.S = 1.55.}]

Squaring values produces curved distributions (Chart 73). The curves represent the possible values. They do not represent the number of items or student scores having those values.

The True MSS = Total MSS – Error MSS = 3.33 -1.83 = 1.50, involves subtracting a convex distribution centered on the average test score from a concave distribution centered on the maximum value of 0.25 (not on the average item difficulty).

The student score MSS is at a maximum when the item error SS is at a minimum. The error MSS is at a maximum (0.25) when the student score MSS is at a minimum (0.00). This makes sense. Such an item is perfectly aligned with the student score distribution at a point where there is no difference from the average test score.

The KR20 is then a ratio of the True MSS/Total MSS, 1.50/3.33 = 0.50. [KR20 ranges from 0 to 1, not reproducible to fully reproducible]. The test SEM is then a portion, SQRT(1 – KR20) of the SD [also 1.83 in this example, SQRT(3.33)] = SQRT(1 – 0.50) * 1.83 = 1.36.
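A sketch of the KR20 route using the rounded figures above (11 items, total MSS 3.33, error MSS 1.83). The plain ratio 1.50/3.33 is about 0.45; the standard KR20 formula multiplies it by k/(k – 1) = 11/10, which gives about 0.495, the value quoted in the item discrimination post. The SEM printed here depends on rounding and on whether the SD is computed with N or N – 1 (12 students assumed), so both are shown and neither is claimed to be the exact spreadsheet computation.

```python
import math

def kr20(n_items, total_variance, item_error_variance):
    """Standard KR20: k/(k - 1) * (1 - sum of item p*q / total score variance)."""
    return n_items / (n_items - 1) * (1.0 - item_error_variance / total_variance)

def sem_from_reliability(sd, reliability):
    """Test SEM as the unreliable portion of the SD: SD * SQRT(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

n_items, n_students = 11, 12
total_mss, error_mss = 3.33, 1.83

r = kr20(n_items, total_mss, error_mss)
sd_population = math.sqrt(total_mss)                                # SS / N
sd_sample = math.sqrt(total_mss * n_students / (n_students - 1))    # SS / (N - 1)

print("KR20:", round(r, 3))
print("SEM with N SD:", round(sem_from_reliability(sd_population, r), 2))
print("SEM with N - 1 SD:", round(sem_from_reliability(sd_sample, r), 2))
```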

I was able to set the test SEM estimates using KR20 all to 1.36 for all four tests by setting the SD of student scores and the item error MSS to constant values by switching a 0 and 1 pair in student mark patterns. [The SD and the item error MSS do not have to be the same values.]

All possible individual student score binomial CSEM values for a test with 11 items are listed in Table 30. The CSEM is given as the SQRT(conditional variance). The conditional variance is: (X * (n – X))/(n – 1) or n*(pq) * (n/(n - 1)). There is then no need to administer a test to calculate a student score binomial CSEM value. There is a need to administer a test to find the test SEM. The test SEM (Table 29) is the average of these values, 1.55.
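The binomial CSEM values for an 11-item test can be generated directly from the formula, with no test data. This sketch uses X for the right count and n for the number of items, as in the text, and checks that the two forms of the conditional variance agree.

```python
import math

def binomial_csem(x, n):
    """Binomial CSEM in counts: SQRT(X * (n - X) / (n - 1))."""
    return math.sqrt(x * (n - x) / (n - 1))

def binomial_csem_pq(x, n):
    """Same value written as SQRT(n * p * q * n / (n - 1))."""
    p = x / n
    q = 1.0 - p
    return math.sqrt(n * p * q * n / (n - 1))

n_items = 11
for x in range(1, n_items):   # extreme scores of 0 and 11 are excluded
    print(x, round(binomial_csem(x, n_items), 2))
```

Scores of 5 and 6 right give the same CSEM because X * (n – X) is the same for both, and the values shrink as the score moves toward either extreme.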

The student CSEM and thus the test SEM values are derived only from student mark patterns. They differ from the test SEM values derived from the KR20 (Table 31). With KR20 derived values held constant, the binomial CSEM derived values for SEM decreased with higher test scores. This makes sense. There is less room for chance events. Precision increases with higher test scores.

Given a choice, a testing company would select the KR20 method using CTT analysis to report test SEM results.

[The same SEM values for tests with 5 right and 6 right resulted from the fact that the median score was 5.5. The values for 5 right and 6 right fall an equal distance from the mean on either side. Therefore 5 and 6 or 6 and 5 both add up to 11.]

I positioned the green curve on Chart 73 using the above information.

A CSEM value is independent from the average test score and item difficulties. (Swapping paired 0s and 1s in student mark patterns to adjust the item error variance made no difference in the CSEM value.) The average of the CSEM values, the test SEM, is dependent on the number of items on the test with each value. If all scores are the same, the CSEMs and the SEM will be the same (Tables 30 and 31).

I hope at this stage to have a visual mathematical model that is robust enough to make meaningful comparisons with the Rasch IRT model. I would like to return to this model and do two things (or have someone volunteer to do it):

  1. Combine all the features that have been teased out, in Chart 72 and Chart 73, into one model.
  2. Animate the model in a meaningful way with change gages and history graphs.

Now to return to the Nursing data that represent the real classroom, filled with successful instruction, learning, and assessment.

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm, is now available free by request. (Files hosted at nine-patch.com are also being relocated now that Nine-Patch Multiple-Choice, Inc has been dissolved.)


Wednesday, June 18, 2014

Small Sample Math Model - Item Discrimination

                                                                   #7 
The ability of an item to place students into two distinct groups is not a part of the mathematical model developed in the past few posts. Discrimination ability, however, provides insight into how the model works. A practical standardized test must have student scores spread out enough to assign desired rankings. Discriminating items produce this spread of student scores.

Current CCSS multiple-choice standardized test scoring only ranks; it does not tell us what a student actually knows that is useful and meaningful to the student as the basis for further learning and effective instruction. This can be done with Knowledge and Judgment Scoring and the partial credit Rasch IRT model using the very same tests. This post uses traditional scoring as it simplifies the analysis (and the model) to just right and wrong; no judgment or higher levels of thinking are required of students.

I created a simple data set of 12 students and 11 items (Table 26) with an average score of 5. I then modified this set to produce average scores of 6, 7, and 8 (Table 27). [This can also be considered as the same test given to students in grades 5, 6, 7, and 8.]

The item error mean sum of squares (MSS), variance, for a test with an average score of 8 was 1.83. I then adjusted the MSS for the other three grades to match this value. A right and a wrong mark were exchanged in a student mark pattern (row) to make an adjustment (Table 27). I stopped with 1.85, 1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This forced the KR20 = 0.495 and SEM = 1.36 to remain the same for all four sets.)

The average item difficulty (Table 27) varied, as expected, with the average test score. The average item discrimination (Pearson r and PBR) (Table 28) was stable. In general, with a few outliers in this small data set, the most discriminating items had the same difficulty as the average test score. [This tendency for item discrimination to be maximized at the average test score is a basic component of the Rasch IRT model, which, by design, must use the 50% point.]

Scatter chart, Chart 71, has sufficient detail to show that items tend to be most discriminating when they have a difficulty near the average test score (not just near 50%).

The question is often asked, “Do tests have to be designed for an average score of 50%?”  If the SD remains the same, I found no difference in the KR20 or SEM. [The observed SD is ignored by the Rasch IRT model used by many states for test analysis.]

The maximum item discrimination value of 0.64 was always associated with an item mark pattern in which all right marks and all wrong marks were in two groups with no mixing of right and wrong marks. I loaded a perfect Guttman mark pattern and found that 0.64 was the maximum corrected value for this size of data set. (The corrected values are better estimates than the uncorrected values in a small data set.)

Items of equal difficulty can have very different discrimination values. In Table 26, three items have a difficulty of 7 right marks. Their corrected discrimination values were 0.34 and 0.58.

Psychometricians have solved the problem this creates in estimating test reliability by deleting an item and recalculating the test reliability to find the effect of any item in a test. The VESEngine (free download below) includes this feature: Test Reliability (TR) toggle button. Test reliability (KR20) and item discrimination (PBR) are interdependent on student and item performance. A change in one usually results in a change in one or more of the other factors. [Student ability and item difficulty are considered independent using the Rasch model IRT analysis.] {I have yet to determine if comparing CTT to IRT is a case of comparing apples to apples, apples to oranges or apples to cider.}

Two additions to the model (Chart 72) are the two distributions of the error MSS (black curve) and the portion of right and wrong marks (red curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at each end. Both are insensitive to the position of right marks in an item mark pattern. The average score for right and for wrong marks is sensitive to the mark pattern as the difference between these two values determines part of the item discrimination value; PBR = (Proportion * Difference in Average Scores)/SD.
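A sketch of the uncorrected point-biserial calculation. It uses the textbook form, in which the proportion term is SQRT(p * q) rather than p * q itself; that choice is an assumption of this sketch. The item marks and student scores are hypothetical, and the corrected value (with the item removed from each score) is not computed.

```python
import math
import statistics

def point_biserial(item_marks, scores):
    """PBR = (mean score of right markers - mean score of wrong markers) / SD * SQRT(p*q)."""
    right = [s for m, s in zip(item_marks, scores) if m == 1]
    wrong = [s for m, s in zip(item_marks, scores) if m == 0]
    p = len(right) / len(item_marks)
    q = 1.0 - p
    sd = statistics.pstdev(scores)
    difference = statistics.mean(right) - statistics.mean(wrong)
    return difference / sd * math.sqrt(p * q)

# Hypothetical data: one item's marks and the matching student test scores.
item_marks = [1, 1, 1, 0, 1, 0, 0, 0]
student_scores = [9, 8, 7, 6, 6, 4, 3, 2]
print(round(point_biserial(item_marks, student_scores), 2))
```

The proportion term depends only on how many marks are right, while the difference in average scores depends on which students made them, which is where the mark pattern enters the discrimination value.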

Traditional, classical test theory (CTT), test analysis can use a range of average test scores. In this example there was no difference in the analysis with average test scores of 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis transforms normal counts into logits that have only one reference point of 50% (zero logit) when student ability and item difficulty are positioned on one common scale. This point is then extended in either direction by values that represent equal student ability and item difficulty (50% right) from zero to 100% (-50% to +50%) using the Rasch model IRT. This scale ignores the observed item discrimination.

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.


Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, May 7, 2014

Test Scoring Math Model - Precision

                                                               #6
The precision of the average test score can be obtained from the math model in two ways: directly from the mean sum of squares (MSS) or variance, and traditionally, by way of the test reliability (KR20).

I obtained the precision of each individual student test score from the math model by taking the square root of the sum of squared deviations (SS) within each score mark pattern (green, Table 25). The value is called the conditional standard error of measurement (CSEM) as it sums deviations for one student score (one condition), not for the total test.

I multiplied the mean sum of squares (MSS) by the number of items averaged (21) to yield the SS (0.15 x 21 = 3.15 for a 17 right mark score; 3.24 before the MSS is rounded), or I could have just added up the squared deviations. The square root of the unrounded SS, SQRT(3.24) = 1.80 right marks, is the CSEM. Some 2/3 of the time a re-tested score of 17 right marks can be expected to fall between 15.20 and 18.80 (15 and 19) right marks (Chart 70).
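The arithmetic for the 17 right mark score can be retraced in a few lines of Python. This is a minimal sketch, assuming the MSS within a pattern of 0/1 marks equals p * q (proportion right times proportion wrong, the n form); the function name `csem` is my own label, not something taken from the VESEngine.

```python
from math import sqrt

def csem(right_marks: int, n_items: int) -> float:
    """CSEM in right-mark units for one student score.

    Assumption: the variance (MSS) within a 0/1 mark pattern is p * q.
    """
    p = right_marks / n_items      # proportion of right marks
    mss = p * (1 - p)              # mean sum of squares (variance)
    ss = mss * n_items             # sum of squared deviations
    return sqrt(ss)

error = csem(17, 21)
low, high = 17 - error, 17 + error  # roughly a 2/3 band around the score
print(f"CSEM = {error:.2f}, band = {low:.2f} to {high:.2f}")
```

With unrounded values this prints a CSEM of 1.80 and a band of 15.20 to 18.80, which agrees with the figures quoted above.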

The test Standard Error of Measurement (SEM) is then the average of the 22 individual CSEM values (1.75 right marks or 8.31%).

The traditional derivation of the test SEM (the error in the average test score) combines the test reliability (KR20) and the SD (spread) of the average test score.

The SD (2.07, the n-1 value) is from the SQRT(MSS, 4.08) between student scores. The test reliability (0.29) is the ratio of the true variance (MSS, 1.12) to the total variance (MSS, 4.08) between student scores (see previous post).

The expectation is that the greater the reliability of a test, the smaller the error in estimating the average test score. An equation is now needed to transform variance values on the top level of the math model to apply to the lower linear level.

SEM = SQRT(1 – KR20) * SD = SQRT(1 – 0.29) * 2.07 = SQRT(0.71) * 2.07 = 0.84 * 2.07 = 1.75 right marks.
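The same equation as a small function, for readers who want to try other values. This is only a sketch; the name `sem_from_reliability` is hypothetical, and the inputs are the rounded values quoted in this post.

```python
from math import sqrt

def sem_from_reliability(kr20: float, sd: float) -> float:
    """Test SEM from the test reliability (KR20) and the score SD."""
    return sqrt(1 - kr20) * sd

sem = sem_from_reliability(0.29, 2.07)
print(f"SEM = {sem:.2f} right marks")

# A more reliable test with the same SD yields a smaller SEM.
sem_more_reliable = sem_from_reliability(0.64, 2.07)
print(f"SEM at KR20 = 0.64: {sem_more_reliable:.2f} right marks")
```

With the rounded inputs the result comes out near 1.74; the 1.75 above presumably reflects unrounded spreadsheet values.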

The operation “1 – KR20” yields the error portion of the total variance (0.71). Its square root (0.84) is the portion of the SD that represents the SEM. If the test reliability goes up, the error in estimating the average test score (SEM) goes down.

Chart 70 shows the variance (MSS), the SS, and the CSEM based on 21 items, for each student score. It also shows the distribution of the CSEM values that I averaged for the test SEM.

The individual CSEM is highest (largest error, poorer precision) when the student score is 50% (Charts 65 and 70). Higher student scores yield lower CSEM values (better precision). This makes sense.
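The shape of the CSEM curve can be checked with a short sketch. It assumes, as an illustration only, that the variance within a 0/1 mark pattern is p * q, so that the CSEM for x right marks on k items is SQRT(k * p * q) = SQRT(x * (k - x) / k). The variable names are mine.

```python
from math import sqrt

N_ITEMS = 21

# CSEM (in right marks) for every possible score from 0 to 21.
csem_by_score = {
    x: sqrt(x * (N_ITEMS - x) / N_ITEMS) for x in range(N_ITEMS + 1)
}

peak = max(csem_by_score.values())
peak_scores = [x for x, v in csem_by_score.items() if abs(v - peak) < 1e-12]

for x, v in csem_by_score.items():
    print(f"{x:2d} right ({x / N_ITEMS:4.0%}): CSEM = {v:.2f}")
print("Largest CSEM at scores:", peak_scores)
```

With 21 items no score falls exactly on 50%, so the largest CSEM lands on the two scores on either side of it (10 and 11 right), and the value shrinks to zero at 0 and 21 right.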

The test SEM (the average of the CSEM values) is related to the distribution of student test scores (purple dash, Chart 70). Adding easy items (easy in the sense that the students were well prepared) decreases error, improves precision, reduces the SEM.


Wednesday, April 23, 2014

Test Scoring Math Model - Reliability

                                                                5
An estimate of the reliability or reproducibility of a test can be extracted from the variation within the tabled right marks (Table 25). The variance from within the item columns is related to the variance from within the student score column.

The error within items variance (2.96) and total variance (MSS) between student scores (4.08) are both obtained from columns in Table 25b (blue, Chart 68). The true variance is then 4.08 – 2.96 = 1.12.

The ratio of true variance to the total variance between scores (1.12/4.08) becomes an indicator of test reliability (0.28). This makes sense.

A test with perfect reliability (4.08/4.08 = 1.0) would have no variation, error variance = 0, within the item columns in Table 25. A test with no reliability (0.0/4.08) would show equal values (4.08) for within item columns, and between test scores.

The KR20 formula then adjusts the above value (0.28 x 21/20) to 0.29 [from a large population (n) to a small sample value (n-1)]. The KR20 ratio has no unit labels (“var/var” = “”). All of the above takes place on the upper (variance) level of the math model.
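The steps from the two tabled variances to the KR20 can be sketched as follows. The function name `kr20_from_variances` is hypothetical, and the inputs are the rounded values quoted above, not the full Table 25 data.

```python
def kr20_from_variances(n_items: int, error_variance: float,
                        total_variance: float) -> float:
    """KR20 from the within-item (error) variance and the total variance
    between student scores, with the n/(n-1) adjustment for items."""
    true_variance = total_variance - error_variance
    ratio = true_variance / total_variance
    return ratio * n_items / (n_items - 1)

unadjusted = (4.08 - 2.96) / 4.08
reliability = kr20_from_variances(21, 2.96, 4.08)
print(f"unadjusted ratio = {unadjusted:.3f}, KR20 = {reliability:.2f}")
```

From these rounded inputs the unadjusted ratio prints as 0.275 and the KR20 as 0.29; the 0.28 quoted above presumably comes from unrounded table values.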

Doubling the number of students taking the test (Chart 69) has no effect on reliability. Doubling the number of items doubles the error variance but increases the total variance by the square. The test reliability increases from 0.29 to 0.64.
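The effect of doubling the items can be traced with the variances already quoted. This is a rough sketch that takes the statement above at face value: the error variance doubles (x2) and the total variance rises by the square (x4). It does not reproduce the Chart 69 data, and the variable names are mine.

```python
error_variance = 2.96   # within items, 21 items
total_variance = 4.08   # between student scores, 21 items

# Assumption taken from the text: twice the items gives
# twice the error variance and four times the total variance.
error_doubled = 2 * error_variance
total_doubled = 4 * total_variance

ratio_doubled = (total_doubled - error_doubled) / total_doubled
kr20_doubled = ratio_doubled * 42 / 41   # n/(n-1) adjustment for 42 items

print(f"unadjusted ratio = {ratio_doubled:.2f}, KR20 = {kr20_doubled:.2f}")
```

The unadjusted ratio comes to about 0.64 and the adjusted value to about 0.65. Which of the two the 0.64 above reflects is not stated, so treat the match as approximate.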

The square root of the total variance between scores (4.08) yields the standard deviation (SD) for the score distribution [2.02 for (n) and 2.07 for (n-1)] on the lower floor of the math model.
