Wednesday, December 10, 2014

Information Functions - Adding Unbalanced Items

                                                                13
Adding 22 balanced items to Table 33 of 21 items, in the prior post, resulted in a similar average test score (Table 36) and the same item information functions (the added items were duplicates of those in the first Nurse124 data set of 21 items). What happens if an unbalanced set of 6 items is added? I simply deleted the 16 high scoring additions from Table 36. Both the balanced additions (Table 36) and the unbalanced additions (Table 39) had the same extended range of item difficulties (5 to 21 right marks, or 23% to 95% difficulty).

Table 33
Table 36
Table 39

Adding a balanced set of items to the Nurse124 data set kept the average score about the same: 80% and 79% (Table 36). Adding a set of more difficult items to the Nurse124 data decreased the average score to 70% (Table 39), and individual student scores fell with it. Traditionally, a student’s overall score is then the average of the three test scores: 80%, 79%, and 70%, or 76% for an average student (Tables 33, 36, and 39). An estimate of a student’s “ability” is thus directly dependent upon his test scores, which are dependent upon the difficulty of the items on each test. This score is accepted as a best estimate of the student’s true score and as a best guess of future test scores. This makes common sense: past performance is a predictor of future performance.

 [Again a distinction must be made between what is being measured by right mark scoring (0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a test the student may not be able to read or understand. The other also indicates the quality of each student’s knowledge; the ability to make meaningful use of knowledge and skills. Both methods of analysis can use the exact same tests. I continue to wonder why people are still paying full price but harvesting only a portion of the results.]

The Rasch model IRT takes a very different route to “ability”. The very same student mark data sets can be used. Expected IRT student scores are based on the probability that half of all students at a given ability location will correctly mark a question at a comparable difficulty location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.)  [The location starts from the natural log of a ratio: right/wrong for a score and wrong/right for a difficulty. A convergence of score and difficulty yields the final location. The 50% test score becomes the zero logit location, the only point at which right mark scoring and IRT scores are in full agreement.]

The Rasch model IRT converts student scores and item difficulties [in the marginal cells of student data] into the probabilities of a right answer (Table 33b). [The probabilities replace the marks in the central cell field of student data.] It also yields raw student scores and their conditional standard errors of measurement (CSEMs) (Tables 33c, 34c, and 39c) based on the probabilities of a right answer rather than the count of right marks. (For more see my Rasch Model Audit blog.)
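
Here is a minimal sketch, in Python, of the bookkeeping just described, under stated assumptions: the marginal counts are hypothetical and the Winsteps convergence step that refines the starting logits is omitted, so the printout only illustrates the route from marginals to probabilities to CSEMs, not the actual Table 33 values.

```python
import math

# Hypothetical marginals: a few students and items from a 21-item, 22-student test.
N_ITEMS = 21                     # items on the test (for the right/wrong score ratio)
N_STUDENTS = 22                  # students tested (for the wrong/right difficulty ratio)
student_right = [17, 20, 12]     # right-mark counts for three students (ability marginals)
item_right = [18, 15, 11, 5]     # right-mark counts for four items (difficulty marginals)

# Starting logit locations: ln(right/wrong) for a score, ln(wrong/right) for an item.
ability = {r: math.log(r / (N_ITEMS - r)) for r in student_right}
difficulty = {r: math.log((N_STUDENTS - r) / r) for r in item_right}

def p_right(b, d):
    """Rasch probability that a student at ability b marks an item at difficulty d right."""
    return 1.0 / (1.0 + math.exp(-(b - d)))

# Replace marks with probabilities (the Table 33b step), then harvest the
# information p*q in each cell and turn each row sum into a CSEM (the Table 33c step).
for score, b in ability.items():
    row = [p_right(b, d) for d in difficulty.values()]
    info = sum(p * (1.0 - p) for p in row)
    print(f"score {score:2d}: expected right on these items {sum(row):4.2f}, "
          f"CSEM {1.0 / math.sqrt(info):4.2f} logits")
```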

Student ability becomes fixed and separated from the student test score; a student with a given ability can obtain a range of scores on future tests without affecting his ability location. A calibrated item can yield a range of difficulties on future tests without affecting its difficulty calibrated location. This makes sense only in relation to the trust you can have in the person interpreting IRT results; that person’s skill, knowledge, and (most important) experience at all levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well (do not “look right”) can be eliminated from the data set. Also, the same data set (Table 33, Table 36, and Table 39) can be treated differently if it is classified as a field test, operational test, benchmark test, or current test.

At this point states recalibrated and creatively equilibrated test results to optimize federal dollars during the NCLB era by showing gradual, continuing improvement. It is time to end the ranking of students by right mark scoring (0,1 scoring) and include KJS, or PCM (0,1,2 scoring) [a capability nearly every state education department already has in Winsteps], so that standardized testing yields the results needed to guide student development: the main goal of the CCSS movement.


The need to equilibrate a test is an admission of failure. The practice has become “normal” because failure is so common. It opened the door to cheating at state and national levels. [To my knowledge no one has been charged and convicted of a crime for this cheating.] Current computer adaptive testing (CAT) hovers about the 50% level of difficulty. This optimizes psychometric tools. Having a disinterested party outside of the educational community do the assessment analysis, and delivering the test by online CAT, reduces the opportunity to cheat. Neither, IMHO, optimizes the usefulness of the test results. End-of-course tests are now molding standardized testing into an instrument to evaluate teacher effectiveness rather than assess student knowledge and judgment (student development).

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, November 12, 2014

Information Functions - Adding Balanced Items

                                                               12
I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information functions (IIF). This post makes use of that observation to improve the Nurse124 data set that generated the set of IIFs in Chart 75.

I observed that Tables 33 and 34, in the prior post, contained no items with difficulties below 45%. The item information functions (IIFs) were also skewed (Chart 75). This is not the symmetrical display associated with the Rasch IRT model. I reasoned that adding a balanced set of items would increase the number of IIFs without changing the average item difficulty.

Table 36a shows the addition of a balanced set of 22 items to the Nurse124 data set of 21 items. As each lower ranking item was added, one or more high ranking items were added to keep the average test score near 80%. This added six lower ranking items and 16 higher scoring items, resulting in an average score of 79% and a total of 43 items.

Table 36
The average item difficulty for the Nurse124 data set was 17.57 and for the expanded set 17.28. The average test score of 80% came in at 79%. Both item difficulty and student scores (ability) thus remained about the same. [I did not take the time to tweak the additions for a better fit.]

The conditional standard error of measurement (CSEM) did change with the addition of more items (Chart 79 below). The number of cells containing information expanded from 99 to 204 cells. The average right count student score increased from 17 to 34.

Table 36c shows the resulting item information functions (IIFs). The original set of 11 IIFs now contains 17 IIFs (orange). The original set of 9 different student scores now contains 12 different scores; however, the range of student scores is comparable between the two sets. This makes sense, as the average test scores are similar and the student scores are also about the same.
Table 37
Chart 77

Chart 77 (Table 37) shows the 17 IIFs as they spread across the student ability range of 12 rankings (student score right count/% right). The trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near the average test score of 79%. This was expected, as the maximum information value within an IIF occurs where the item difficulty and student ability score match. [The three bottom traces on Chart 77 (blue, red, and green) have been colored in Table 37 as an aid in relating the table and chart (rotate Table 37 counter-clockwise 90 degrees).]
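
A short sketch of how such a trace arises, using hypothetical logit values rather than the author's spreadsheet: a single item's information is p*q at each ability location, peaking at 0.25 where ability equals the item's difficulty and tailing off on either side.

```python
import math

item_difficulty = 0.0                 # logits; roughly the 11/50% item described above

def p_right(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Trace the item information function (IIF) across a range of ability locations.
for ability in [x / 2.0 for x in range(-6, 7)]:        # -3.0 to +3.0 logits
    p = p_right(ability, item_difficulty)
    info = p * (1.0 - p)                               # cell information, at most 0.25
    print(f"ability {ability:+4.1f}: info {info:5.3f} {'#' * int(info * 80)}")
```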

Even more important is the way the traces become increasingly skewed the further the IIFs are from this maximum 11/50% trace (blue square, Chart 77). Also, the IIF with a difficulty of 18/82%, near the average test score, produced the identical total information (1.41) from both the Nurse124 and the supplemented data sets. But these values drifted apart for the two data sets for IIFs of higher and lower difficulty.

Two IIFs near the 50% difficulty point delivered the maximum information (2.17). Here again is evidence of what prompts psychometricians to work close to the 50%, or zero logit, point to optimize their tools when working on low quality data (scoring limited to right counts rather than also offering students the option to use their judgment to report what is actually meaningful and useful; to assess their development toward being a successful, independent, high quality achiever). [Students who only need some guidance rather than endless “re-teaching”; who, for the most part, consider right count standardized tests a joke and a waste of time.]
Chart 78

Table 38
The test information function for the supplemented data set is the sum of the information in all 17 item information functions (Table 38 and Chart 78). It took 16 easy items to balance 6 difficult items. The result was a marked increase in precision at the student score levels between 30/70% and 32/74%. [More at my Rasch Model Audit blog.]
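
The summing itself is simple; a minimal sketch with hypothetical item difficulties (in logits) shows the test information function and the CSEM, 1/SQRT(TIF), that it implies at each ability location.

```python
import math

# Hypothetical calibrated item difficulties, in logits.
item_difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

def item_info(ability, difficulty):
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

# The test information function (TIF) is the sum of the item information
# functions at each ability location; the IRT CSEM there is 1/sqrt(TIF).
for ability in (-2.0, -1.0, 0.0, 1.0, 2.0):
    tif = sum(item_info(ability, d) for d in item_difficulties)
    print(f"ability {ability:+4.1f}: TIF {tif:4.2f}, CSEM {1.0 / math.sqrt(tif):4.2f} logits")
```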

Chart 79

Chart 79 summarizes the relationships between the Nurse124 data, the supplemented data (adding a balanced set of items that keeps student ability and item difficulty unchanged), and the CTT and IRT data reduction methods. The IRT logit values (green) were plotted directly and inverted (1/CSEM) for comparison. In general, both CTT (blue) and inverted IRT (red) produced comparable CSEM values.

Adding the 22 items increased the CTT test SEM from 1.75 to 2.54. The standard deviation (SD) between student test scores increased from 2.07 to 4.46. The relative effect is 1.75/2.07 and 2.54/4.46, or 84% and 57%, a difference of 27 percentage points, or an improvement in precision of 27/84, about 32%.
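
The arithmetic, spelled out with the values quoted above:

```python
# SEM and SD are both in right-count units.
sem_before, sd_before = 1.75, 2.07      # Nurse124 set, 21 items
sem_after,  sd_after  = 2.54, 4.46      # supplemented set, 43 items

rel_before = sem_before / sd_before     # error as a share of the score spread (~84%)
rel_after  = sem_after / sd_after       # ~57% after the additions
improvement = (rel_before - rel_after) / rel_before
print(f"{rel_before:.1%}, {rel_after:.1%}, relative improvement {improvement:.1%}")
```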

Chart 79 also makes it very obvious that the higher the student test score, the lower the CTT CSEM: the more precise the student score measurement, the less error. That makes sense.

This statement about the CTT CSEM must be paired with a second statement: the more item information, the greater the precision of measurement by the item at that student score rank. The first statement harvests variance from the central cell field, within rows of student (right) marks (Table 36a) and within rows of probabilities (of right marks) in Table 36c.

The binomial variance CTT CSEM view is then comparable to the reciprocal, or inverted (1/CSEM), view of the test information function CSEM (Chart 79). CTT (blue, CTT Nurse124, Chart 79) and inverted IRT (red, IRT N124 Inverted) produced similar results even with an average test score of 79%, 29 percentage points away from the 50%, zero logit, IRT optimum performance point.

The second statement harvests variance, item information functions, in Table 36c from columns of probabilities (of right marks). Layering one IIF on top of another across the student score distribution yields the test information function (Chart 78).


The Rasch IRT model harvests the variance from rows and from columns of probabilities of a right answer that were generated from the marginal student scores and item difficulties. CTT harvests the variance of the marks students actually made. Yet, at the count-only right mark level, they deliver very similar results, with the exception of the IIF from IRT analysis, which the CTT analysis does not produce.

- - - - - - - - - - - - - - - - - - - - -


Wednesday, October 8, 2014

Customizing Test Precision - Information Functions

                                                               11

(Continued from the prior two posts.)

The past two posts have established that there is little difference between classical test theory (CTT) and item response theory (IRT) with respect to test reliability and conditional standard error of measurement (CSEM) estimates (other than the change in scales). IRT is now the analysis of choice for standardized tests. The Rasch model IRT is the easiest to use and also works well with small data sets, including classroom tests. How two normal scales for student scores and item difficulties are combined onto one IRT logit scale is no longer a concern to me, other than that the same method must be used throughout the duration of an assessment program.

Table 33
What is new and different from CTT is an additional insight from the IRT data in Table 32c (information p*q values). I copied Table 32 into Table 33 with some editing. I colored the cells holding the maximum amount of information (0.25) yellow in Table 33c. This color was then carried back to Table 33a, Right and Wrong Marks. [Item Information is related to the marginal cells in Table 33a (as probabilities), and not to the central cell field (as mark counts).] The eleven item information functions (in columns) were re-tabled into Table 34 and graphed in Chart 75. [Adding the information in rows yields the student score CSEM in Table 33c.]

Table 34
Chart 75
The Nurse124 data yielded an average test score of 16.8 marks, or 80%. This skewed the item information functions away from the 50%, or zero logit, difficulty point (Chart 75). The more difficult the item, the more information developed: from 0.49 at a 95% right count to a maximum of 1.87 at the 54% and 45% right counts. [No item on the test had a difficulty of 50%.]

Table 35
Chart 76
The sum of information (59.96) by item difficulty level and student score level is tabled in Table 35 and plotted as the test information function in Chart 76. This test does not do a precise job of assessing student ability. The test was most precise (19.32) at the 16 right count/76% right location. [Location can be designated by measure (logit), input raw score (red) or output expected score (Table 33b).]

The item with an 18 right count/92% right difficulty (Table 35) did not contribute the most information individually, but the group of three items at that difficulty did (9.17). The three highest scoring, easiest, items (counts of 19, 20, and 21) are just too easy for a standardized test but may be important survey items needed to verify knowledge and skills for this class of high performing students. None of these three items reached the maximum information level of 1/4. [It now becomes apparent how items can be selected to produce a desired test information function.]

More available information is interpreted as greater precision, or less error (a smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q)) in Table 33c. p*q is at a maximum when p = q, when right = wrong: (RT x WG)/(RT + WG)^2 or (3 x 3)/36 = 1/4.]
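
The bracketed formulas can be sketched directly; the row of probabilities below is hypothetical, but the 1/4 maximum and the CSEM calculation follow the formulas as given.

```python
import math

def cell_info(right, wrong):
    """p*q written as (RT x WG)/(RT + WG)^2."""
    return (right * wrong) / (right + wrong) ** 2

print(cell_info(3, 3))                  # 0.25, the maximum, when right = wrong
print(cell_info(5, 1))                  # smaller for any other split

# A hypothetical row of right-answer probabilities for one student score:
row_p = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50]
csem = 1.0 / math.sqrt(sum(p * (1.0 - p) for p in row_p))
print(f"CSEM = {csem:.2f} logits")
```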

Each item information function spans the range of student scores on the test (Chart 76). Each item information function measures student ability most precisely near the point that item difficulty and student ability match (50% right) along the IRT S-curve. [The more difficult an item, the more ability students must have to mark correctly 50% of the time. Student ability is the number correct on the S-curve. Item difficulty is the number wrong on the S-curve (see more at Rasch Model Audit).]   

Extracting item information functions from a data table provides a powerful tool (a test information function) for psychometricians to customize a test (page 127, Maryland 2010). A test can be adjusted for maximum precision (minimum CSEM) at a desired cut point.
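
A rough sketch of that customization idea, assuming a small hypothetical bank of calibrated item difficulties: pick the items that contribute the most information at the chosen cut location.

```python
import math

def info_at(ability, difficulty):
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

bank = [-2.0, -1.2, -0.6, -0.1, 0.0, 0.3, 0.7, 1.1, 1.8, 2.5]   # item difficulties (logits)
cut = 0.5                                                       # desired cut point (logits)

# Keep the five items that are most informative at the cut point.
chosen = sorted(bank, key=lambda d: info_at(cut, d), reverse=True)[:5]
tif = sum(info_at(cut, d) for d in chosen)
print("chosen difficulties:", chosen)
print(f"TIF at cut {tif:.2f}, CSEM at cut {1.0 / math.sqrt(tif):.2f} logits")
```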

The bright side of this is that the concept of “information” (not applicable to CTT), and the ability to put student ability and item difficulty on one scale, gives psychometricians powerful tools. The dark side is that the form in which the test data are obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the NCLB era, as psychometrics has made marked improvements, the student mark data it is being supplied with has remained in the casino arena: mark an answer to each question (even if you cannot read or understand the question), do not guess, and hope for good luck on test day.

The concepts of information, item discrimination and CAT all demand values hovering about the 50% point for peak psychometric performance. Standardized testing has migrated away from letting students report what they know and can do to a lottery that compares their performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day. 

The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the fewer the students who fall below that point on the score distribution yet might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without even showing the actual test questions or the test blueprint.]

It does not have to be that way. Knowledge and Judgment Scoring (classroom friendly) and the partial credit Rasch model (that is included in the software states use) can both update traditional multiple-choice to the levels of thinking required by the common core state standards (CCSS) movement. We need an accurate, honest and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT. 


- - - - - - - - - - - - - - - - - - - - -


Wednesday, September 10, 2014

Conditional Standard Error of Measurement - Precision

                                                              10    
(Continued from prior post.)

Table 32a contains two estimates (red) of the test standard error of measurement (SEM) that are in full agreement.  One estimate, 1.75, is from the average of the conditional standard error of measurements (CSEM, green) for each student raw score. The traditional estimate, 1.74, uses the traditional test reliability, KR20. No problem here.

The third estimate of the test SEM, in Table 32c, is different. It is based on CSEM values expressed in logits (the natural log, base 2.718) rather than on the normal scale. The values are also inverted in relation to the traditional values in Table 32 (Chart 74). There is a small but important difference: the IRT CSEM values are much more linear than the CTT CSEM values. Also, the center of this plot is the mean of the number of items (Chart 30, prior post), not the mean of the item difficulties or student scores. [Also, most of this chart was calculated, as most of these relationships do not require actual data to be charted. Only nine score levels came from the Nurse124 data.]

Chart 74 shows the binomial CSEM values for CTT (normal) and the IRT (logit) values obtained by inverting the CTT values: “SEM(Rasch Measure in logits) = 1/(SEM(Raw Score))”, 2007. I then adjusted each of these so the corresponding curves, on the same scale, crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The extreme values for no right and all right were not included. CSEM values for extreme scores go to zero or to infinity, with the following result:

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm

Precision is then not constant across the range of student scores for either method of analysis. The test SEM of 0.64 logits is comparable to 1.74 counts on the normal scale.
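
A sketch of the inversion behind that comparison, assuming the binomial relationship quoted earlier (SEM in logits = 1/SEM in raw counts) and treating every item as equally difficult:

```python
import math

n = 21                                      # items on the test
for x in (3, 8, 11, 14, 17, 20):            # selected raw scores
    p = x / n
    raw_csem = math.sqrt(n * p * (1 - p))   # binomial CSEM in counts (CTT view)
    logit_csem = 1.0 / raw_csem             # its reciprocal, in logits (IRT view)
    print(f"score {x:2d}/{n}: raw CSEM {raw_csem:4.2f} counts, logit CSEM {logit_csem:4.2f}")
# Raw-count precision improves toward the extremes while logit precision
# worsens there - the inversion in Chart 74 and the "apparent paradox" above.
```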

The estimate of precision, the CSEM, serves three different purposes. For CTT and IRT it narrows down the range in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values estimates the test SEM as 1.75 counts out of a range of 21 items; this is less than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and useful.

For IRT analysis, the CSEM indicates the degree that the data fit the perfect Rasch model (3). A better fit also results in more believable and useful results.

“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation of an imagined error distribution representing the possible distribution of observed values around their “true” theoretical value. This precision is based on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the measures or estimates correspond to a reference standard outside the data, in this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm

Precision also has some very practical limitations when delivering tests by computer adaptive testing (CAT). Linacre, 2006, has prepared two very neat tables showing the number of items that must be on a test to obtain a desired degree of precision expressed in logits and in confidence limits. The closer the test “targets” an average score of 50%, the fewer items needed for a desired precision.
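
A back-of-envelope version of that targeting effect (not Linacre's actual tables): each item can contribute at most 0.25 information when it sits right at the student's ability, so the further a test drifts off target, the more items a given logit SEM requires.

```python
import math

def items_needed(target_sem, offset_logits):
    """Items needed for a logit SEM of target_sem when items sit offset_logits off target."""
    p = 1.0 / (1.0 + math.exp(-offset_logits))
    per_item_info = p * (1.0 - p)                 # 0.25 when perfectly targeted
    return math.ceil(1.0 / (target_sem ** 2 * per_item_info))

for offset in (0.0, 1.0, 2.0):
    print(f"offset {offset:.1f} logits: {items_needed(0.5, offset)} items for SEM 0.5")
```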

The two top students, with the same score of 20, missed items with different difficulties, yet they both yield the same CSEM. The CSEM ignores the pattern of marks and the difficulty of items; a CSEM value obtained in this manner is related only to the raw score. Absolute values for the CSEM are, however, sensitive to item difficulty (Tables 23a and 23b).

The precision of a cut score has received increasing attention during the NCLB era. In part, court actions have made the work of psychometricians more transparent. The technical report for a standardized test can now exceed 100 pages. There has been a shift of emphasis from test SEM, to individual score CSEM, to IRT information as an explanation of test precision.

 “(Note that the test information function and the raw score error variance at a given level of proficiency [student  score], are analogous for the Rasch model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The conditional standard error of measurement is the inverse of the information function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.

I cannot end this without repeating that this discussion of precision is based on traditional multiple-choice (TMC) that only ranks students, a casino operation. Students are not given the opportunity to include their judgment of what they know or can do that is of value to themselves, and their teachers, in future learning and instruction, as is done with essays, problem solving, and projects. This is easily done with knowledge and judgment scoring (KJS) of multiple-choice tests.


(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.


Wednesday, August 13, 2014

Test Score Reliability - TMC and IRT

                                                                9
The main purpose of this post is to investigate the similarities between traditional multiple-choice (TMC), or classical test theory (CTT), and item response theory (IRT). The discussion is based on TMC and IRT, as the math is simpler than when using knowledge and judgment scoring (KJS) and the IRT partial credit model (PCM). The difference is that TMC and IRT input marks at the lowest levels of thinking, resulting in a traditional ranking. KJS and PCM input the same marks at all levels of thinking, resulting in a ranking plus a quality indication of what a student actually knows and understands that is of value to that student (and teacher) in further instruction and learning.

I applied the instructions in the Winsteps Manual, page 576, for checking the Winsteps reliability estimate computation, to the Nursing124 data used in the past several posts (22 students and 21 items). Table 32 is a busy table that is discussed over the next several posts. The two estimates for test reliability (0.29 and 0.28, orange), based on TMC and IRT, are essentially identical (allowing for rounding error).

Table 32a shows the TMC test reliability estimated from the ratio of true variance to total variance. The total variance between scores, 4.08, minus the error variance within items, 2.95, yields the true variance, 1.13. The KR20 then completes the reliability calculation to yield 0.29 using normal values.
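
The same arithmetic in a few lines, using the Table 32a values quoted above and the usual k/(k - 1) factor in KR20:

```python
total_var = 4.08                 # variance between student scores
error_var = 2.95                 # error variance within items (sum of p*q)
true_var = total_var - error_var # 1.13

k = 21                           # items on the test
kr20 = (k / (k - 1)) * (true_var / total_var)
print(f"true variance {true_var:.2f}, KR20 {kr20:.2f}")   # about 0.29
```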

For an IRT estimate of test reliability, the values on a normal scale are converted to the logit scale (ln ratio w/r). In this case, the sum of item difficulty logits, ln ratio w/r, was -1.62 (Table 32b). This value is subtracted from each item difficulty logit value to shift the mean of the item distribution to the zero logit point (Rasch Adjust, Table 32b). Winsteps then optimizes the fit of the data (blue) to the perfect Rasch Model. Now comparable student ability and item difficulty values are in register at the same locations on a single logit scale. The 50% point on the normal scale is now at the zero location for both student ability and item difficulty.

The probability for each right mark (expected score) in the central cells is the product of the respective marginal cells (blue) for item difficulty (Winsteps Table 13.1) and student ability (Winsteps Table 17.1). The sum of these probabilities (Table 32b, pink) is identical to the normal Score Mean (Table 32a, pink).

The “information” in each central cell, in Table 32c, was obtained by p*q or p * (1 - p) from Table 32b. Adding up the internal cells for each score yields the sum of information for that score.  

The next column shows the square root of the sum of information. This value inverted yields the conditional standard error of measurement (CSEM). The conditional variance (CVar) within each student ability measure is then obtained by reversing the equation for normal values in Table 32a: The CVar is obtained as the square of the CSEM instead of the CSEM being obtained as the square root of the CVar. The average of these values is the test model error variance (EV) in measures: 0.43.

The observed variance (OV) between measures is estimated in the exact same way as is done for normal scores: the variance between measures from Excel =VAR.P (0.61) or the square of the SD: 0.78 squared = 0.61.

The test reliability in measures, (OV - EV)/OV = (0.61 - 0.45)/0.61 = 0.28, is then obtained from the same equation as for normal values: (total variance - error variance)/total variance = (4.08 - 2.96)/4.08 = 0.29, in Table 32a. Normal and measure dimensions for the same value differ, but ratios do not, as a ratio has no dimension. TMC and IRT produced the same values for test reliability. As will KJS and the PCM.
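
The measure-scale arithmetic can be sketched the same way; the per-score CSEMs below are hypothetical placeholders, while the observed variance is the post's value, so the printout illustrates the method rather than reproduces Table 32c.

```python
# Reverse the normal-scale route: CVar = CSEM^2, EV = average CVar,
# reliability = (OV - EV)/OV.
csems = [0.62, 0.60, 0.65, 0.68, 0.72]              # hypothetical per-score CSEMs (logits)
ev = sum(c * c for c in csems) / len(csems)         # model error variance in measures

ov = 0.61                                           # observed variance between measures
print(f"EV {ev:.2f}, reliability {(ov - ev) / ov:.2f}")
```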

(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.


Wednesday, July 9, 2014

Small Sample Math Model - SEMs

                                                                      8
The test standard error of measurement (SEM) can be calculated in two ways: The traditional way is by relating the variance between student scores and within item difficulties; between an external column and the internal cell columns.

The second way harvests the variance conditioned on each student score and then averages the CSEMs (SQRT(conditional student score error variance)) over the test. The first method links two properties: student ability and item difficulty. The second uses only one property: student ability.

I set up a model with 12 students and 11 items (see previous post and Table26.xlsm below). Extreme values of zero and 100% were excluded. Four samples with average test scores of 5, 6, 7 (Table 29), and 8 were created with the standard deviation (1.83) and the variance within item difficulties (1.83) held constant. This allowed the SEM to vary between methods.

The calculation of the test SEM (1.36) by way of reliability (KR20) is reviewed on the top level of Chart 73. The test SEM remained the same for all four tests.

My first calculation of the test SEM by way of the conditional standard error of measurement (CSEM) began with the deviation of each mark from the student score (Table 29 center). I squared the deviations and summed them to get the conditional variance for each score. The individual student CSEM is the square root of the conditional variance (its SD). The test SEM (1.48) is then the average of the student CSEM values.

[My second calculation was based on the binomial standard error of measurement given in Crocker, Linda, and James Algina, 1986, Introduction to Classical & Modern Test Theory, Wadsworth Group, pages 124-127.

By including the “correction for obtaining unbiased estimates of population variance”, n/(n - 1), the SEM value increased from 1.48 to 1.55 (Table 29). This is a perfect match to the binomial SEM.]

The two SEMs are then based on different sample sizes and different assumptions. The traditional SEM (1.36) is based on the raggedly distributed small sample size in hand. The binomial SEM (1.55) assumes a perfectly normally distributed large theoretical population.

[Variance calculations (variance is additive):

  • Test variance: Score deviations from the test mean (as counts), squared, and summed = a sum of squares (SS). SS/N = MSS or variance: 3.33. {Test SD = SQRT(Var) = 1.83. Test SEM = 1.36.}

  • Conditional error variance: Deviations from the student score (as a percent), squared, and summed = the conditional error variance (CVar) for that student score. {Test SEM = Average SQRT(CVar) = 1.48 (n) and 1.55 (n-1)}

  • Conditional error variance: Variance within the score row (Excel, VAR.P or VAR.S) x n = the CVar for that student score. {Test SEM: VAR.P gives 1.48 and VAR.S gives 1.55.}]
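
The bracketed calculations above, sketched for one hypothetical row of 11 marks:

```python
import math

marks = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]      # a score of 6 right out of 11
n = len(marks)
p = sum(marks) / n                              # the score as a proportion

# Deviations of each mark from the score, squared and summed, give the
# conditional error variance (CVar) for this score (equal to n*p*q) ...
cvar = sum((m - p) ** 2 for m in marks)
csem = math.sqrt(cvar)                          # this student's CSEM, in counts
# ... and the n/(n - 1) correction gives the unbiased (binomial) version.
csem_corrected = math.sqrt(cvar * n / (n - 1))

print(f"CSEM {csem:.2f}, corrected CSEM {csem_corrected:.2f}")
# Averaging the per-student CSEMs over all 12 students gives the test SEM
# values quoted above: 1.48 uncorrected (n) and 1.55 corrected (n - 1).
```
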
Squaring values produces curved distributions (Chart 73). The curves represent the possible values. They do not represent the number of items or student scores having those values.

The True MSS = Total MSS - Error MSS = 3.33 - 1.83 = 1.50. This involves subtracting a convex distribution centered on the average test score from a concave distribution centered on the maximum value of 0.25 (not on the average item difficulty).

The student score MSS is at a maximum when the item error SS is at a minimum. The error MSS is at a maximum (0.25) when the student score MSS is at a minimum (0.00). This makes sense: such an item is perfectly aligned with the student score distribution at the point where it does not differ from the average test score.

The KR20 is then a ratio of the True MSS to the Total MSS, 1.50/3.33 = 0.50. [KR20 ranges from 0 to 1, from not reproducible to fully reproducible.] The test SEM is then a portion, SQRT(1 - KR20), of the SD [also 1.83 in this example, SQRT(3.33)]: SQRT(1 - 0.50) * 1.83 = 1.36.

I was able to set the test SEM estimates using KR20 to 1.36 for all four tests by holding the SD of student scores and the item error MSS at constant values, switching a 0 and 1 pair in student mark patterns to make each adjustment. [The SD and the item error MSS do not have to be the same values.]

All possible individual student score binomial CSEM values for a test with 11 items are listed in Table 30. The CSEM is given as SQRT(conditional variance). The conditional variance is (X * (n - X))/(n - 1), or n*(p*q) * (n/(n - 1)). There is then no need to administer a test to calculate a student score binomial CSEM value. There is a need to administer a test to find the test SEM. The test SEM (Table 29) is the average of these values, 1.55.
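
The formula applied to every possible raw score on an 11-item test (the content Table 30 is described as listing), as a quick sketch:

```python
import math

n = 11
for x in range(1, n):                           # extreme scores of 0 and 11 excluded
    cvar = x * (n - x) / (n - 1)                # conditional variance, corrected
    print(f"score {x:2d}: binomial CSEM {math.sqrt(cvar):.2f}")
```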

The student CSEM and thus the test SEM values are derived only from student mark patterns. They differ from the test SEM values derived from the KR20 (Table 31). With KR20 derived values held constant, the binomial CSEM derived values for SEM decreased with higher test scores. This makes sense. There is less room for chance events. Precision increases with higher test scores.

Given a choice, a testing company would select the KR20 method using CTT analysis to report test SEM results.

[The same SEM values for tests with 5 right and 6 right resulted from the fact that the median score was 5.5. The values for 5 right and 6 right fall an equal distance from the mean on either side. Therefore 5 and 6 or 6 and 5 both add up to 11.]

I positioned the green curve on Chart 73 using the above information.

A CSEM value is independent of the average test score and item difficulties. (Swapping paired 0s and 1s in student mark patterns to adjust the item error variance made no difference in the CSEM values.) The average of the CSEM values, the test SEM, is dependent on the number of items on the test and on how many scores fall at each value. If all scores are the same, the CSEMs and the SEM will be the same (Tables 30 and 31).

I hope at this stage to have a visual mathematical model that is robust enough to make meaningful comparisons with the Rasch IRT model. I would like to return to this model and do two things (or have someone volunteer to do it):

  1. Combine all the features that have been teased out, in Chart 72 and Chart 73, into one model.
  2. Animate the model in a meaningful way with change gages and history graphs.
Now to return to the Nursing data that represent the real classroom, filled with successful instruction, learning, and assessment.

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request. (Files hosted at nine-patch.com are also being relocated now that Nine-Patch Multiple-Choice, Inc. has been dissolved.)


Wednesday, June 18, 2014

Small Sample Math Model - Item Discrimination

                                                                   #7 
The ability of an item to place students into two distinct groups is not a part of the mathematical model developed in the past few posts. Discrimination ability, however, provides insight into how the model works. A practical standardized test must have student scores spread out enough to assign desired rankings. Discriminating items produce this spread of student scores.

Current CCSS multiple-choice standardized test scoring only ranks; it does not tell us what a student actually knows that is useful and meaningful to the student as the basis for further learning and effective instruction. This can be done with Knowledge and Judgment Scoring and the partial credit Rasch IRT model using the very same tests. This post uses traditional scoring, as it simplifies the analysis (and the model) to just right and wrong; no judgment or higher levels of thinking are required of students.

I created a simple data set of 12 students and 11 items (Table 26) with an average score of 5. I then modified this set to produce average scores of 6, 7, and 8 (Table 27). [This can also be considered as the same test given to students in grades 5, 6, 7, and 8.]

The item error mean sum of squares (MSS), or variance, for a test with an average score of 8 was 1.83. I then adjusted the MSS for the other three grades to match this value. A right and a wrong mark were exchanged in a student mark pattern (row) to make an adjustment (Table 27). I stopped with 1.85, 1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This forced the KR20 = 0.495 and SEM = 1.36 to remain the same for all four sets.)

The average item difficulty (Table 27) varied, as expected, with the average test score. The average item discrimination (Pearson r and PBR) (Table 28) was stable. In general, with a few outliers in this small data set, the most discriminating items had the same difficulty as the average test score. [This tendency for item discrimination to be maximized at the average test score is a basic component of the Rasch IRT model, which, by design, must use the 50% point.]

The scatter chart, Chart 71, has sufficient detail to show that items tend to be most discriminating when they have a difficulty near the average test score (not just near 50%).

The question is often asked, “Do tests have to be designed for an average score of 50%?”  If the SD remains the same, I found no difference in the KR20 or SEM. [The observed SD is ignored by the Rasch IRT model used by many states for test analysis.]

The maximum item discrimination value of 0.64 was always associated with an item mark pattern in which all right marks and all wrong marks were in two groups with no mixing of right and wrong marks. I loaded a perfect Guttman mark pattern and found that 0.64 was the maximum corrected value for this size of data set. (The corrected values are better estimates than the uncorrected values in a small data set.)

Items of equal difficulty can have very different discrimination values. In Table 26, three items have a difficulty of 7 right marks. Their corrected discrimination values ranged from 0.34 to 0.58.

Psychometricians have solved the problem this creates in estimating test reliability by deleting an item and recalculating the test reliability to find the effect of any item in a test. The VESEngine (free download below) includes this feature: Test Reliability (TR) toggle button. Test reliability (KR20) and item discrimination (PBR) are interdependent on student and item performance. A change in one usually results in a change in one or more of the other factors. [Student ability and item difficulty are considered independent using the Rasch model IRT analysis.] {I have yet to determine if comparing CTT to IRT is a case of comparing apples to apples, apples to oranges or apples to cider.}

Two additions to the model (Chart 72) are the two distributions of the error MSS (black curve) and the portion of right and wrong marks (red curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at each end. Both are insensitive to the position of right marks in an item mark pattern. The average score for right and for wrong marks is sensitive to the mark pattern as the difference between these two values determines part of the item discrimination value; PBR = (Proportion * Difference in Average Scores)/SD.
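
A sketch of the discrimination calculation using the standard point-biserial form (which matches the PBR shorthand above if “Proportion” is read as SQRT(p*q)); the marks below are hypothetical, and the values are uncorrected, i.e., the item is not removed from the score it is correlated with.

```python
import math

# Rows are students, columns are items; a score is a right-mark count.
students = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]

def pbr(item):
    scores = [sum(row) for row in students]
    marks = [row[item] for row in students]
    p = sum(marks) / len(marks)                               # proportion marking right
    mean = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    mean_right = sum(s for s, m in zip(scores, marks) if m) / sum(marks)
    mean_wrong = sum(s for s, m in zip(scores, marks) if not m) / (len(marks) - sum(marks))
    return (mean_right - mean_wrong) * math.sqrt(p * (1 - p)) / sd

for i in range(len(students[0])):
    print(f"item {i}: difficulty {sum(row[i] for row in students)}/4, PBR {pbr(i):+.2f}")
```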

Traditional, classical test theory (CTT), test analysis can use a range of average test scores. In this example there was no difference in the analysis with average test scores of 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis transforms normal counts into logits that have only one reference point, 50% (zero logit), when student ability and item difficulty are positioned on one common scale. This point is then extended in either direction by values that represent equal student ability and item difficulty (50% right), from zero to 100% (-50% to +50%). This scale ignores the observed item discrimination.

- - - - - - - - - - - - - - - - - - - - -
