Wednesday, December 10, 2014

Information Functions - Adding Unbalanced Items

Adding 22 balanced items to the Table 33 set of 21 items, in the prior post, resulted in a similar average test score (Table 36) and the same item information functions (the added items were duplicates of those in the first Nurse124 data set of 21 items). What happens if an unbalanced set of 6 items is added instead? I simply deleted the 16 high-scoring additions from Table 36. Both the balanced additions (Table 36) and the unbalanced additions (Table 39) had the same extended range of item difficulties (5 to 21 right marks, or 23% to 95% difficulty).

Table 33
Table 36
Table 39

Adding a balanced set of items to the Nurse124 data set kept the average score about the same: 80% and 79% (Table 36). Adding a set of more difficult items to the Nurse124 data decreased the average test score to 70% (Table 39) and lowered individual student scores. Traditionally, a student’s overall score is then the average of the three test scores: 80%, 79%, and 70%, or 76% for an average student (Tables 33, 36, and 39). An estimate of a student’s “ability” is thus directly dependent upon his test scores, which are in turn dependent upon the difficulty of the items on each test. This average is accepted as the best estimate of the student’s true score and as the best guess of future test scores. This makes common sense: past performance is a predictor of future performance.
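To make that dependence concrete, here is a minimal sketch in Python of the traditional averaging; the percent scores come from Tables 33, 36, and 39, and the rounding is mine.

# Traditional "ability" estimate: the plain average of right-mark test scores.
# Percent-right scores from Tables 33, 36, and 39.
test_scores = [80, 79, 70]
overall = sum(test_scores) / len(test_scores)
print(f"Average overall score: {overall:.0f}%")  # prints 76%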

 [Again, a distinction must be made between what is being measured by right mark scoring (0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a test the student may not be able to read or understand. The other also indicates the quality of each student’s knowledge: the ability to make meaningful use of knowledge and skills. Both methods of analysis can use the exact same tests. I continue to wonder why people are still paying full price but harvesting only a portion of the results.]

The Rasch model IRT takes a very different route to “ability”. The very same student mark data sets can be used. Expected IRT student scores are based on the probability that half of all students at a given ability location will correctly mark a question at a comparable difficulty location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.)  [The location starts from the natural log of a ratio: right/wrong for a student score and wrong/right for item difficulty. A convergence of the score and difficulty estimates yields the final location. The 50% test score becomes the zero-logit location, the only point at which right mark scoring and IRT scores are in full agreement.]
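A minimal sketch, in Python, of these starting locations and the resulting probability, assuming the standard dichotomous Rasch model; the function names and the 11-right-out-of-22 example are mine, not values from the Nurse124 tables.

import math

def ability_logit(right, total):
    # Initial student ability location: ln(right / wrong).
    return math.log(right / (total - right))

def difficulty_logit(right_marks, n_students):
    # Initial item difficulty location: ln(wrong / right).
    return math.log((n_students - right_marks) / right_marks)

def p_right(ability, difficulty):
    # Rasch probability of a right mark: 1 / (1 + e^-(ability - difficulty)).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A 50% score sits at the zero-logit location and has a 50% chance
# on an item of matching (zero-logit) difficulty.
print(ability_logit(11, 22))   # 0.0
print(p_right(0.0, 0.0))       # 0.5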

The Rasch model IRT converts student scores and item difficulties [in the marginal cells of the student data] into the probabilities of a right answer (Table 33b). [The probabilities replace the marks in the central cell field of the student data.] It also yields raw student scores and their conditional standard errors of measurement (CSEMs) (Tables 33c, 36c, and 39c), based on the probabilities of a right answer rather than the count of right marks. (For more, see my Rasch Model Audit blog.)
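A minimal sketch of those two quantities, assuming the usual Rasch-based formulas (expected raw score as the sum of the cell probabilities, CSEM in raw-score units as the square root of the summed cell variances); the item difficulties below are illustrative, not taken from Table 33c.

import math

def p_right(ability, difficulty):
    # Rasch probability of a right mark for one student on one item.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    # Expected raw score: the sum of the probabilities of a right answer.
    return sum(p_right(ability, d) for d in difficulties)

def csem(ability, difficulties):
    # Conditional standard error of measurement (raw-score units):
    # square root of the summed binomial cell variances p * (1 - p).
    return math.sqrt(sum(p_right(ability, d) * (1 - p_right(ability, d)) for d in difficulties))

# Illustrative 21-item test with difficulties spread around zero logits.
difficulties = [-2.0 + 0.2 * i for i in range(21)]
print(round(expected_score(1.0, difficulties), 1))  # expected raw score for a 1-logit student
print(round(csem(1.0, difficulties), 2))            # its CSEM in raw-score units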

Student ability becomes fixed and separated from the student test score; a student with a given ability can obtain a range of scores on future tests without affecting his ability location. Likewise, a calibrated item can yield a range of difficulties on future tests without affecting its calibrated difficulty location. This makes sense only in relation to the trust you can place in the person interpreting IRT results; that person’s skill, knowledge, and (most important) experience at all levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well, that do not “look right”, can be eliminated from the data set. Also, the same data set (Table 33, Table 36, and Table 39) can be treated differently depending on whether it is classified as a field test, operational test, benchmark test, or current test.

At this point, states recalibrated and creatively equilibrated test results to optimize federal dollars during the NCLB era by showing gradual, continuing improvement. It is time to end the ranking of students by right mark scoring (0,1 scoring) and include KJS, or the partial credit model (PCM) with 0,1,2 scoring [which about every state education department already has: Winsteps], so that standardized testing yields the results needed to guide student development: the main goal of the CCSS movement.


The need to equilibrate a test is an admission of failure. The practice has become “normal” because failure is so common. It opened the door to cheating at state and national levels. [To my knowledge, no one has been charged and convicted of a crime for this cheating.] Current computer adaptive testing (CAT) hovers around the 50% level of difficulty. This optimizes psychometric tools. Having a disinterested party outside the educational community do the assessment analysis, and delivering CAT online, reduces the opportunity to cheat. These practices do not, IMHO, optimize the usefulness of the test results. End-of-course tests are now molding standardized testing into an instrument for evaluating teacher effectiveness rather than assessing student knowledge and judgment (student development).

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one traditional two-dimensional Excel spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom-friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring of multiple-choice). This is a quick way to update your multiple-choice tests to meet Common Core State Standards (promoting understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting, highly successful achievers in two to nine months. Download as .xlsm or .xls.