Wednesday, July 24, 2013

Visual Education Statistics - Equipercentile Equating


                                                             19
Equipercentile equating frequently appears in NCLB testing articles. I took a normal distribution of 40 student scores (average of 50%) with a standard deviation (SD) of 10% (new test) and equated it to one with a SD of 20% (reference test) to see how equipercentile equating works (Chart 54).
First I grouped the scores into 5%-ranges. I then matched the new test groups to the reference test groups (Chart 55). The result was a bit messy. 
A re-plot of the twenty 5%-groups shows that the new test has been sliced into groups containing twice the count of the reference test groups, but which match the reference test, in general, at every other group (Chart 56).
Smoothing by inspection resulted in Chart 57. Aside from rounding errors, a perfect fit to the reference test was obtained.
Smoothing on “small samples of test-takers” does make a difference in the accuracy of equipercentile equating. “The improvement that resulted from smoothing the distributions before equating was about the same as the improvement that resulted from doubling the number of test-takers in the samples” (Livingston, 2004, page 21). [See Post 13, Chart 34, in this series for the effect of doubling the number of test-takers on the SD and SEM.]
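For readers who want to see the mapping itself, here is a minimal sketch of equipercentile equating in Python. It is not the VESE spreadsheet or the 5%-grouping above; the 40-student samples and the 10% and 20% SDs follow Chart 54, while the random seed and the interpolation between order statistics are my own assumptions.

```python
# A minimal sketch of equipercentile equating, assuming normal samples that
# roughly follow Chart 54 (40 students, mean 50%, SD 10% new test vs. SD 20% reference).
import numpy as np

rng = np.random.default_rng(0)
n = 40
new_test = rng.normal(50, 10, n)   # new test scores
ref_test = rng.normal(50, 20, n)   # reference test scores

def percentile_rank(scores, x):
    """Percent of scores at or below x."""
    return 100.0 * np.mean(scores <= x)

# Map each new-test score to the reference-test score at the same percentile rank.
ranks = [percentile_rank(new_test, x) for x in new_test]
equated = np.percentile(ref_test, ranks)

# The equated scores take on the reference test's spread (SD near 20%).
print(round(new_test.std(), 1), round(equated.std(), 1))
```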

I then entered the values from Charts 54, 55, and 57 into my visual education statistics engine (VESE). Equipercentile equating the student scores transformed the new test into the reference test including the related group statistics (Chart 58).
The three 5%-groupings show almost identical values. Grouping reduced the item discrimination ability (PBR) of the reference test a small amount, as grouping also reduced the range of the student score distribution. This works very nicely in a perfect world; however, real test scores do not align perfectly with the normal curve.

A much more detailed description of equipercentile equating and smoothing is found in Livingston (2004, pages 17-24). The easy-to-follow illustrated examples include real test results and related problems, with a troubling resolution: “Often the choice of an equating method comes down to a question of what is believable, given what we know about the test and the population of test-takers.”

This highly subjective statement was acceptable in 2004. NCLB put pressure on psychometricians to do better. The CCSS movement has raised the bar again. The subjectivity expressed here is, IMHO, similar to that in the Rasch model IRT analysis that has been popular with state departments of education. Both methods, with and without IRT, base results on a relationship to an unknowable “population of test-takers”. Both pursue manipulations that end up with the results “looking right”.

[The classroom equivalent of this, practiced in Missouri prior to NCLB, was to divide the normal curve into parts for letter grades. One version was to assign grades to ranked student scores with uniform slices. True believers assigned a double portion to “C”. Every class was then a “normal” class with no way to know what the raw scores were or what students actually knew or could do.]  

It does not have to be that way. Let students report what they actually know and can do. Let them report what they trust will be of value for further learning and for application in situations other than those in which they learned. Do multiple-choice right. Get results comparable to essays, projects, reports, and research. Promote student development. Knowledge and Judgment Scoring and partial credit Rasch model analysis do this. Guessing is no longer needed. Forced guessing should not be tolerated, IMHO.

The move to performance-based learning may, this time, not only compete with the CCSS movement assessments, but replace them. The system that is the leanest, the most versatile in meeting student needs, and immune to erratic federal funding, and thus the most effective, will survive.
- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, July 10, 2013

Visual Education Statistics - Equating


                                                             18
The past few posts have shown that if two tests have the same student score standard deviation (SD) they are easy to combine or link. Both tests will have the same student score distribution on the same scale.

Equating is then a process of finding the difference between the average test scores and applying that value to one of the two sets of test scores: add the difference to the lower set of scores, or subtract it from the higher set, to combine the two sets of test scores.
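As a minimal sketch, assuming two short made-up score lists (none of these numbers come from the posts or charts), the whole operation in Python is just a shift by the difference in averages:

```python
# A minimal sketch of mean-shift equating for two tests with the same SD.
# The score lists are made-up illustrations, not data from any chart.
new_scores = [45, 50, 55, 60, 65]   # new test, average 55
ref_scores = [50, 55, 60, 65, 70]   # reference test, average 60

shift = sum(ref_scores) / len(ref_scores) - sum(new_scores) / len(new_scores)

# Add the difference to the lower-scoring set (or subtract it from the higher).
equated = [score + shift for score in new_scores]
print(shift, equated)   # 5.0 [50.0, 55.0, 60.0, 65.0, 70.0]
```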

This can be done whenever the SDs are within acceptable limits (considering all factors that may affect the test results, the expected results, and the intended use of the results). This is, IMHO, a very subjective judgment call to be made by the most experienced person available.

There are two other situations: the same average test score but SDs that differ beyond acceptable limits, and both average test score and SD differences beyond acceptable limits for the two tests. In both cases we need to equate the two different SDs, the two different distributions of student scores.

Chart 48 is a re-tabling of Chart 44. The x-axis in Chart 48 shows the set standard deviation (SD) used in the VESE tables in prior posts. Equating a low SD test (10) to a high SD test (30) has different effects than equating a high SD test (30) to a low SD test (10). The first improves the test performance; the second reduces the test performance.

There is then a bias to raise the low SD test to the high SD test. “The test this year was more difficult than the test last year,” was the NCLB explanation from Texas, Arkansas, and New York. [It was not that the students this year were less prepared.]

The most frequent way I have seen mapping (Livingston, 2004, figure 2, page 14) done is to plot the scores of the test to be equated on the x-axis and the scores of the reference test on the y-axis. The equate line for two tests with similar average test scores and SDs is a straight line from zero through the 50% point on both axes (Chart 49).

If the average test scores are similar but the SDs are different, the equate line becomes tilted to expand (Chart 50) or contract (Chart 51) the equated values to match the reference test. Mapping from a low SD test to a higher SD test leaves gaps. Mapping from a high SD test to a low SD test produces clumping, in part from rounding errors.
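A minimal sketch of that tilted line, assuming the standard linear-equating formula (shift to the reference mean and scale by the ratio of the SDs); I am guessing this is what sits behind Charts 49-51, and the numbers below are only illustrative:

```python
# A minimal sketch of the tilted equate line: shift to the reference mean and
# scale by the ratio of the SDs. The means and SDs below are illustrations.
def linear_equate(x, new_mean, new_sd, ref_mean, ref_sd):
    """Map a new-test score x onto the reference-test scale."""
    return ref_mean + (ref_sd / new_sd) * (x - new_mean)

# Expanding: low SD new test (10) onto a high SD reference test (20) leaves gaps.
print([linear_equate(x, 50, 10, 50, 20) for x in (40, 50, 60)])   # [30.0, 50.0, 70.0]

# Contracting: high SD new test (20) onto a low SD reference test (10) clumps scores.
print([linear_equate(x, 50, 20, 50, 10) for x in (30, 50, 70)])   # [40.0, 50.0, 60.0]
```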

Mapping a new, difficult test to an easier reference test with the same SD increases the values on the equating line as well as truncates it. Any new test scores over 30 on Chart 52 have no place to be plotted on the reference test scale.

Equating with an increase in both SD and average test score expands the distribution and truncates the equating line even more (Chart 52). A comparison of the two above situations as parallel lines (Chart 53) helps to clarify the differences.
Both raise the new, difficult test's average test score of 20 counts to 30 counts on the reference scale. In this simple example based on a normal distribution, the remaining values increase uniformly: in equal steps of 10 with the same SD, and of 15 when mapping to the larger SD.

The significance of this is that in the real world, test scores are not distributed in nice ideal normal distributions. The equating line can assume many shapes and slopes.

The unit of measure needed to plot an equating chart must include equivalent portions of the two distributions. Percentage is a convenient unit: equipercentile equating. [More on this in the next post.]

Whether Test A is the reference test, Test B is the reference test, or both are combined as one analysis is the difficult subjective call of the psychometrician. So much depends on luck on test day related to the test blueprint, the item writers, the reviewers, the field test results, the test maker, the test takers, and many minor effects on each of these categories.

This is little different from predicting the weather or the stock market, IMHO. [The highest final test scores at the Annapolis Naval Academy were during a storm with very high negative air ion concentrations.] The above factors also need to include the long list of excuses built into institutionalized education at all levels.

On a four-option item, chance alone injects an average 25% value (that can easily range from 15 to 35%) when students are forced to mark every item on a traditional multiple-choice (TMC) test. Quality is suppressed into quantity by only counting right marks: quality and quantity are therefore linked into the same value. High TMC test scores have higher quality than lower test scores, but this is generally ignored.
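A minimal simulation, assuming a 40-item test (the length used in these posts) and students who blindly mark every four-option item, shows where those numbers come from:

```python
# A minimal sketch of the guessing effect on a forced-choice test: 10,000
# simulated students blindly marking a 40-item, four-option test.
import numpy as np

rng = np.random.default_rng(1)
guess_scores = rng.binomial(n=40, p=0.25, size=10_000) / 40 * 100

print(round(guess_scores.mean(), 1))          # averages about 25%
print(np.percentile(guess_scores, [5, 95]))   # roughly the 15% to 35% range
```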

It does not have to be that way. Both the partial credit Rasch model IRT and Knowledge and Judgment Scoring permit students to report, accurately, honestly, and fairly, what they trust they know and can do and what they have yet to learn. No guessing is required. Both paper tests and CAT tests can accept, “I trust I know or can do this,” “I have yet to learn this,” and, if good judgment does not prevail, “Sorry, I goofed.”  Just score 2, 1, and 0 rather than 1 for each right mark (for whatever reason or accident).
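A minimal sketch of that 2/1/0 scoring beside TMC right-count scoring; the response codes ('R' right mark, 'O' omitted, 'W' wrong mark) and the ten-item answer string are my own illustration, not output from any of the free software:

```python
# A minimal sketch of Knowledge and Judgment Scoring (2/1/0) beside TMC (1/0).
# 'R' = right mark, 'O' = omitted ("I have yet to learn this"), 'W' = wrong mark.
responses = ['R', 'R', 'O', 'W', 'R', 'O', 'R', 'W', 'O', 'R']

kjs_points = {'R': 2, 'O': 1, 'W': 0}   # knowledge and judgment both rewarded
tmc_points = {'R': 1, 'O': 0, 'W': 0}   # right marks only

kjs_score = 100 * sum(kjs_points[r] for r in responses) / (2 * len(responses))
tmc_score = 100 * sum(tmc_points[r] for r in responses) / len(responses)

marked = [r for r in responses if r != 'O']
quality = 100 * sum(r == 'R' for r in marked) / len(marked)   # % right of items marked

print(kjs_score, tmc_score, round(quality, 1))   # 65.0 50.0 71.4
```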

A test should encourage learning. TMC at the lower scores is punitive. By scoring for both quantity and quality (knowledge and judgment) students receive separate scores, just as is done on most other assessments: “You did very well on what you reported (90% right) but you need to do more to keep up with the class” rather than “You failed again with a TMC score of 50%.”

Classroom practice during the NCLB era tragically followed the style of the TMC standardized tests conducted at the lowest levels of thinking. The CCSS tests need to model rewarding students for their judgment as well as right marks. [We can expect the schools to again doggedly try to imitate.] It is student judgment that forms the basis for further learning at higher levels of thinking: one of the main goals of the CCSS movement. The CCSS movement needs to update its use of multiple-choice to be consistent with its goals.

Equating TMC meaninglessness does not improve the results. This crippled form of multiple-choice does not permit students to tell us what they really know and can do that is of value for further learning and instruction.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, July 3, 2013

Visual Education Statistics - Standardized Tests


                                                              17
Standardized test makers use statistics to predict what may happen; classroom statistics describe what has happened. Classroom tests include two or three dozen students. Standardized test making requires several hundred students. Classroom tests are given to find out what a student has yet to learn and what has been learned. Standardized tests are generally given to rank students based on a benchmark test sample. Classroom and standardized tests have other significant differences even though they may use many of the same items.

I took the two classroom charts (37 and 38 in a previous post) and extended the standard deviations (SD) from 5-10% to 10-30%, a more realistic range for standardized tests (Chart 44). At a 70% average score and 20% SD, the normal curve plots of 40 students by 40 items started going off scale. I then reversed the path back to the original average score of 50% as the SD rose from 20% to 30%.

The test reliability (KR20) continued to rise with the SD for these normal distributions set for maximum performance. The item discrimination (PBR) rose slightly. The relative SEM/SD value decreased (improved) from 0.350 to 0.157 as test reliability increased (improved).
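For anyone who wants to see where these numbers come from, here is a minimal sketch of KR20, the average PBR, and the SEM/SD ratio computed from a 0/1 mark matrix. The 40-by-40 size follows the posts; the matrix itself is generated from a simple logistic ability/difficulty model as a stand-in for a VESE table, so the printed values are only illustrative.

```python
# A minimal sketch of KR20, average PBR, and SEM/SD from a 0/1 mark matrix
# (rows = students, columns = items). The matrix is simulated, not a VESE table.
import numpy as np

rng = np.random.default_rng(2)
ability = np.sort(rng.normal(0.0, 1.0, 40))             # 40 students
difficulty = np.linspace(-1.5, 1.5, 40)                 # 40 items
p_right = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
marks = (rng.random((40, 40)) < p_right).astype(int)

k = marks.shape[1]
totals = marks.sum(axis=1)                              # student scores (counts)
p = marks.mean(axis=0)                                  # item difficulties
var_total = totals.var(ddof=1)

kr20 = (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / var_total)
sd = totals.std(ddof=1)
sem = sd * np.sqrt(1 - kr20)                            # SEM in score counts

# Point-biserial: correlation of each item's 0/1 column with the total score.
pbr = np.array([np.corrcoef(marks[:, i], totals)[0, 1] for i in range(k)])

print(round(kr20, 2), round(pbr.mean(), 2), round(sem / sd, 3))
```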

The two tests with average test scores of 50% yielded very different test reliability and item discrimination values for SD values of 10% and 30% on Chart 44; the greater the distribution spread, the higher the KR20 and PBR values. [I plotted the N – 1 SD to show how close the visual education statistics engine (VESE) tables were to their expected normal curves.]

The SD is then a key indicator of test performance; the spread of the student score distribution is the main goal for standardized test makers. It is also very sensitive to extreme values. The 30% SD plot was made by teasing the VESE table that I set for 30% SD. The original SD value was near that for a perfect Guttman table (each student score and each item difficulty appear only once), about 28%. By moving four pairs of marks, near the extreme ends of the distribution, one count more toward the end, the SD rose to 30%. That is moving four pairs of marks out of 400 pairs one count each to change the SD by 2%.

The standard error of measurement (SEM) under optimum normal test conditions remained about 4.4% (Chart 44). So, 4.4 x 3 = 13.2%. A difference in a student’s performance of more than 13.2% would be needed to accept the scores as representing a significant improvement with a test reliability of 0.95. None of the above mark patterns were mixed, which is an unrealistically optimum performance.
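As a quick check, a minimal sketch of the usual relation SEM = SD x sqrt(1 - reliability), using a 20% SD and a 0.95 KR20 as representative values (the exact Chart 44 readings will differ a little):

```python
# A minimal sketch of the SEM and the 3 x SEM band from SD and reliability.
# The 20% SD and 0.95 KR20 are representative values, not exact chart readings.
import math

sd = 20.0      # student score SD, in %
kr20 = 0.95    # test reliability

sem = sd * math.sqrt(1 - kr20)
print(round(sem, 1))        # about 4.5%, close to the 4.4% quoted above
print(round(3 * sem, 1))    # about 13.4%: the gain needed to call an improvement real
```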

I looked again at the effect of mixing right and wrong marks on an item mark pattern with a higher SD value than found in the classroom (Chart 45). The change from a SD of 10% to 20% was much smaller than I had anticipated. The effect of deeper mixing was again linear.

Average item difficulty sets limits on the maximum PBR that can be developed (Chart 46). In a perfect world where all items are marked either all right or all wrong, the maximum PBR is 1.0 for individual items.

Looking back at prior posts, I found lower values on a perfect Guttman table (0.84) and a normal curve table set at 30% SD (0.85). The PBR declined as the SD was set to 20% and then 10% (Chart 46).
These values hold for tests with average test scores that range from 50% to 70%.

There is now enough information to construct the playing field upon which psychometricians play (Chart 47). I chose two scoring configurations: Perfect World and Normal Curve with an SD of 20%. The area in which standardized tests exist is a small part of the total area that describes classroom tests. The average student score and item difficulty were set at 50%.

An item mark pattern at 50% difficulty can produce a PBR of 1.0 in a perfect world (blue). All right marks are together and all wrong marks are together. The PBR drops to zero with complete mixing (Table 20). It falls to a -1.0 when all right marks are together at the lower end of the mark pattern.
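Here is a minimal sketch of those three boundary cases in the perfect-world setting, where every student scores either 100% or 0%; the 40-student scale follows the posts, and the mark patterns are my own illustrations:

```python
# A minimal sketch of the PBR boundaries for a 50%-difficulty item in the
# "perfect world" case where every student scores either 100% or 0%.
import numpy as np

totals = np.array([40] * 20 + [0] * 20)              # perfect-world total scores

aligned   = np.array([1] * 20 + [0] * 20)            # right marks with the high scorers
mixed     = np.array(([1, 0] * 10) + ([1, 0] * 10))  # right marks scattered evenly
reversed_ = np.array([0] * 20 + [1] * 20)            # right marks with the low scorers

for pattern in (aligned, mixed, reversed_):
    print(round(np.corrcoef(pattern, totals)[0, 1], 2))   # 1.0, 0.0, -1.0
```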

The area for the normal curve distribution (red) with a SD of 20% fits inside the perfect world boundary. This entire area is available to describe classroom test items. Items that are easier or more difficult than 50% reduce the maximum possible PBR. They have shorter mark patterns. And here too, fully mixed patterns drop the PBR to zero.

We can now see the problem psychometricians face in making standardized tests. The standardized test area is about 1/8th of the classroom area. Standardized tests never use negative items (which almost excludes misconceptions, since these cannot be distinguished from difficult items using traditional multiple-choice scoring, as they can be using Knowledge and Judgment Scoring).

Chart 44 indicates an average PBR of over 0.5 is needed for the desired test reliability of over 0.95 under optimum conditions (no mark pattern mixing). With just ¼ mixing, the window for usable items becomes very small. The effect of mixing right and wrong marks on an item mark pattern varies with item difficulty. A test averaging 75% right with unmixed items would be the same as a test averaging 50% right with partially mixed items.

A 2008 paper from Pearson, by Tony D. Thompson, confirms this situation. “This variation, we argue, likely renders non-informational any vertical scale developed from conventional (non-adaptive) tests due to lack of score precision” (page 4). “Non-informational” means not useful, not valid, does not look right, and does not work, IMHO. “Conventional” means, in general, paper tests and the fixed form tests being developed by PARCC for online delivery for the Common Core State Standards (CCSS) movement.

This comment may be valid for “many educational tests” (page 14). “Also, if an individual’s observed growth is much larger than the associated CSEM, then we may be confident that the individual did experience growth in learning.” This indicates that using simulations within the playing field, as Thompson did, confirms my exploration of the limits of the playing field. [And the CSEM, which is applied to each score, is more precise than the SEM based on the average test score.]

“While a poorly constructed vertical scale clearly cannot be expected to yield useful scores, a well-defined vertical scale in and of itself does not guarantee that reported individual scores will be precise enough to support meaningful decision-making” (page 28). This cautionary note was written in 2008, several years into the NCLB era.

The VESE tables indicate that the “best we can do” is not good enough to satisfy marketing department hype (claims). Testing companies are delivering what politicians are willing to pay for: a ranking of students, teachers, and administrators based only on a test producing scores of questionable precision. Additional use of these test scores is problematic.

An unbelievable situation is currently being challenged in court in Florida. Student test scores were used to “evaluate” a teacher who never had the students in class! It reveals the mindset of people using standardized test scores. They clearly do not understand what is being measured and how it is being measured. [I hope I do by the end of this series.] Just because something has been captured in a number does not mean that the number controls that something.

Scoring all the data that can be in the answer sheets would provide the information (which is repeatedly sought but ignored in traditional multiple-choice) needed to guide student, teacher and administrator development. Schools designed for failure (“Who can guess the answer?”), fail. Schools designed for success have rapid, effective, feedback with student development (judgment) held as important as knowledge and skills. Judgment comes from understanding, a goal of the CCSS movement.

 - - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):