Multiple-Choice Reborn: March 2013

Wednesday, March 27, 2013

Visual Education Statistics - Normal Curve

The count of right marks on a test is the raw material fed into statistical calculations. Not all right marks have the same value or meaning, though traditional multiple-choice (TMC) ignores this fact (see prior posts). The following model, operating both in a perfect world and with real data, will do the same.

Able and inspired students and teachers see mastery as their goal. In this perfect world example, all of these students receive the same test score (85%). There is no variation in the scores.

Unable and uninspired students and teachers see passing as their goal. In this perfect world, all of these students receive the same test score (65%). There is no variation in the scores.

There is no need to invent statistics to describe test results if all students received the same test scores.

In a perfect world with 10 passing the test and 10 mastering the lesson, a new statistic appears (the mean or average of the 20 scores) of 75%. Each test score is 10 points away from (above or below) the class average score of 75%.

The passing score and the mastery score pass directly through the normal curve at the points where the curve changes from flexing down from the mean to flexing up. These two points lie one standard deviation of the mean, most often shortened to standard deviation (SD), below and above the mean. The curve can be described as 75% ± 10%. This standard measure is expected to contain about 2/3 (68%) of the scores on a real student test.

Even though no student earned a score of 75%, this value represents the average score for the entire test. This model score distribution from 20 students in no way looks like the normal curve, the distribution expected when scores include random error.

Random error injects variation into test results. Let’s say one lucky student scored 90% right (an increase of 5%) instead of 85%. To keep the example balanced would require one unlucky student to score 60% right (a decrease of 5%) instead of 65%. This would stretch out the distribution (Chart 5).

But stretching increases the variation in the distribution. The increase in variation can be balanced by two students scoring 70% (an increase of 5%) instead of 65% and another two students scoring 80% (a decrease of 5%) instead of 85%.

[It takes moving two scores closer to the mean to balance one score moved farther from the mean, since the variation is expressed in squared values. Distances from the mean change linearly (1, 2, 3, 4, 5), but the deviation values change as squares (1, 4, 9, 16, 25).

Squaring is used so that all values are positive, but it distorts the distribution. The distance between 2 and 4 is 2; the distance between their squares, 4 and 16, is 12.]
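The balancing act described above can be checked with a few lines of Python. This is a minimal sketch, assuming the model class is exactly the 20 scores described in the text (the charts may differ in detail) and assuming the population form of the standard deviation; the variable names are made up for illustration.

```python
from statistics import mean, pstdev

# Perfect world: 10 students pass (65%) and 10 master the lesson (85%).
perfect = [65] * 10 + [85] * 10

# One lucky student moves from 85 to 90 and one unlucky student from 65 to 60.
stretched = [60] + [65] * 9 + [85] * 9 + [90]

# Two students move from 65 up to 70 and two from 85 down to 80 to offset the stretch.
balanced = [60] + [65] * 7 + [70] * 2 + [80] * 2 + [85] * 7 + [90]

for name, scores in [("perfect", perfect), ("stretched", stretched), ("balanced", balanced)]:
    print(f"{name:10s} mean = {mean(scores)}  SD = {pstdev(scores):.2f}")

# Squared-deviation bookkeeping for one side of the mean:
gain = 15**2 - 10**2        # one score moved from 10 to 15 points away
loss = 2 * (10**2 - 5**2)   # two scores moved from 10 to 5 points away
print("added by one score moving out:", gain)
print("removed by two scores moving in:", loss)
```

Under these assumptions the mean stays at 75 in all three cases, while the SD moves from 10 up to about 10.6 and back down to about 9.9. The balance is close rather than exact, because one score moving out adds 125 squared units and two scores moving in remove 150.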

Doubling the amount of error (Chart 6) brings the score distribution closer to the normal distribution of error (the normal curve). Again the standard deviation remains 10. The distribution now looks more like traditional multiple-choice classroom test results. A bi-modal distribution was very common in my remedial biology class. The score distribution can be made to look even more like the normal curve by tweaking additional clusters of scores.

The normal curve does not describe the actual observed score distribution. The normal curve always views a distribution through the lens of three points: the mean, plus 1 SD and minus 1 SD.

A small SD means the distribution is narrow. A large SD means the distribution is more spread out.

The SD is never concerned with the location of your individual test score. Plus and minus 1 SD on the score scale is the region where about 2/3 of the test scores are expected to fall. There is no way to predict specifically where your score will fall, only the region in which it is most likely to fall. To find your test score, you must take the test.

The Nursing124 test data (Table 2) will now be used to apply the above concepts. In Chart 7, the normal curve includes 15 of the 22 scores within one SD of the mean. That is about 2/3, or 68%, which matches the expected value of 68%.
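The comparison between the observed and expected share can be sketched as follows. This is only an illustration: the expected share comes from the standard normal curve, the observed share uses the 15-of-22 count quoted above, and the variable names are invented here.

```python
from statistics import NormalDist

# Share of a normal curve lying within one SD of the mean.
standard = NormalDist()  # mean 0, SD 1
expected_share = standard.cdf(1) - standard.cdf(-1)

# Share reported for the Nursing124 data: 15 of 22 scores.
observed_share = 15 / 22

print(f"expected: {expected_share:.1%}")
print(f"observed: {observed_share:.1%}")
```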

Each type of item (Mastery, Unfinished, and Discriminating) generates a normal curve (Chart 8) from the item’s score distribution. One student failed to get any right marks on the discriminating items, but two marked all of them right.

[I learned from Chart 8 that a calculated normal curve for discriminating items ignores the extreme values of 20% and 40% as well as zero percent and 100%; however, these extreme values are the main contributors when calculating the SD in the next post. The actual distribution has been reduced to a numerical abstraction. I used the Excel function NORM.DIST, which refers only to the mean and the SD.]
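For readers without Excel, a rough Python counterpart of NORM.DIST(x, mean, sd, FALSE) is sketched below. The mean of 75 and SD of 10 are the model values from earlier in this post, not the Chart 8 values, and the helper name normal_curve is hypothetical.

```python
from statistics import NormalDist

def normal_curve(mean, sd, xs):
    """Height of the normal curve at each x, computed from the mean and SD only."""
    dist = NormalDist(mu=mean, sigma=sd)
    return [dist.pdf(x) for x in xs]

# Assumed values: the model class from earlier in this post (mean 75, SD 10).
xs = range(45, 106, 10)  # 45, 55, ..., 105
heights = normal_curve(75, 10, xs)
for x, h in zip(xs, heights):
    print(f"{x:3d}%  {h:.4f}")
```

No individual score enters the calculation, which is the sense in which the actual distribution has been reduced to a numerical abstraction.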

The total score normal curve is composed of the three (Mastery, 8 items; Unfinished, 8 items; and Discriminating, 5 items) sub-test normal curves (Chart 9). Every average score or mean can generate a normal curve. Visually the normal curve transmits more information in one view (subject to distortion by extreme values) about the raw score data than the average or the SD.

The uniqueness of each mark, student score, and item difficulty has now been reviewed. Unless some strongly biasing factor is involved, most factors are ignored using traditional multiple-choice (TMC). Provision is made in Break Out (Sheet 2), and PUP 5.22 (Table 2), to edit and rescore the test when an item is found to just be too bad to use or a spirited class discussion earns a point for everyone on the item. Otherwise, the only thing that counts using TMC is right and wrong: 1 and 0.

[PCM counts wrong, judgment, and right marks as 0, 1, and 2. KJS counts them as 0, 0.5, and 1. Both scoring methods maintain the same value ratio for wrong, judgment, and right marks. Both promote student development of high quality judgment.

TMC uses 0, 0.25, and 1 for four-option items, but this fact is hidden by forcing students to mark all items or accept a 0 for a blank. This promotes guessing. Knowledge Factor uses 0, 0.75, and 1, which inverts the TMC value for judgment. This demands high quality judgment in high-risk occupations and in serious preparation for standardized tests.]
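The four sets of values can be compared side by side on one answer sheet. This is a minimal sketch under stated assumptions: the answer sheet is hypothetical, the function percent_score is invented for illustration, and treating the middle value as the credit attached to a judgment mark is one reading of the text rather than any scoring program's actual code.

```python
# Values per mark as listed in the text: wrong, judgment, right.
WEIGHTS = {
    "PCM": {"wrong": 0, "judgment": 1, "right": 2},
    "KJS": {"wrong": 0, "judgment": 0.5, "right": 1},
    "TMC": {"wrong": 0, "judgment": 0.25, "right": 1},
    "Knowledge Factor": {"wrong": 0, "judgment": 0.75, "right": 1},
}

def percent_score(marks, scheme):
    """Score a list of marks as a percent of the maximum possible under a scheme."""
    weights = WEIGHTS[scheme]
    earned = sum(weights[m] for m in marks)
    return 100 * earned / (weights["right"] * len(marks))

# Hypothetical answer sheet: 6 right, 2 judgment, 2 wrong.
marks = ["right"] * 6 + ["judgment"] * 2 + ["wrong"] * 2
for scheme in WEIGHTS:
    print(f"{scheme:17s} {percent_score(marks, scheme):.0f}%")
```

On this sheet PCM and KJS give the same percent score, which is what maintaining the same value ratio means in practice.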

The normal curve can only be accurately drawn from large score distributions. It can be calculated for tests of any size, based on the test mean and SD.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand the change from TMC to KJS (tricycle to bicycle):

Wednesday, March 20, 2013

Visual Education Statistics - Average

Statistic Two: The average or mean collects counts into one descriptive statistic. The percent test score (79.9%) is also the average percent student score and the average percent item difficulty (Table 3). The average student score count is 16.77 and the average item difficulty count is 17.57. Averaging is not reversible: you cannot redisplay the distribution of individual marks knowing only the average or mean score.
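The three figures above are tied together by simple arithmetic. The sketch below assumes 22 students and 21 items (the Nursing124 dimensions given in the Count post) and a total of 369 right marks; that total is inferred from the reported averages and is not a figure quoted in the post.

```python
students, items = 22, 21

# Assumption: 369 total right marks, inferred from the reported averages.
total_right = 369

avg_student_count = total_right / students             # right marks per student
avg_item_count = total_right / items                   # right marks per item
percent_score = 100 * total_right / (students * items)

print(round(avg_student_count, 2), round(avg_item_count, 2), round(percent_score, 1))

# The same percent appears whether you average over students or over items.
assert abs(100 * avg_student_count / items - percent_score) < 1e-9
assert abs(100 * avg_item_count / students - percent_score) < 1e-9
```

Many different mark tables share these same averages, which is why the individual marks cannot be recovered from them.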

The prior post reviewed the simple row and column table used to record student marks on multiple-choice tests. It also pointed out that the test score is derived from marks that can have a wide range of values or meanings. The two that are easily extracted are knowledge (the number of right marks) and judgment (the percent of marks that are right using Knowledge and Judgment Scoring).

Table 3 shows that the test score is also the average of the three sub-tests, defined by item performance, found in most classroom tests: Mastery, Unfinished, and Discriminating (MUD).

Chart 1 shows the average scores of 93%, 73%, and 71% for the three sub-tests and 80% for the total test score.
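The relation between the sub-test averages and the total score can be sketched as an item-weighted average. The item counts (8, 8, and 5) are taken from the Normal Curve post; pairing the three averages with Mastery, Unfinished, and Discriminating in that order is an assumption, as are the variable names.

```python
# Sub-test average (%) and item count, paired in MUD order (an assumption).
subtests = {
    "Mastery": (93, 8),
    "Unfinished": (73, 8),
    "Discriminating": (71, 5),
}

total_items = sum(n for _, n in subtests.values())
weighted = sum(avg * n for avg, n in subtests.values()) / total_items
simple = sum(avg for avg, _ in subtests.values()) / len(subtests)

print(f"items: {total_items}")
print(f"weighted by item count: {weighted:.1f}%")
print(f"simple average:         {simple:.1f}%")
```

Weighting by item count gives a value close to the reported 80%; the small gap from 79.9% is what one would expect from the sub-test averages being rounded to whole percents.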

The MUD distribution makes evident that a mark of 1 or 0 does not have the same value or meaning for every item in the test. Both student score marks and item difficulty marks (item scores) can report different things, all of which are ignored when just counting 1’s and 0’s with TMC.

A 1 on a Mastery item is a simple check list of what the class is expected to know or do. A 0 is not expected. [Mastery items are used to adjust the average test score.]

A 1 on a Discriminating item places that student in a group that knows or can do something that the group receiving zeros does not know or cannot do. There is a group of students in this class that needs to catch up with the rest of the class. Grouping helps identify instructional problems. [Discriminating items produce the score distribution needed to assign grades.]

Unfinished items indicate a failure in instruction, learning and/or testing. Having almost all 1’s in both Mastery and Discriminating items identifies a student functioning at higher levels of thinking.

[A fourth group, Misconceptions, identifies items that mystify students. They believe they can use the item to report something they know or can do, but are in error. Misconceptions are only identified when students elect to use KJS. If a minority of the class, less than average, marks an answer and most of the answers are wrong, it is guessing. When a majority of the class, more than average, marks an answer and most of the answers are wrong, it is a misconception.]

Students are rightly interested in their score location in the score distribution, above or below average (I am safe or I need to study more). Classroom teachers are interested in an expected average test score from which they can assign grades.

The process of data reduction from mark distribution to sub-test to total test is not reversible. From this point on we are dealing with averages (means) of groups, not with individual marks. Individual test scores and individual item difficulties are traditionally treated as single entities, but are really the average number of right marks for each case. All of these statistics work the same for TMC, KJS, and PCM scoring.

Neither Table 3 nor Chart 1 (the average or mean) does a good job of capturing the mark distribution in a number. That requires the next statistic: the standard deviation of the mean; usually shortened to standard deviation.

- - - - - - - - - - - - - - - - - - - - -

Help for you and your students to experience and to understand the change from TMC to KJS (tricycle to bicycle):

Wednesday, March 13, 2013

Visual Education Statistics - Count

Two recent news items highlight the problems produced by faulty communication in the assessment, education and political communities. A test may be inadequate to deliver the requested information, which has resulted in two different scenarios: in the state of Washington, the test is used; in Texas, a test was not used but was replaced with a Projection Measure that “was so silly that it was killed” after a brief use.

Large amounts of money are involved in such exercises. A satisfied customer in this area must be able to understand the limits of what is being purchased. We do not want to show up for dinner at 7:00pm to find it was served at 12:00 noon. Dinner and lunch can refer to the same thing and to different things. It depends upon the culture.

Psychometricians have been lax in communicating what they do in an understandable form to the cultures that finance them and to those who attempt to make valid use of their work. During 50 years of experience, I have not found a unified expression of common education statistics or a way of accomplishing that feat that is meaningful and therefore useful. The personal computer, the interactive spreadsheet, and the Internet should now make this possible.

This set of posts is designed so that anyone interested in the topic of multiple-choice testing can see inside of six commonly used education statistics. The series will also include Excel what-if engines to animate them. You only understand after you have experienced. It is only when several statistics are combined that the interactions and limits become visible. Combining statistics interactively also simplifies the naming of variables as only one name is needed where several may be used otherwise.

I will attempt to produce an understandable graphic for each of six common education statistics that I have encountered being used with traditional multiple-choice tests (TMC):

count
average or mean
standard deviation of the mean or the spread of the distribution of scores
test reliability or the ability to reproduce the same scores
standard error of measurement or the range in which a student’s score may fall
item discrimination or the ability of a question to group students into one group that knows (and is lucky) and one group that does not know (and is unlucky).

If you are comfortable with traditional education statistics, you may want to skip to the first spreadsheet: Test Reliability Engine. If you are interested in the findings summary of this audit, skip to [to be posted]. If you are interested in the details as I work through this project, please read on.

Your comments will be appreciated, especially errors and omissions (corrections are easily made on a blog). I want the facts to be readily seen and understood rather than you relying on me as one more authority (“trust me”, from Jungle Book, and any number of commercial, education and political organizations).

Please practice with your students using Break Out (free) to learn to understand the difference between traditional multiple-choice (TMC) and Knowledge and Judgment Scoring (KJS). The Common Core State Standards (CCSS) movement demands that passive pupils become engaged active self-correcting high quality achievers.

The student mark data from the Nursing124.ANS file contains the right marks by 22 students on 21 questions. Extreme scores and difficulties (100%) were eliminated from the 24 by 24 matrix when I was working on my audit of the Rasch model.

Statistic One: Right mark counts yield student scores (rows) and item difficulties (columns). The value of each student score mark (1 or 0) is not affected by item difficulty or the level of thinking used in making the mark. The value of each item difficulty mark or item score (1 or 0) is not affected by student score or student ability. A right mark is a right mark (1). “The more right marks you get, the better” is meaningful to everyone using traditional multiple-choice (TMC).

[The above remarks are prompted from my audit of the Rasch IRT model. The claim (see Number of IRT Parameters) is made that student abilities are independent from item difficulties and item difficulties are independent from student abilities using the one-parameter IRT model. I am willing to believe that theory but I have yet to see it. I do not know or understand it based only on how estimates of student ability and item difficulty are made.]

Counts are typically listed in a mark, or item score, table. Student scores are entered at the end of rows. Item difficulties are listed at the lower end of columns. This looks very clean and simple (1 and 0), especially when compared to what is being attempted to be measured. A mark of 1 or 0 may result from many factors that are related to the item, or to the student, or to factors indirectly related to the test environment (race, religion, parenting, etc.).
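A mark table of this kind is easy to sketch in code. The table below is a made-up example with 4 students and 3 items, not the Nursing124 data, and the variable names are illustrative only.

```python
# Hypothetical mark table: one row per student, one column per item (1 = right, 0 = wrong).
marks = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]

# Student scores are the row totals, listed at the end of each row.
student_scores = [sum(row) for row in marks]

# Item difficulties are the column totals, listed at the foot of each column.
item_difficulties = [sum(col) for col in zip(*marks)]

print("student scores:   ", student_scores)
print("item difficulties:", item_difficulties)
```

Both sets of totals add up to the same number of right marks, because they count the same 1s in two directions.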

A good analogy is a test plot of corn kernels from several ears of corn (rows) placed in several types of soil (columns). The scoring is based on the seedlings. Several factors can be scored: color; development of leaf, stem, and roots; size of plant, stem and root; sturdiness; etc. But in education, with traditional multiple-choice (TMC), there would be but two scores: 1 for a seedling, and 0 for none. A 1 would be recorded for both a corn seedling and a weed seedling. A weed corresponds to good luck in marking a right answer. All the other factors that influence student marks are ignored.

Even in Table 2 all right answers have been replaced with a single symbol to make the chart easier to view. That symbol will become a 1 using TMC. Each wrong mark, regardless of the answer option, will become a 0.

But one factor, other than right/wrong, can be obtained directly from the answer sheets. That factor is student judgment. Student judgment is as important as knowing and doing, in moving students from lower to higher levels of thinking. The CCSS movement demands the development of student judgment.

Counting right marks is simple. However, each mark is not reporting the exact same thing. Forcing students to mark “the best answer” and counting right marks produces a quantitative score locked to a qualitative score (that is why only one score is reported using TMC, as the two scores are identical). That deficiency is easily corrected by the Rasch IRT partial credit model (PCM) or by Knowledge and Judgment Scoring (KJS).

KJS yields independent scores of quantity (1 or 0) and scores of quality (scoring a student’s judgment to report what is actually known or can be done, that is the basis for further learning and instruction). Weeds can be differentiated from corn.

With KJS both teachers and high quality students know what is known and can be done during the test as well as afterwards. By scoring for knowledge and judgment (quantity and quality) we can reduce the weeds in the corn. We can identify and correct misconceptions. Instruction can be more effective.

The most important thing that can be said at this point is that what you count and how you count determines the value of everything that follows. TMC, with right mark scoring, extracts the least amount of information with the least value from a multiple-choice test. You get the least return for the time and money invested: a ranking.

Tradition seems to be the main reason TMC is still used. KJS and the PCM both shift the responsibility for learning and reporting from the teacher to the student. This shift is now a key element in the Common Core State Standards (CCSS) movement. It promotes the change from a classroom of passive followers to an active classroom of self-correcting high quality successful achievers. Assessing judgment may now become acceptable, and even required, when using multiple-choice tests (as it is in most other assessments).

Students like to be free to report what they trust they know and can do. But this must be experienced to be understood, appreciated, and accepted. After two tests, over 90% of my 3,000 students switched from guessing at answers on a multiple-choice test, to using it to report what they trusted they knew or could do. Teachers also need to experience before they understand (scoring judgment with multiple-choice tests is still a new professional development topic).

The CCSS movement demands doing, not talking and listening. To make the most of this series of posts, download Break Out. (It is entirely free, open source code.) Use it to help break out of an old antiquated failing tradition that emphasizes one right answer instead of the CCSS requirement of developing the ability and mindset to apply what is known to a range of questions or tasks.
