## Wednesday, November 6, 2013

### The Value and Meaning of a Mark

The bet in the title of Catherine Gewertz’s article caught my attention: “One District’s Common-Core Bet: Results Are In”. As I read, I realized that the betting that takes place in traditional multiple-choice (TMC) testing was being given arbitrary valuations to justify the difference between a test score and a classroom observation. If the two agreed, that was good. If they did not, the standardized test score was dismissed.

TMC gives us the choice of a right mark and several wrong marks. Each is traditionally given a value of 1 or 0. This simplification, carried forward from paper and pencil days, hides the true value and the meanings that can be assigned to each mark.

The value and meaning of each mark changes with the degree of completion of the test and the ability of the student. Consider a question with one right answer and three wrong answers; four options is now a popular format for standardized tests.

Consider a TMC test of 100 questions. The starting score, on average, is 25. Every student knows this. Just mark an answer to each question, then change the few marks you can trust you know to the right answers. With good luck on test day, the score may be high enough to pass. If a student marked 60 correctly, the final score is 60. But the quality of this passing score is also 60%.

Part of that 60% represents what a student knows and can do, and part is luck on test day. A passing score can be obtained by a student who knows or can do less than half of what the test is assessing; a quality below 50%. This is traditionally acceptable in the classroom. [TMC ignores quality. A right mark on a test with a score of 100 has the same value, but not the same meaning as a right mark on a test with a score of 50.]
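A minimal sketch of this arithmetic; the specific numbers (47 known of 100, 4 options) are illustrations of the knowledge-plus-luck split, not data from the article:

```python
def expected_tmc_score(known, questions=100, options=4):
    """Expected TMC score when a student marks every question:
    right marks for what is actually known, plus an expected
    1-in-`options` lucky hit on each remaining question."""
    return known + (questions - known) / options

# A student who can trust fewer than half of 100 questions can still
# expect a passing score near 60; marking blindly yields the design 25.
print(expected_tmc_score(47))  # 60.25
print(expected_tmc_score(0))   # 25.0
```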

A wrong mark can also be assigned different meanings. As a rule of thumb (based on the analysis of variance, ANOVA, a time-honored method of data reduction), if fewer than five students mark a wrong answer to a question, the marks on the question can be ignored. If fewer than five students make the same wrong mark, the marks on that option can be ignored. This is why Power Up Plus (PUP) does not report statistics on wrong marks, but only on right marks. There is no need to clutter up the reports with potentially interesting but useless and meaningless information.

PUP does include a fitness statistic not found in any other item analysis report that I have examined. This statistic shows how well the test fits student preparation. Students prepare for tests; but test makers also prepare for the abilities of test takers.

The fitness statistic estimates the score a student is expected to get if, on average, as many wrong options are eliminated as are non-functional on the test, before guessing, with NO KNOWLEDGE of the right answer. This is the best guess score. It is always higher than the design score of 25%. The estimate ranged from 36% to 53%, with a mean of 44%, on the Nursing124 data. Half of these students were self-correcting scholars. The test was then a checklist of how they were expected to perform.
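The estimate can be sketched as follows; the 1.75 non-functional wrong options per question is a made-up figure chosen only to land near the 44% mean reported above:

```python
def best_guess_score(options=4, nonfunctional=0.0):
    """Expected score from pure guessing, with NO knowledge of the right
    answer, after eliminating the wrong options that do not function."""
    return 1 / (options - nonfunctional)

print(best_guess_score())                              # 0.25 -- the design score
print(round(best_guess_score(nonfunctional=1.75), 3))  # 0.444
```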

With the above in mind, we can understand how a single wrong mark can be devastating to a test score. But a single wrong mark not shared by the rest of the class can be taken seriously or ignored (just as can a right mark on a difficult question by a low scoring student).

To make sense of TMC test results requires both a matrix of student marks and a distribution of marks for each question (Break Out Overview). Evaluating only an individual student report gives you no idea whether a student missed a survey question that every student was expected to answer correctly or a question that the class failed to understand.

Are we dealing with a misconception? Or a lack of performance related to different levels of thinking in class and on the test; or related to the limits of rote memory to match an answer option to a question? [“It’s the test-taking.”] When does a right mark also mean a right answer or just luck on test day? [“This guy scored advanced only because he had a lucky day.”]

Mikel Robinson, as an individual, failed the test by 1 point. Mikel Robinson, as one student in a group of students, may not have failed. [We don’t really know.] His score just fell on the low side of a statistical range (the conditional standard error of measurement; see a previous post on CSEM). Within this range, it is not possible to differentiate one student’s performance from another’s using current statistical methods and a TMC test design (students are not asked if they can use the question to report what they can trust they actually know or can do).

We can say that if he retook the test, the probability of passing may be as high as 50%, or more, depending upon the reliability and other characteristics of the test. [And the probability of those who passed by 1 point then failing by one point on a repeat of the test would be the same.]
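That retest probability can be sketched with a normal error model; the cut score of 60 and the CSEM of 2.3 raw points are assumed values for illustration, not figures from Mikel’s test:

```python
import math

def retest_pass_probability(observed, cut, csem):
    """Probability a retest score lands at or above the cut, treating
    the observed score as the true-score estimate and the measurement
    error as normal with standard deviation = CSEM."""
    z = (cut - observed) / csem
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Failing by one raw point still leaves roughly a 1-in-3 chance on a
# retake; scoring exactly at the cut leaves 50/50.
print(round(retest_pass_probability(59, 60, 2.3), 2))  # 0.33
print(retest_pass_probability(60, 60, 2.3))            # 0.5
```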

These problems are minimized with accurate, honest, and fair Knowledge and Judgment Scoring (KJS). You can know when a right mark is a right answer using KJS or the partial credit Rasch model IRT scoring. You can know the extent of a student’s development: the quality score. And, perhaps more important, your students can trust what they know and can do too, during the test as well as after it. This is the foundation on which to build further long lasting learning. This is student empowerment.

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Other free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

FOR SALE: raschmodelaudit.blogspot.com/2013/10/knowledge-and-judgment-scoring-kjs-for.html

## Wednesday, October 30, 2013

### Growth Mindset

The article by Sarah D. Sparks, http://www.edweek.org/ew/articles/2013/09/11/03mindset_ep.h33.html?r=545317799, starts with a powerful concept: “It’s one thing to say all students can learn, but making them believe it – and do it – can require a 180-degree shift in students’ and teachers’ sense of themselves and of one another.”

The General Studies Remedial Biology course I taught faced this challenge. The course was scheduled at night for three consecutive hours in a 120-seat lecture room. I refused to teach the course until the following arrangements were made:
• The entire text was presented as online reading assignments, by cable in each dormitory room and by off-campus phone service.
• One hour was scheduled for my lecture, after any student presentations related to the scheduled topic.
• One hour was scheduled for written assessment every other week.
• One hour was scheduled for 10-minute student oral reports based on library research, actual research, or projects.

After the first few semesters, students requested the assessment period be moved from the second hour to the first. This turned the course into a seminar for which students needed to prepare on their own before class.

Only Knowledge and Judgment Scoring (KJS) was used the first few semesters, with ready acceptance by the class. The policy of busing in students from outside the Northwest Missouri region brought in protesters: “Why do we have to know what we know, when everywhere else on campus we just mark, and the teacher tells us how many right marks we made?”

Offering both methods of scoring, traditional multiple-choice (TMC) and KJS, on the same test solved that problem. Students could select the method they felt most comfortable with, the one that best matched their preparation.

The student presentations and reports were excellent models for the rest of the class. They showed the entire class the interest in the subject and the quality of work these students were doing.

KJS provided the information needed to guide passive pupils along the path to becoming self-correcting scholars. As a generality, that path took the shape of a backward J. First they made fewer wrong marks, next they studied more, and finally they switched from memorizing nonsense to making sense of each assignment.

Over time they learned they were now spending less time studying (reviewing everything) and getting better grades by making sense as they learned; they could actually build new learning on what they could trust they had learned. They could monitor their progress by checking their quality score and their quantity score. Get quality up, interest and motivation increase, and quantity follows.

The tradition of students comparing their scores with the rest of the class, to see if they were safe, needed to study more, or already had a higher grade than expected when enrolling in the course (and could take a vacation), was strong in the fall semester, with the distractions of social groups, football, and homecoming. The results of fall and spring semesters were always different.

There was one dismal failure. With the excellent monitoring of their progress in the course, the idea was advanced to recognize class scholars. These students had, in one combination or another of test scores and presentations, earned a class score that could not be changed by any further assessment. They had demonstrated their ability to make sense of biological literature (the main goal of the course), which, hopefully, would serve them well the rest of their lives, along with the habit of making sense of assignments in their other courses. The next semester all went as planned. Most continued in the class and some conducted study sessions for other students.

The following semester witnessed an outbreak of cheating. Today, Power Up Plus (PUP) gets its name from the original cheat checker added to Power Up. Cheating became manageable by the simple rule that any answer sheet that failed to pass the cheat checker would receive a score of zero. I offered to help any student who wished to protest the rule to the student disciplinary committee. No student ever protested.

[Cheating was handled in class, as use of the university rules was not honored by the administration: you must catch individual students in the act. Computer cheat checkers had the same status then as red-light cameras do now. If more than one student is caught, the problem is seen as the instructor’s, not the students’. We cancelled the class scholar idea.]

We need effective tools to manage student “growth mindset”. The tools must be easy to use by students and faculty. Students need to see how other students succeed, to be comfortable in taking part, and be able to easily follow their progress when starting at the low end of academic preparation of knowledge, skills, and judgment (quality, the use of all levels of thinking).

A common thread runs through successful student empowerment programs: Effective instruction is based on what students actually know, can do, and want to do or take part in. This requires frequent, appropriate assessment at each academic level.

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Other free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

FOR SALE: raschmodelaudit.blogspot.com/2013/10/knowledge-and-judgment-scoring-kjs-for.html

## Wednesday, October 23, 2013

### Alternative Multiple-Choice Origins

Two alternative forms of multiple-choice (AMC) to the traditional multiple-choice (TMC) developed from independent sources. Geoff Masters from Melbourne, Australia, is credited as the developer of the partial credit Rasch model (PCM), a form of item response theory (IRT) analysis, in 1982 (Bond and Fox). It allows students to report what they know (2 points), what they do not know (1 point), and a wrong answer (0 points). It never became popular on classroom or standardized tests.

The second form of AMC was developed at NWMSU. It started as net yield scoring (NYS) on both essay and multiple-choice tests. I needed a way to reduce the amount of reading required in scoring “blue book” essays. A 20-point essay started with 10 points. A point was added for each acceptable, related information bit. A point was subtracted for each unacceptable, incorrect, or unrelated information bit. An information bit was basically a short sentence with correct grammar and spelling. It could also be a relationship expressed as a diagram, sketch, or drawing.

This reduced the amount of reading by more than a third and improved student performance. Snow, filler, and fluff had no value but distracted a student from doing good work. Students needed to exercise good judgment in selecting what they wrote. It was no longer a case of students writing, and the teacher searching for, something that could earn them sufficient credit to pass the course; a lower-level-of-thinking operation that is very common in high schools and colleges. NYS required students to use good judgment as well as be knowledgeable and skilled.
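A sketch of the net yield rule for a 20-point essay; the clamp to the 0-to-20 range is my assumption about how the bounds were enforced:

```python
def net_yield_essay_score(acceptable, unacceptable, start=10, top=20):
    """Net yield scoring (NYS): start a 20-point essay at 10 points,
    add 1 per acceptable information bit, subtract 1 per unacceptable,
    incorrect, or unrelated bit."""
    return max(0, min(top, start + acceptable - unacceptable))

print(net_yield_essay_score(8, 2))   # 16
print(net_yield_essay_score(2, 14))  # 0 (clamped)
```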

This same idea was applied to computer scored multiple-choice tests with interesting results. When both TMC and NYS were offered on the same test, most students selected TMC on their first test. This is what they were familiar with. Over 90% of students elected NYS on their third test. Students also agreed that knowledge and judgment should have equal value.

By 1981 NYS was renamed knowledge and judgment scoring (KJS) to reflect what was being assessed: good judgment and a right answer (2 points), good judgment to report what has yet to be learned with no mark (1 point), and poor judgment, a wrong mark (0 points).
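The 2/1/0 scheme, and the quality score it supports, can be sketched as follows; expressing the 2/1/0 points as a percentage of the points possible is my reading, not a specification from PUP:

```python
def kjs_score(right, omitted, wrong):
    """Knowledge and Judgment Scoring: 2 points per right mark, 1 point
    for the good judgment of omitting what is not yet known, 0 per
    wrong mark; reported as a percentage of the points possible."""
    questions = right + omitted + wrong
    return 100 * (2 * right + omitted) / (2 * questions)

def quality_score(right, wrong):
    """Quality: right marks as a percentage of the marks actually made."""
    marked = right + wrong
    return 100 * right / marked if marked else 0.0

# A careful, struggling student: 50 right, 40 omits, 10 wrong of 100.
print(kjs_score(50, 40, 10))         # 70.0
print(round(quality_score(50, 10)))  # 83
```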

KJS requires and rewards students for using higher levels of thinking. The quality score is independent from the right count score. A struggling student with a test score of 60% may have also earned a quality score of 90%.

With TMC there is no way of knowing what a student with a score of 60% actually knows (when a right mark is a right answer or just luck on test day). With KJS we can know what this student knows with the same degree of accuracy as a student earning a 90% score on a TMC test.

More importantly, this reinforces the student’s sense of self-judgment and encourages effort to do better. It is equivalent to the note a teacher marks on a special paragraph in an essay, “Good work!”

KJS provides the information needed to tell student and teacher what has been learned and what has yet to be learned in an easy to use report. Often a trail of bi-weekly test scores would follow a backward J. Reducing guessing by itself did not increase the test score but moved the score to a higher quality. Low quality students needed to change study habits. Low scoring high quality students needed to study more.

Learning by questioning and establishing relationships provided students the basis for correctly answering questions that they had never seen before. They then stumbled onto what I meant by, “Make things meaningful (full of relationships) if your learning is to be really useful, empowering and easy to remember”. They did not have to review everything for each cumulative test.

The most interesting finding was that when students mastered meaning-making, they found themselves doing better in all of their courses. This is what inspired me to continue to promote Knowledge and Judgment Scoring. Students learn best when they are in charge. The quality score was the “feel good” score for struggling students until their improving development produced the high scores earned by successful self-correcting students.

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Other free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

## Wednesday, October 16, 2013

### Knowledge and Judgment Scoring - Operational to Instructional

23
This post (and the next three) introduces why we need a KJS Group. The software, Power Up Plus (PUP), which contains both Knowledge and Judgment Scoring (KJS) and traditional multiple-choice (TMC), is now free to registered KJS Group members. Version 5.22 is free to teachers and administrators. Please see instructions below.

This reflects a change in use of the software: from an operational program for scoring individual classroom tests to an instructional program to promote student and teacher development in preparation for the CCSS movement assessments. Students and teachers can readily see the difference between lower and higher levels of thinking when students are offered the opportunity to report, in a non-threatening environment, what they actually trust they know and can do, which serves as the basis for further learning and instruction. Practice riding the tricycle is poor preparation for a riding test on a bicycle.

Last week I finished a series of 22 posts on this Multiple-Choice Reborn blog. The series makes clear that no amount of “statistical work” can extract, from TMC marked answer sheets, some of the claims now being marketed about them. These tests can, at best, only do a good job of ranking students.

They so imperfectly and incompletely tell us what students know and can do that North Carolina is now spending six months figuring out how and where to place the cut scores on their new CCSS traditionally scored end-of-grade, multiple-choice math test results.

[They must guess where to put the cut score on the results from uncommitted, low scoring, improperly prepared students, who were guessing at the right answers to questions the test maker guessed would produce a satisfactory score distribution with high statistical reliability and precision. The more nonsensical the student mark data are, the more subjective the process.]

Accurate, honest, and fair testing can be done with Knowledge and Judgment Scoring and the partial credit Rasch model analysis. These methods allow students to report what they actually know and can do that is meaningful, useful, and empowering. Student development (the judgment to appropriately use all levels of thinking) is as important as knowledge and skills for successful students and employees (Knowledge Factor).

The NCLB decade has laid the foundation for real change by making schools designed for failure (that promote students beyond their abilities, rather than developing the necessary abilities for their success) so bad and so visible, that something had to be done. The CCSS movement has rekindled the old alternative (to TMC) testing and authentic testing methods; with the addition of CAT and elaborate assessment methods.

My concern now is that, after expending a large amount of time and money on promoting the CCSS movement ideals, a major part of the assessments will once again be reduced to traditional guess testing at the lowest levels of thinking.

Both KJS and TMC scoring can use the same test questions. In fact both methods are used on the same test to accommodate students working at all levels of thinking and with all degrees of preparation (PUP).

IMHO, KJS is a practical method of achieving the CCSS movement goals. It prepares students for standardized tests presented at all levels of thinking. [I still cannot predict when KJS or the partial credit Rasch model will be used on standardized tests, as current standardized tests are not designed to assess what students know or can do. They are designed, using the fewest questions, to produce an acceptable spread of student scores.]

Rather than a rank of 60 on a test, a student may get a quality score of 90% on questions used to report what the student actually knows and can do, as well as a rank of right marks on the test using KJS. We now know what a “just passing” student knows with the same accuracy as a student earning a 90% score on a traditional test. This can be valuable formative assessment information.

Letting students tell us what they know or can do makes more sense than the guessing game now in use during preparation and assessment. And over 90% of my students preferred Knowledge and Judgment Scoring after just two experiences with it. Even students like an honest and fair test over gambling for a grade.

Past performance in my classroom is no guarantee of performance in your classroom unless you are a likeminded teacher, administrator, or test maker.

[The Educational Software Cooperative, Inc. (non-profit) closed this year (2013) after 20 years of operation during which I was the volunteer treasurer. It was founded to maximize the benefits of an individual computer: infinite patience, non-judgmental, and best of all, instant formative feedback. That level of instruction and record keeping has now been surpassed by the necessity for district wide record keeping systems operating online assessments keyed to CCSS learning objectives.]

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

## Wednesday, October 9, 2013

### Multiple-Choice Test Analysis - Summary

22
The past 21 posts have explored how classroom and standardized tests are traditionally analyzed. The six most commonly used statistics are made fully transparent in Post 10, Table 15, the Visual Education Statistics Engine (VESE) [free VESEngine.xlsm or VESEngine.xls]. One more statistic was added for current standardized tests. Numbers must be meaningful and understood to have valid, practical value.

• Count: The count is so obvious that it should not be a problem. But it is a problem in education. Counting right marks is not the same as counting what a student knows or can do. Also, a cut score is often set by selecting a point in a range from 0% to 100%. A cut score of 50 means 50%. But the test, when administered as traditional multiple-choice with 4-option questions, starts each student at 25%. [There is no way to know what low scoring students know, only their rank.]

• Average: Add up all of the individual student scores and divide by the number of students for the class or test average score. [There is no average student.] Classes or tests can be compared by their averages just as students can be compared by their counts or scores.

• Standard Deviation (SD): Theoretically, 2/3 of the counts on a distribution of scores are expected to fall within one SD of the average. A very well prepared (or very under prepared) class will yield a small SD. A mixed class, with both very high and very low scores, will yield a large SD (many A-B and D-F grades, with few C grades).

• Item Discrimination: A discriminating question separates those who know (high scoring students) from those who do not know (low scoring students). Every classroom test needs about ten of these to produce a grade distribution where one SD is ten percentage points (a ten-point range for each grade).

• Test Reliability: A test has high reliability when the results are highly reproducible. Standardized tests, therefore, use only discriminating questions. They rarely ask a question that almost all students can answer correctly. Traditional multiple-choice, therefore, does not assess what students actually know and value. Traditional standardized tests can only rank students.

• Standard Error of Measurement (SEM): Theoretically, 2/3 of the time a student retakes the same test, the scores are expected to fall within one SEM of the average. The SEM value fits inside the range of the SD. “Jimmy, you failed the test, but based on your test score and your luck on test day, each time you retake the test, you have a 20% expectation of passing without doing any more studying.” The SEM precision is based on the reliability of the entire test.

• Conditional Standard Error of Measurement (CSEM): The CSEM is based (conditioned) on each test score. This refinement in precision is a recent addition to traditional multiple-choice analysis. It has been a part of the Rasch model IRT analysis for decades.

Even the CSEM cannot clean up the damage done by forcing students to mark every question even when they cannot read or do not understand the question. Knowledge and Judgment Scoring and the partial credit Rasch model do not have this flaw. Both accommodate students functioning at all levels of thinking and all levels of preparation.  These two scoring methods are in tune with the objectives of the CCSS movement.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

## Wednesday, October 2, 2013

### Visual Education Statistics - Conditional Standard Error of Measurement

21

[[Second Pass, 8 July 2014.  Equation 6.3 (cited below) in Statistical Test Theory for the Behavioral Sciences by Dato N.M. de Gruijter and Leo J. Th. van der Kamp, 2008, is the same as the calculation used in Table 29, in my 9 July 2014 post. On the following page they mention that the error variance is higher in the center and lower at the extremes. That distribution is the green curve on Chart 73. I did not see this relationship in the equation when this post was first posted, but do now in the visualized mathematical model (Chart 73).

Also the discussion of Table 24 has been updated to match the terms and values in Table 24.]]

Working on the conditional standard error of measurement (CSEM) is new territory for me. I always associated the CSEM with the Rasch model IRT analysis commonly used by state departments of education when scoring NCLB tests. I first had to Google for basic information.

If you are interested in the details, please check out these sources for sample (n-1) equations. (Equation 6.14, which corrects the relative variance, was not included in the 2005 edition but is in the current 2008 version. This represents significant progress in applying test precision.)

• Absolute Error Variance: Equation 5.39, p. 73
• Relative Error Variance: Equation 6.3, p. 83
• Corrected Relative Variance: Equation 6.14, p. 91, or GED Equation 3, p. 9

My first surprise was to find I had already calculated the CSEM for the Nursing124 data when I put up Post 5 of this series (in Table 8. Interactions with Columns [Items] Variance, MEAN SS = 3.33) as I discovered five ways to harvest the variance [mean sum of squares (MSS)]. Equation 6.3 n, Table 22, produces the same result (test SEM = 1.75) when it divides by n [unknown population] rather than n-1 [observed sample].

[n = the item count. Test SEM = AVERAGE(CSEM).]

I then used what I learned in the last post to tabulate data to obtain the conditional error variance for student scores (Table 23a). The 21 items in Table 22 became the number of right marks on each of 11 item difficulties on Table 23a. The values in this tabulation were then converted into frequencies conditional on the student scores, the sum of which added to one for each score (Table 23b).

The absolute error variance for each score was computed by Excel (=VAR.P). Multiplying the absolute error variance (0.14382) by the square of the item count (21^2) yields the relative error variance (63.42). [Equation 5.39 (0.14382) * n^2 = Equation 6.3 (63.42)] The square root of the relative error variance of each score yields the CSEM for that score. [An alternate calculation of the absolute error variance is shaded in Table 23b. Here the variance was calculated first and that value divided by the squared score to obtain the absolute error variance. This helps explain multiplying the absolute error variance by the squared item count to obtain the relative error variance for each score.]

The conditional frequency estimated test SEM was 1.68 (Table 23b). The conditional frequency CSEM values for each score were different for students with the same score. The CSEM values had to be averaged to get results comparable with the other analyses. These values generated an irregular curve, unlike the smooth curve for the other analyses (Chart 61). The conditional frequency CSEM analysis is sensitive to the number of items with the same difficulty (yellow bars alternate for each change in value, Table 23b). The other analyses are not sensitive to item difficulty (yellow bars, in Table 22, include all students with the same score).

Complete curves were generated from Equation 6.3 for n-1 and for GED n-1 (Table 24). The GED n-1 analysis includes a correction factor (cf) for the range of item difficulties on the test [cf = (1 - KR20)/(1 - KR21)]. This factor is equal to one if all items are of equal difficulty. For the Nursing124 data it was 1.59; the difficulties ranged from 45% to 95%, from the middle of the total possible distribution to one extreme.
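As a hedged sketch: Lord’s binomial-error formula gives the uncorrected CSEM for a raw score x on an n-item test, and the correction factor can be computed from the two reliabilities (how the cf is applied to the error variance is spelled out in Setzer’s Equation 3, which I do not reproduce here):

```python
import math

def binomial_csem(x, n):
    """Lord's binomial-error CSEM for raw score x on an n-item test:
    sqrt(x(n - x)/(n - 1)); zero at the extremes, largest mid-scale."""
    return math.sqrt(x * (n - x) / (n - 1))

def difficulty_correction(kr20, kr21):
    """cf = (1 - KR20)/(1 - KR21); equal to 1 when all items are
    equally difficult, larger as the difficulties spread out."""
    return (1 - kr20) / (1 - kr21)

# On a 21-item test the uncorrected CSEM peaks near mid-scale:
print(round(binomial_csem(10, 21), 2))  # 2.35
print(binomial_csem(21, 21))            # 0.0
```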

The CSEM values from the six analyses are listed in Table 24. Five are fairly close to one another. The GED n-1, with a correction for the range of item difficulties, is far different from the other five (Chart 61). Values could not be created for the full curve for conditional frequencies, as you must actually have student marks to calculate conditional frequency CSEM values. The gray area shows the values calculated from an equation for which there were no actual data. Equations produce nice-looking, “look right” reports.

The CSEM improves the reportable precision on this test over using the test SEM. Good judgment (best practice) is to correct the CSEM values as done on the GED n-1 analysis.

[I did not transform the raw test score mean of 16.8 or 79.8% to a scale score of 50% as was done by Setzer, 2009, GED, p. 6 and Tables 2 and 3. The GED n-1 raw score cut point was 60% which is comparable to most classroom tests. If 25% of the score is from luck on test day that leaves 35% for what a student marked right as something known or could be done, as a worst case. If half of the lucky marks were also something the student knew or could do, the split would be about 10% for luck on test day and 50% for student ability.]

In Table 24, the GED n-1 analysis test SEM of 2.98 for the Nursing124 data is, as a range, 2.98/21 or 14.19%. For the uncorrected Equation 6.3 n-1 analysis, 1.79, the range is 1.79/21 or 8.52%. The n SEM was 1.75 or 7.95%. The n SEM range, 1.75, fits within the uncorrected n - 1 test SEM value, 1.79. The corrected GED n-1 test SEM value, 2.98, exceeds it.

Student score CSEM values are even more sensitive than the test SEM values. The maximum range for the GED n-1 analysis is 3.73 or 3.73/21 or 17.76% and for the Equation 6.3 n-1 analysis 2.35 or 11.19%. Both are beyond the maximum n CSEM value of 2.29 or  10.41%. This low quality set of data fails to qualify as a means of setting classroom grades or a standardized test cut score.

[However, the classroom rule of 75% for passing the course and the rule for grades set at 10 percentage points overrule these statistics. Here is a good example that test statistics have meaning only in relation to how they are used. If the process of data reduction and reporting is not transparent, the resulting statistics are suspect and can produce extended debates over a passing score in the classroom.]

The CSEM for each student score does improve test precision. It can be calculated in several ways with close agreement. But it cannot improve the quality of the student marks on answer sheets made under traditional, forced-choice, multiple-choice rules. These tests only rank students by the number of right marks. They do not ask students, or allow students to report, what they really know or can do, or their judgment in using it.

The CCSS movement is now promoting learning at higher levels of thinking (problem solving) with, from what I have learned, some de-emphasis on the lower levels of thinking that are the foundation for higher levels of thinking. A successful student cycles through all levels of thinking, as needed. Yet half of the CCSS testing will be at the lowest levels of thinking: traditional multiple-choice scoring. The other half will be as much of an overkill as traditional multiple-choice is an underkill in assessing student knowledge, skills, and the development needed to learn and apply their abilities. Others share the concern that centralized politics (and dollars) will continue to overshadow the reality of the classroom.

There is a middle ground that makes every question function at higher levels of thinking, allows students to report what is meaningful, of value, and empowering, and has the speed, low cost, and precision of traditional multiple-choice. Knowledge and Judgment Scoring and partial credit Rasch model IRT are two examples. They both accommodate students functioning at all levels of thinking. Lower ability students do not have to guess their way through a test. With routine use, both can turn passive pupils into self-correcting highly successful achievers in the classroom. If you are really into mastery learning, you can also try something like Knowledge Factor.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

## Wednesday, September 25, 2013

### Visual Education Statistics - Frequency Estimation Equating

20
Frequency Estimation Equating involves conditioning on the anchor, a set of common items. This post reports my adventures in figuring out how this is done; I needed to know how in order to complete the next post on the conditional standard error of measurement (CSEM).

Two 24-student by 15-item tests were drawn from the Nursing124 data. Each included a set of 6 common items that were marked the same in both tests, A and B (Table 20). Student scores varied between Test A and Test B based on their marks on the other, non-common, items. The common items were sorted by their difficulty.

I then followed the instructions in Livingston (2004, pp. 49-51). The values in Table 20 were tabulated to produce “a row for each possible [student] score” and “a column for each possible score on the anchor [common item]” (Table 21). The tally is turned into frequencies conditioned on the common item scores by dividing each column cell by its column total (the common item score count, or difficulty). The frequencies for each common item score then sum to 1.00.

Next, the unknown population proportions are estimated by combining (multiplying) the common item frequencies with the equal portion (1/6) each common item contributed to the test (Table 21). These values now represent the on-average expectations for each cell based on the observed data. Summing by rows produces the estimated (best guess) population student score distribution that could have produced those on-average expectations. This was done for both Test A and Test B.
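The tabulation steps above can be sketched in a few lines of Python. The 3x3 tally below is invented for illustration (the real Table 21 has a row for each possible student score, a column for each possible common item score, and 1/6 portions):

```python
# Hypothetical tally: rows = possible student scores, columns = possible
# scores on the common (anchor) items; each cell counts students.
tally = [
    [2, 1, 0],
    [1, 3, 2],
    [0, 2, 4],
]

n_rows, n_cols = len(tally), len(tally[0])
col_totals = [sum(row[j] for row in tally) for j in range(n_cols)]

# Condition on the common items: divide each cell by its column total
# so the frequencies for each common-item score sum to 1.00.
cond_freq = [[tally[i][j] / col_totals[j] for j in range(n_cols)]
             for i in range(n_rows)]

# Weight each column by the equal portion each common item contributed
# (1/3 here, 1/6 in the post), then sum across each row to estimate the
# population student-score distribution.
est_distribution = [sum(cond_freq[i][j] / n_cols for j in range(n_cols))
                    for i in range(n_rows)]
print(round(sum(est_distribution), 6))  # 1.0 -- a proper distribution
```

Because each conditioned column sums to 1.00 and the equal portions sum to 1.00, the estimated row distribution always sums to 1.00 as well.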

[This operation can be worked backward (in part) to yield the right mark tally. Dividing the population proportions by the number of items in the sample yields the right mark frequencies. Multiplying the right mark frequencies by the difficulty yields the right mark tally. But there is no way to back up from the estimated population distribution to this set of population proportions, let alone to individual student marks. The right mark tally is a property of the observed sample and of individual student marks. This estimated population distribution is a property of the unknowable population distribution related to the normal curve. The unknowable population distribution can spawn endless sets of population proportions. Monte Carlo psychometric experiments can be clean of the many factors that affect classroom and standardized test results.]

Charts 59 and 60 show the effect produced by conditioning on the common items. This transformation from observed to on-average expectations appears to rotate the distribution about the average test score of 84% and 80%, respectively, for both Test A and Test B. It made a detectable increase in the frequency of high scores and a similar decrease in the frequency of low scores. This increased the average scores to 86% and 84%, respectively. Is this an improvement or a distortion?

“And when we have estimated the score distributions on both the new form and the reference form, we can use those estimated distributions to do an equipercentile equating, as if we had actually observed the score distributions in the target population” (Livingston, 2004). I carried this out, as in the previous post, with nothing of importance to report.

So far in this series I have found that data reduction from student marks to a finished product is independent of the content actually on the test. The practice of using several methods and then picking the one that “looks right” has been promoted. Here an unknown population distribution is created from observed sample results. Here we are also given the choice of selecting Test A or Test B or combining the results. As the years pass, it appears that more subjectivity is tolerated in getting test results that “look right” when using traditional, non-IRT, multiple-choice scoring. This charge was formerly directed at the Rasch model IRT analysis.

It does not have to be that way. Knowledge and Judgment Scoring and partial credit Rasch model IRT allow a student to report what is actually meaningful, useful, and empowering to learn and to apply what has been learned. This property of multiple-choice is little appreciated.

What traditional multiple-choice is delivering is also little understood: psychometricians guess the extent to which sample (actual test) results match an unknowable standard distribution population, based on student marks that include forced guessing, on test items the test creators guess students will find equally difficult, based on a field test they guess will represent the current test takers, on average.

We still see people writing, “I thought this test was to tell us what [individual] students know.” Yet, traditional, forced-choice, multiple-choice can only rank students by their performance on the test. It does not ask them, or permit them, to individually report what they actually know or can do based on their own self-judgment: just mark every item (as a missing mark is still considered more degrading to an assessment than failing to assess student judgment).

- - - - - - - - - - - - - - - - - - - - -


## Wednesday, July 24, 2013

### Visual Education Statistics - Equipercentile Equating

19
Equipercentile equating frequently appears in NCLB testing articles. I took a normal distribution of 40 student scores (average of 50%) with a standard deviation (SD) of 10% (new test) and equated it to one with a SD of 20% (reference test) to see how equipercentile equating works (Chart 54).

First I grouped the scores into 5%-ranges. I then matched the new test groups to the reference test groups (Chart 55). The result was a bit messy.

A re-plot of the twenty 5%-groups shows that the new test has been sliced into groups containing twice the count of the reference test groups, but which match the reference test, in general, at every other group (Chart 56).

Smoothing by inspection resulted in Chart 57. A perfect fit with the reference test was obtained, with the exception of rounding errors.
Smoothing on “small samples of test-takers” does make a difference in the accuracy of equipercentile equating. “The improvement that resulted from smoothing the distributions before equating was about the same as the improvement that resulted from doubling the number of test-takers in the samples” (Livingston, 2004, page 21). [See Post 13, Chart 34, in this series for the effect of doubling the number of test-takers on the SD and SEM.]
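The basic unsmoothed mapping can be sketched in Python for two equal-sized groups, where matching percentiles reduces to pairing the sorted score lists (the scores below are invented, with roughly the 10% and 20% SDs used above):

```python
def equipercentile(new_scores, ref_scores):
    """Unsmoothed equipercentile sketch for equal-sized groups with
    distinct scores: the k-th lowest new-test score maps to the k-th
    lowest reference-test score (the same percentile rank in each)."""
    mapping = dict(zip(sorted(new_scores), sorted(ref_scores)))
    return [mapping[x] for x in new_scores]

new = [50, 35, 65, 45, 55]  # invented new-test scores, SD about 10
ref = [20, 40, 50, 60, 80]  # invented reference-test scores, SD about 20
print(equipercentile(new, ref))  # [50, 20, 80, 40, 60]
```

The equated scores take on the reference test's spread, which is exactly the transformation Chart 58 shows; real applications interpolate and smooth rather than pair scores one-to-one.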

I then entered the values from Charts 54, 55, and 57 into my visual education statistics engine (VESE). Equipercentile equating the student scores transformed the new test into the reference test including the related group statistics (Chart 58).
The three 5%-groupings show almost identical values. Grouping reduced the item discrimination ability (PBR) of the reference test a small amount, as grouping reduced the range of the student score distribution. This works very nicely in a perfect world; however, real test scores do not align perfectly with the normal curve.

A much more detailed description of equipercentile equating and smoothing is found in Livingston (2004, pages 17-24). The easy-to-follow illustrated examples include real test results and related problems, with a troubling resolution: “Often the choice of an equating method comes down to a question of what is believable, given what we know about the test and the population of test-takers.”

This highly subjective statement was acceptable in 2004. NCLB put pressure on psychometricians to do better. The CCSS movement has raised the bar again. The subjectivity expressed here is, IMHO, similar to that in the Rasch model IRT analysis that has been popular with state departments of education. Both non-IRT and IRT methods base results on a relationship to an unknowable “population of test-takers”. Both pursue manipulations that end up with the results “looking right”.

[The classroom equivalent of this, practiced in Missouri prior to NCLB, was to divide the normal curve into parts for letter grades. One version was to assign grades to ranked student scores with uniform slices. True believers assigned a double portion to “C”. Every class was then a “normal” class with no way to know what the raw scores were or what students actually knew or could do.]

It does not have to be that way. Let students report what they actually know and can do. Let them report what they trust will be of value for further learning and for application in situations other than in which they learned. Do multiple-choice right. Get results comparable to essay, project, report, and research. Promote student development. Knowledge and Judgment Scoring and partial credit Rasch model analysis do this. Guessing is no longer needed. Forced guessing should not be tolerated, IMHO.

The move to performance based learning may, this time, not only compete with the CCSS movement assessments, but replace them. The system that is the leanest, the most versatile in meeting student needs, and the most immune to erratic federal funding, and thus the most effective, will survive.
- - - - - - - - - - - - - - - - - - - - -


## Wednesday, July 10, 2013

### Visual Education Statistics - Equating

18
The past few posts have shown that if two tests have the same student score standard deviation (SD) they are easy to combine or link. Both tests will have the same student score distribution on the same scale.

Equating is then a process of finding the difference between the average test scores and applying this value to one of the two sets of test scores. Add the difference in average test score to the lower set of scores, or subtract it from the higher set to combine the two sets of test scores.
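A minimal Python sketch of this mean-difference adjustment, with invented scores that happen to share the same SD:

```python
new_scores = [55, 60, 62, 70, 73]  # hypothetical new-test scores
ref_scores = [60, 65, 67, 75, 78]  # hypothetical reference-test scores

# With matching SDs, equating reduces to shifting every new-test score
# by the difference between the two average test scores.
shift = sum(ref_scores) / len(ref_scores) - sum(new_scores) / len(new_scores)
equated = [x + shift for x in new_scores]
print(shift)    # 5.0
print(equated)  # [60.0, 65.0, 67.0, 75.0, 78.0]
```

After the shift the two score sets sit on one scale; nothing about the shape of either distribution changes.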

This can be done whenever the SDs are within acceptable limits (considering all factors that may affect the test results, the expected results, and the intended use of the results). This is, IMHO, a very subjective judgment call to be made by the most experienced person available.

There are two other situations: the same average test score but SDs that differ beyond acceptable limits, and both average test score and SD differences beyond acceptable limits. In both cases we need to equate the two different SDs, that is, the two different distributions of student scores.

Chart 48 is a re-tabling of Chart 44. The x-axis in Chart 48 shows the set standard deviation (SD) used in the VESE tables in prior posts. Equating a low SD test (10) to a high SD test (30) has different effects than equating a high SD test (30) to a low SD test (10). The first improves the test performance; the second reduces the test performance.

There is then a bias to raise the low SD test to the high SD test. “The test this year was more difficult than the test last year,” was the NCLB explanation from Texas, Arkansas, and New York. [It was not that the students this year were less prepared.]

The most frequent way I have seen mapping (Livingston, 2004, figure 2, page 14) done is to plot the scores of the test to be equated on the x-axis and the scores of the reference test on the y-axis. The equate line for two tests with similar average test scores and SDs is a straight line from zero through the 50% point on both axes (Chart 49).

If the average test scores are similar but the SDs are different, the equate line becomes tilted to expand (Chart 50) or contract (Chart 51) the equated values to match the reference test. Mapping from a low SD test to a higher SD test leaves gaps. Mapping from a high SD test to a low SD test produces clumping, in part from rounding errors.
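The tilted equate line amounts to a linear mapping through matching z-scores. A Python sketch with illustrative means and SDs shows both the gaps and the rounding clumps:

```python
def linear_equate(x, mean_new, sd_new, mean_ref, sd_ref):
    """Map a new-test score to the reference scale by matching z-scores."""
    return mean_ref + (sd_ref / sd_new) * (x - mean_new)

# Equal means (50), new SD 10 vs. reference SD 20: the line expands,
# leaving gaps between mapped scores.
expanded = [round(linear_equate(x, 50, 10, 50, 20)) for x in (40, 45, 50, 55, 60)]
print(expanded)  # [30, 40, 50, 60, 70]

# The reverse direction (SD 20 to SD 10) contracts; rounding clumps
# neighboring scores onto the same value.
clumped = [round(linear_equate(x, 50, 20, 50, 10)) for x in (41, 42, 43, 44)]
print(clumped)  # [46, 46, 46, 47]
```

Consecutive new-test scores either spread apart (gaps) or collapse together (clumps), which is exactly what Charts 50 and 51 display.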

Mapping a new, more difficult test to an easier reference test with the same SD increases the values on the equating line as well as truncates it. Any new test scores over 30 on Chart 52 have no place to be plotted on the reference test scale.

Equating with an increase in both SD and average test score expands the distribution and truncates the equating line even more (Chart 52). A comparison of the two situations above as parallel lines (Chart 53) helps to clarify the differences.
Both increase the new, more difficult test's average score of 20 counts to 30 counts on the reference scale. In this simple example based on a normal distribution, the remaining values increase uniformly: in equal units of 10 with the same SD, and of 15 when mapping to the larger SD.

The significance of this is that in the real world, test scores are not distributed in nice ideal normal distributions. The equating line can assume many shapes and slopes.

The unit of measure needed to plot an equating chart must include equivalent portions of the two distributions. Percentage is a convenient unit: equipercentile equating. [More on this in the next post.]

Whether Test A is the reference test, Test B is the reference test, or both are combined into one analysis is the difficult subjective call of the psychometrician. So much depends on luck on test day related to the test blueprint, the item writers, the reviewers, the field test results, the test maker, the test takers, and many minor effects on each of these categories.

This is little different from predicting the weather or the stock market, IMHO. [The highest final test scores at the Annapolis Naval Academy were during a storm with very high negative air ion concentrations.] The above factors also need to include the long list of excuses built into institutionalized education at all levels.

On a four-option item, chance alone injects an average 25% value (that can easily range from 15% to 35%) when students are forced to mark every item on a traditional multiple-choice (TMC) test. Quality is suppressed into quantity by only counting right marks: quality and quantity are therefore linked into the same value. TMC high test scores have higher quality than lower test scores, but this is generally ignored.

It does not have to be that way. Both the partial credit Rasch model IRT and Knowledge and Judgment Scoring permit students to report what they trust they know and can do and what they have yet to learn accurately, honestly and fairly. No guessing is required. Both paper tests and CAT tests can accept, “I trust I know or can do this,” “I have yet to learn this,” and if good judgment does not prevail, “Sorry, I goofed.”  Just score 2, 1, and 0 rather than 1 for each right mark (for whatever reason or accident).
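The 2, 1, 0 scoring rule can be sketched in a few lines of Python (the mark labels are a hypothetical encoding):

```python
def kjs_score(marks):
    """Knowledge and Judgment Scoring sketch: 2 points for a right mark,
    1 point for an honest omit ("I have yet to learn this"), and 0 for
    a wrong mark ("Sorry, I goofed")."""
    points = {"right": 2, "omit": 1, "wrong": 0}
    return sum(points[m] for m in marks)

# A student who omits what is not yet known outscores one who guesses wrong.
careful = kjs_score(["right", "right", "omit", "omit"])
guesser = kjs_score(["right", "right", "wrong", "wrong"])
print(careful, guesser)  # 6 4
```

The middle value is what rewards judgment: omitting an unknown item earns more than a wrong guess, so guessing is no longer the dominant strategy.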

A test should encourage learning. TMC at the lower scores is punitive. By scoring for both quantity and quality (knowledge and judgment) students receive separate scores, just as is done on most other assessments: “You did very well on what you reported (90% right) but you need to do more to keep up with the class” rather than “You failed again with a TMC score of 50%.”

Classroom practice during the NCLB era tragically followed the style of the TMC standardized tests conducted at the lowest levels of thinking. The CCSS tests need to model rewarding students for their judgment as well as right marks. [We can expect the schools to again doggedly try to imitate.] It is student judgment that forms the basis for further learning at higher levels of thinking: one of the main goals of the CCSS movement. The CCSS movement needs to update its use of multiple-choice to be consistent with its goals.

Equating TMC meaninglessness does not improve the results. This crippled form of multiple-choice does not permit students to tell us what they really know and can do that is of value for further learning and instruction.

- - - - - - - - - - - - - - - - - - - - -
