Wednesday, November 28, 2012

A Balanced Common Core State Standards Assessment

It is time for psychometricians, teachers, and students to get on the same track with the same unit of measurement (not motorcycles, bicycles, and tricycles). Psychometricians have been top dog: feared, secretive, and unquestioned in their judgment. Teachers have worked hard, but to my current knowledge, only in a case like Nebraska has their judgment made a meaningful improvement in test results. Students have been treated as inanimate commercial commodities.

Optimum test results can only be obtained when the playing field is leveled for all three stakeholders. It is currently optimized from the view of psychometricians who have been strongly influenced, at times, by political power, and more often silenced by golden handcuffs. The “anomalies” that have become public and then retracted (more than once in Florida) show us the fruits of one-stakeholder rule in student performance assessment.

And now we have the Common Core State Standards tests. Students would like an honest, accurate, and fair test. Teachers and students would like to know what each student knows and can do and what each one has yet to learn. Psychometricians would like highly reproducible test results that do not require equating from year to year (equating exposes the error in selecting test items of equal difficulty) but that still present the appearance of equal difficulty.

And then we have the secondary level stakeholders who demand (and who fund with millions of dollars) that the test results come only in the form of a ranking that shows improvement each year. They also want this done at the lowest cost. To date the secondary level stakeholders have held the field.

Why things are as they are is then not too difficult to understand if you ignore the marketing that often overstates what is actually being done. Assessments carried out as forced activities cannot produce a valid indicator of what students actually know and can do. Such tests can produce a valid statistical ranking for satisfying a state or federal law. And that is why and how the tests have been funded.

The Common Core State Standards movement suggests that the judgment of all three primary stakeholders is included and respected. No one party is to triumph over or manipulate the other two parties. This demands some changes in the way they interact.

Students should be given the option of exercising their judgment in responding to test elements. This is inherent in classroom folders. It is also present when students have the option to respond to 5 essay items out of the 7 to 10 offered on a test. And in the alternative form of multiple-choice (quantity and quality scoring), students select the questions on which to report what, in their judgment, they trust they know or can do.

Teachers should be given the option of exercising their judgment in writing test items that provide insight into what students are learning from what they are teaching. This includes both subject matter and skills, and student development. Teachers should be able to report, based on their judgment, which group each student best fits (below, meets, or exceeds standards), as in Nebraska. Taken together, these inputs capture in numbers the climate of the classroom.

Psychometricians must respect the needs of the other two stakeholders. The oversimplification of data collection and data reduction to obtain the highest possible (but questionable) test reliability needs to become part of the history of a natural experiment (NCLB) that has gone on too long. What works nicely in the safety of the research laboratory cannot be directly applied to individual student performances and still yield meaningful results (other than a ranking).

IMHO the Common Core State Standards movement demands the inclusion of more of the classroom climate (instruction, learning, feedback) than what forced test student performances yield. The student must be given the option to report what is meaningful, useful and empowering. The mechanics are simple for the student: know and don’t know; can or can’t do. Mark an option, select a question, or perform a task when in your mind you can trust what you are doing (and that this can be used as the basis for further learning and instruction). 

Students want to succeed. Teachers want them to succeed. Psychometricians need to capture what students and teachers have accomplished by letting students report knowledge, skills, and judgment. Quantity and quality scoring captures all three. Forced performances capture only part of knowledge and skills.

This has been a long introduction to three charts that summarize the psychometrician’s view of a standardized test. The first view is the result of oversimplifying the classroom environment. Only right marks are counted on multiple-choice tests, or right stuff (generally restricted to rubrics) is counted on other forms of assessment. A raw score distribution is divided into three to five parts with cut scores. This is purely a statistical concept that works with any sample of anything. Once you have it in hand, the next job is to ascribe meaning to it based on each psychometrician’s judgment. The data from Alaska indicate that about 1/4 of the time students of equal abilities switch categories from year to year. This is a sizable measurement error related to right mark scoring.
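This year-to-year churning can be illustrated with a toy simulation. Everything below is assumed for illustration only: the cut scores (45 and 65), the ability distribution, the size of the test-day measurement error, and the number of students; none of it comes from the Alaska data.

```python
import random

random.seed(42)

def category(score, cuts=(45, 65)):
    """Assign below / meets / exceeds using two assumed cut scores."""
    return sum(score >= c for c in cuts)

# Each student has a fixed true score; each test sitting adds measurement
# error. Count how often the reported category changes between two
# sittings even though the student's ability did not change at all.
students = [random.gauss(55, 15) for _ in range(10000)]
switched = sum(
    category(true + random.gauss(0, 5)) != category(true + random.gauss(0, 5))
    for true in students
)
frac = switched / len(students)
print(round(frac, 2))
```

With these made-up numbers, roughly a fifth to a quarter of the simulated students change categories purely from measurement error, in the neighborhood of the 1/4 figure cited for Alaska.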
The second view includes teacher judgment (see Nebraska posts). The single distribution is now teased apart into three. The average test score is no longer 50% but near 70%. The three score regions (below, meets, and exceeds standards) now have meaning based on teacher judgment (standard deviation of 20%, for example). 

The third view includes student judgment to report what is actually known and can be done that is the trusted basis for further learning and instruction. This is what the Common Core State Standards movement states is now needed. This chart is speculative. I have no actual data for it. I do know from working with over 3000 students that the portion of a test score distribution below 50% almost vanishes with quantity and quality scoring. Also the variation (the standard deviation) is lower, giving better separation of students grouped by performance (standard deviation of 10%, for example).

The psychometrician’s view is simple, cheap, and often illusory. The teacher’s view is more meaningful. The student’s view completes a balanced assessment system.

In summary, the Common Core State Standards movement now demands far better test scoring and analysis than were used in the past. In the case of multiple-choice tests, the switch from right count scoring to quantity and quality scoring involves only a change in test instructions that permits each student to elect which method should be used to score the test (see prior posts). The test then yields results that students, teachers, and psychometricians can all agree look right.

Software to do this has been in existence for over two decades. Winsteps (partial credit Rasch model IRT) and Power Up Plus (Knowledge and Judgment Scoring) are two examples. Winsteps has been a popular program for state departments of education during the NCLB decade (they only need to change test instructions to assess student judgment).

Power Up Plus (PUP) is a classroom friendly program developed to provide students a means to frequently report accurately, honestly, and fairly what they actually knew and could do that was of value to themselves. They used the test results to guide further learning. I used the test results to guide my instruction and their development (passive pupil to self-correcting high achiever).

What all of this comes down to is an inversion of the present hierarchy:
  1. Let students have the opportunity to earn a quality score of 80-90% regardless of the quantity score. Let students report what they really know and can do.
  2. Let teachers submit questions that have been shown in the classroom to meaningfully group students by their understanding, ability, skill, and development. These are questions that measure something important: mastery, misconceptions, reasoning errors, and the like. Also let teachers estimate student test performance (below, meets, and exceeds standards) as a part of each standardized test.
  3. Let psychometricians do their best with counts that are based on real students and classrooms rather than conducting an academic game show. The current statistical concept for ranking students is IMHO an even less perfect match to the Common Core State Standards movement than to the NCLB standards.
This is one way to produce a balanced assessment system. The standardized test items grow from all learning experiences. Students are free to make an accurate, honest, and fair report. Psychometricians are free to moderate a meaningful assessment process.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, November 21, 2012

Alaska Student Assessment Three Star Rating

The Alaska Reading Standards Based Assessments contain three features worthy of a star. In 2011, they show a matched comparison analysis that provides insight into the dynamic nature of student assessment. In 2001, they also contain traditionally set cut scores and questions that are easy enough to provide actual measurement of what students know and can do.

ONE STAR: Alaska recorded the scores of students who obtained an increased, decreased, or the same (stable) score this year as last year on the reading test for 2008-2009, 2009-2010, and 2010-2011 in a matched comparison analysis. The charts present static and dynamic views.

The portion of students in the Far Below Proficient and Below Proficient Stable group remained the same for all three comparisons. The portion of students in the Proficient and Advanced Stable group shows a very small decline from year to year. The portion of students showing a decrease in performance matched the portion showing an increase in performance. This is a static view.

The dynamic view shows much more is going on in this assessment system. The reason the two above Stable views were stable is that about the same number of students who tested Below Proficient last year, this year tested Proficient (improved in proficiency), and the same number who tested Proficient last year, tested Below Proficient this year (decreased in proficiency).

This balanced exchange also took place between Proficient and Advanced levels of performance. In total, about 26% of all students changed proficiency levels each year (about 6% of the students crossed each of the two cut scores in both directions).

There are several reasons for this churning. The most obvious is variation in student preparation from year to year (any one set of questions will match one portion of the students better than the rest of the examinees). Another is how lucky each student was on test day. This brings up test design.

TWO STARS: The Alaska test compares student performance (norm-referenced). This is the most common and least expensive way to create a standardized test. It also forces students to mark answers even when they cannot read or understand the questions. This is called right count scoring, the traditional way of scoring classroom tests. It produces a score that can be used to validly rank student performance.

THREE STARS: The 2001 Alaska Technical Report, page 18, shows the average test scores for Reading ranged from 67% to 72% for grades 3, 6, and 8. Scores above 60% can indicate what students actually know and can do rather than their luck on test day. (The publication of average raw test scores is now considered essential to permit validation of the test results and comparison with other states using the same Common Core State Standards test.) [The Spring 2006 Alaska Standards Based Assessments, Chapter 8, did not list the average raw test scores: no star.]

SCORE VARIATION: The 2001 report, page 25, also shows the standard error of measurement (SEM), an estimate of where each student’s score would land on the cut score divided distribution, if the student could repeat the test. The example for Reading grade level 3 shows that 2/3rds of the time the repeated test scores of student “A” would fall within the range of 388 and 442 scale score units (415 original score +-27 SEM). That is 27/351 or 7.7% of the test mean, or 27/600 or 4.5% of the full-scale score. (The SEM is derived from the test reliability and the standard deviation in scale score units. A smaller, more desired, SEM can be produced by a higher test reliability and a lower standard deviation.)
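The derivation of the SEM from reliability and standard deviation can be sketched in a few lines. The reliability value below is an assumption chosen only to land near the reported numbers; the 2001 report would have to be consulted for the actual figure.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# An assumed scale-score SD of 83 and an assumed reliability of 0.90
# give an SEM near the 27 scale-score units cited for Reading grade 3.
e = sem(83, 0.90)
print(round(e, 1))         # about 26
print(415 - 27, 415 + 27)  # the 388 to 442 band for student "A"
```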

The standard deviation of the raw scores and of the scale scores provides a more direct view of the variation in the student test scores, page 18. The standard deviation is computed by squaring the deviation of each student score from the test mean, summing those squared deviations, dividing by the number of scores (the variance), and then taking the square root to return to the original units (squaring makes all the deviations positive values; otherwise they would sum to zero).
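The procedure just described can be written out directly. The scores below are invented solely to exercise the formula.

```python
import math

def standard_deviation(scores):
    """Square each deviation from the mean (so they cannot cancel out),
    average the squares (the variance), then take the square root."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return math.sqrt(variance)

scores = [21, 25, 28, 30, 31, 34, 36, 39]  # invented raw scores
sd = standard_deviation(scores)
print(sd)  # 5.5
```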

The average standard deviation for the nine, grade 3, 6, and 8, test raw scores was 8.8/30.1 or 29% of the test means; that is, 2/3rds of the time a student with an average score of 30.1 would be expected to have repeated test scores fall between 30.1 +-8.8 or 21.3 to 38.9 on a test with 42 points total. Converting all of this into log ratio (logit) units used by psychometricians produces slightly different results.

The average standard deviation for the nine, grade 3, 6, and 8, test scale scores was 83/349 or 24% of the test means; that is 2/3rds of the time a student with an average scale score of 349 would be expected to have repeated scale scores fall between 349 +- 83 or 266 to 432 on a scale score range of 500 points (100 to 600).

Both the SEM and the standard deviations show a large amount of uncertainty in test scores. The documentation of this churning is worth a third star. This variation, inherent in any attempt to capture student performance in a single number, accounts for much of the churning observed from year to year. Scoring these tests for quantity and quality instead of just counting right marks would yield much more useful information, in line with the philosophy of the Common Core State Standards.

THREE OTHER STARS: Alaska places emphasis on cut scores on a single score distribution (norm-referenced). Nebraska (see previous post) places emphasis on two other score distributions (two stars): it groups scores both by asking questions needed to assess specific knowledge and skills (criterion-referenced) and by teacher judgment of which group each student they know well best fits. Cut scores fall where a student score has an equal probability of falling into either group.

Both Alaska and Nebraska have yet to include student judgment in their assessments (one star). When that is done, Alaska will have an accurate, honest, and fair test that better matches the requirements of the Common Core State Standards.

Most right marks will then represent right answers instead of luck on test day, and there will be less churning of student performance rankings. The level of thinking used by students on the test and in the classroom can also be obtained. All that is needed is to give students the option to continue guessing or to report what they trust they know.

*   Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Including student judgment will add student development (the ability to use all levels of thinking) to the Alaska test. The Common Core State Standards need students who know and can do, but also students who have exercised judgment in applying knowledge and skills.

Routine use of quantity and quality scoring in the classroom promotes student development. It promotes the sense of responsibility and reward needed to learn at all levels of thinking, a requirement of the Common Core State Standards.

Software to do quantity and quality scoring has been available for over two decades. Alaska is already using Winsteps. Winsteps contains the partial credit Rasch model routine that scores quantity and quality. 

Power Up Plus (PUP) scores multiple-choice tests by both methods: traditional right count scoring and Knowledge and Judgment Scoring. Students can elect which method they are most comfortable with in the classroom and in preparation for Alaska and Common Core State Standards standardized tests.

Since 2005, Knowledge Factor has offered a patented learning system that guarantees student development. High quality students generally pass standardized tests. All three programs promote the sense of responsibility and reward needed to learn at all levels of thinking, a stated requirement of the Common Core State Standards movement.

Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.

Wednesday, November 14, 2012

Scoring Judgment and the Common Core State Standards

How student judgment is to be scored by Common Core State Standards assessments has yet to be finalized. How student judgment can be scored is related to time and cost. There is little additional cost when integrated into classroom instruction (in person or by way of software), as formative assessment, with an instant to one-day feedback. Weekly and biweekly classroom tests take additional time. Summative standardized tests take even more time.

Common Core State Standards tests will be summative standardized tests. The selection of questions for all types of tests is subjective. The easiest type of test to score is the multiple-choice or selected response test. All other types of tests require subjective scoring as well as subjective selection of items for the test.

The multiple-choice test is the least expensive to score. The traditional scoring by only counting right marks eliminates student judgment playing a part in the assessment. A simple change in the test instructions puts student judgment into the assessment where judgment can carry the same weight as knowing and doing.

*  Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Traditional right count scoring treats each student, each question, and each answer option with equal value. This simplifies the statistical manipulations of student marks. This is a common psychometric practice when you do not fully know what you are doing. It produces usable rankings based upon how groups of students perform on a test, which is something different from being based upon what individual students actually know or can do (what teachers and students need to know in the classroom).

This problem increases as the test score decreases. We have a fair idea of what a student knows with a test score of 75% (about 3/4 of the time a right mark is a right answer). At a test score of 50%, half of the right marks can be from luck on test day.

These two problems almost vanish when student judgment is included in the alternative multiple-choice assessment. Independent scores for knowledge and judgment (quantity and quality) indicate what a student knows and to what extent it can be trusted at every score level. This provides the same type of information as is traditionally associated with subjectively scored alternative assessments that all champion student judgment (short answer, essay, project, report, and folder).
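A minimal sketch of how independent quantity and quality scores might be computed, assuming quantity is the share of all items answered correctly and quality is accuracy on only the items the student chose to answer. The actual weighting used in Knowledge and Judgment Scoring or the partial credit Rasch model may differ from this sketch.

```python
def quantity_quality(marks):
    """marks: a list of 'right', 'wrong', or 'omit' for each item.
    Quantity: share of all items answered correctly.
    Quality: accuracy on the items the student chose to answer."""
    right = marks.count("right")
    attempted = right + marks.count("wrong")
    quantity = right / len(marks)
    quality = right / attempted if attempted else 0.0
    return quantity, quality

# A student who answers 10 of 20 items and gets 9 right reports a
# modest quantity (45%) but a high quality (90%).
print(quantity_quality(["right"] * 9 + ["wrong"] * 1 + ["omit"] * 10))
```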

Multiple-choice tests can cover all levels of thinking. They can be done in relatively short periods of time. They can be specifically targeted. Folders can cover long time periods and provide an appropriate amount of time for each activity (as can class projects and reports).

Standardized test exercises run into trouble when answering a question is so involved and time is so limited that the announced purpose of demonstrating creativity and innovation cannot take place in a normal way. My own experience with creativity and innovation is that it takes several days to years. These types of assessments IMHO then become a form of IQ test when students are forced to perform in a few hours.

Quantity and quality scoring can be applied to alternative assessments by counting information bits, in general, a simple sentence. It can also be a key relationship, sketch, diagram, or performance; any kernel of information or performance that makes sense. The scoring is as simple as when applied to multiple-choice.

Active scoring starts with one half of the value of the question (I generally used 10 points as the starting value for essay questions, which produced a range of zero to 20 points for an exercise taking about 10 minutes). Then add one point for each acceptable information bit. Subtract one point for each unacceptable information bit. Fluff, filler, and snow count zero.
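The active scoring procedure can be sketched directly. The clamp to the zero-to-twenty range is an assumption added to match the stated score range.

```python
def essay_score(bits, start=10, maximum=20):
    """Begin at half the exercise's maximum (10 of 20 points here);
    add 1 per acceptable information bit, subtract 1 per unacceptable
    bit; fluff, filler, and snow count zero."""
    score = start + bits.count("acceptable") - bits.count("unacceptable")
    return max(0, min(maximum, score))

# Seven acceptable bits, two unacceptable, some filler: 10 + 7 - 2 = 15.
bits = ["acceptable"] * 7 + ["unacceptable"] * 2 + ["filler"] * 3
print(essay_score(bits))  # 15
```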

Quantity and quality scoring and rubrics merge when acceptable information bits become synonymous. Rubrics can place preconceived limits (unknown to the student) on what is to be counted. With both methods, possible responses that are not made count as zero. Possible responses that are made that are not included in a rubric are not counted, but are counted with quantity and quality scoring. In this way quantity and quality scoring is more responsive to creativity and innovation. The down side of quantity and quality scoring, applied to alternative assessments (other than to multiple-choice), is that it includes the same subjective judgment of a scorer working with rubrics.

Standardized multiple-choice tests have been over marketed for years. The first generation of alternative and authentic tests also failed. This gave rise to folders and the return of right mark scored multiple-choice. The current generation of Common Core State Standards alternative tests appears to again be over marketed. 

We want to capture in numbers what students know and can do and their ability to make use of that knowledge and skill. Learning and reporting on any good classroom assignment is an authentic learning academic exercise. The idea that only what goes on outside the classroom is authentic is IMHO a very misguided concept. It directs attention away from the very problems created by an emphasis on teaching rather than on meeting each student’s need to catch up, to succeed each day, and to contribute to the class.

The idea that only a standardized test can provide needed direction for instruction is also a misguided concept. It belittles teachers. It is currently impossible to perform as marketed unless carried out online. Feedback must be within the critical time that positive reinforcement is achieved. At lower levels of thinking that feedback must be in seconds. At higher levels of thinking, with high quality students, feedback that takes up to several days can still be effective.

Common Core State Standards assessments must include student judgment. They must meet the requirements imposed by student development. Multiple-choice (that is not forced choice, but really is multiple-choice, such as the partial credit Rasch model IRT and Knowledge and Judgment Scoring) and all the other alternative assessments include student judgment.

All students are familiar with multiple-choice scoring (count right, wrong and omit marks). Few students are aware of the rubrics created to improve the reliability of subjectively scored tests. This again leaves the multiple-choice test as the fairest form of quick and directed assessment when students can exercise their judgment in selecting questions to report what they trust they actually know and can do.

For me, Knowledge and Judgment Scoring gave a better sense of what the class and each student knew and could do (and, as importantly, did not know and could not do) than reading 100 essay tests. It does a superior job of highlighting misconceptions and grouping students by specific learning problems in classes of over 20 students.

Wednesday, November 7, 2012

Student Judgment and the Common Core State Standards

The Common Core State Standards require students to take more responsibility for learning and for reporting what they actually know and can do. This takes practice. It is much like changing from riding a tricycle to a bicycle. You can ride both on the same course but the bicycle requires that you learn to balance. And once you learn to ride the bicycle you will never go back to the tricycle.

Most standardized assessments assume a linear increase in knowledge and skills. This is acceptable at lower levels of thinking (with training wheels). But at some point, sooner for some students and tasks and later for others, there is a large leap from rote memory to understanding. Students see this as an escape from the boring and seemingly never-ending task of following the teacher, to the freedom of realizing that there are limits within which things are related in ways that make sense and give a feeling of completeness, of mastery, of empowerment.

            I know my A, B, Cs. 

I can get a 90 degree angle from any 3, 4, 5 unit triangle.

I can increase my energy level by eating a good breakfast and not stuffing before going to bed.

Many highly marketed products are not worth buying.

What I want now and what I need are not the same thing.

The difference between traditional classroom multiple-choice and cafeteria multiple-choice is I must mark each question on the test but I don’t have to eat one of each product on the counter.

Getting from one place to another can be very complicated but at each intersection I have just three options.

Nebraska in 2009 created standardized tests that reflect the judgment of classroom teachers (see prior posts). The tests currently do not reflect the judgment of students. The part that teacher judgment and student judgment each play in determining raw scores on the Nebraska tests is estimated in the chart. The contribution each makes as the scores increase is not a linear event.

Teachers must do the heavy lifting until students develop the sense of responsibility needed to learn and report at higher levels of thinking. There is a shift from being dependent to being independent. This is a basic tenet of the Common Core State Standards movement.

[The opposing view is that children naturally learn like a scientist: observe, question, answer, verify. This is how they learn to walk and talk and mimic adult behavior. If this natural behavior had not been suppressed in their early school years, there would be no need to bring it back into play later, in my case, working with underprepared college freshmen who often behaved like passive middle school pupils.]

Once students have observed that they can succeed, that they are self-empowered, they still need teacher judgment to guide them. We are then dealing with a student who knows and can trust what is known as the basis for further learning and instruction. The student is now free to be innovative and creative, to, in my case, elect to take part in voluntary projects and oral reports to the class. These students modeled for others in the class how to be successful: make sense of assignments, and their own questions, rather than memorize nonsense.

[The difference between expected and observed results in the sciences and engineering is referred to as error. That same variation in arts and letters is referred to as being innovative and creative. And in medicine and social affairs, life/death and promotion/prison.]

Children want to learn to ride a bicycle, a skateboard, and to swim. Learning is scary but instantly rewarding. The Educational Software Cooperative, non-profit, was formed in 1994 to promote that same environment for students on computers. Now a more advanced environment, for the same software, is available on tablets and the Internet.  When they feel prepared they can report using software that measures both knowledge and judgment at all levels of thinking (Winsteps, partial credit Rasch model; and Power Up Plus, Knowledge and Judgment Scoring). A quantity and quality scored test can sort out which students are just repeating instruction and which students are reporting what they trust they know and can do.

In the future, I expect that assessment will be such an integrated function that it will be recorded as students learn at lower levels of thinking or in the classroom at all levels of thinking. Online courses are now doing this. The classroom of the future, IMHO, will still provide safe day care, teacher moderated group learning and assessment, and software learning and assessment for individual students. Equal emphasis will be placed on students learning and on their development as self-correcting, self-motivated high quality achievers. Success on Common Core State Standards tests will require such students.