Wednesday, October 17, 2012

Your Standardized Consortium Test


Two consortia (PARCC and SBAC) are again working on tests that go beyond simple questions answerable at any level of thinking. The questions will go through the usual calibration, equating, and bias-review processes. And, to the best of my knowledge, they will continue to be right-count scored, at the lowest levels of thinking.

Trying to assess 21st-century skills (bicycling) with the same old tricycles (forced-choice tests) seems rather strange to me, all the more so when the test is meant to assess college and job preparedness. These tests are to do more than create a ranked scale on which a predetermined portion will pass or fail, as in past years. They are supposed to actually measure something about students rather than just produce a ranked performance on a test.

Trying to raise the level of thinking required on tests at the beginning of NCLB resulted in a lot of very clever questions. I have no idea whether anyone could actually figure out why or how students answered those questions, relative to the reason they were on the test. On a forced-choice test you just mark. On a quantity and quality scored test, student responses fall into Expected, Guessing, Misconception, and Discriminating, because students mark only when they trust they know or can do; an accurate, honest, and fair test is obtained with no forced guessing required.

Higher levels (orders) of thinking involve metacognition: the ability to think about one’s own thinking, the ability to question one’s own work, and the ability to be self-correcting. These abilities are assessed with quantity and quality scoring of multiple-choice tests. The quality score indicates the degree of success each student has in developing these abilities (when learning and when testing). The quantity score measures the degree of mastery of knowledge and related skills. It is not that they know the answer but that they have developed the sense of responsibility to function at all levels of thinking and can therefore figure out the answers.
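To make the two scores concrete, here is a minimal sketch of the arithmetic, assuming the simplest definitions: quantity is the count of right marks, and quality is the share of right marks among the items a student chose to mark. These definitions are illustrative assumptions, not the exact PUP formulas.

```python
def quantity_quality(responses, key):
    """Score a test for knowledge and judgment.

    responses: one mark per item; None means the student chose
               not to mark (no forced guessing).
    key:       the correct answers, same length.

    Returns (quantity, quality): quantity counts right marks;
    quality is the share of marked items that are right.
    Illustrative definitions only, not the exact PUP formulas.
    """
    marked = [(r, k) for r, k in zip(responses, key) if r is not None]
    right = sum(1 for r, k in marked if r == k)
    quality = right / len(marked) if marked else None  # no marks: judgment untested
    return right, quality

# A student who marks only what is trusted:
print(quantity_quality(["B", None, "C", None, "A"], ["B", "D", "C", "A", "A"]))
# -> (3, 1.0): three items right, perfect judgment on what was marked
```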

My own experience has been that students learn metacognitive test-taking skills quickly when routinely assessed with a test scored for knowledge and judgment. (Over 90% voluntarily switched from guessing at answers [traditional right-count scoring] to quantity and quality scoring after two experiences with both.) It took three to four times as long for them to apply these skills to learning: to reading or observing with questions; to building a set of relationships that permitted them to verify that they could trust what they knew; to applying their knowledge and skill to answering questions they had not seen before.

The two consortia have to make a choice between beefing up traditional forced-choice multiple-choice tests or simply changing the test instructions so students can either continue with multiple-guess or switch to reporting what they trust they know (quantity and quality scoring). I am not convinced that beefing up traditional forced-choice questions will produce the sought-after results. The new questions must still be guarded against guessing, since students are still forced to guess. The guessing problem is solved by letting students report what they trust they know using quantity and quality scoring; no guessing required.

Two sample items from SBAC show how attempts are being made to improve test items. A careful examination indicates that again, we are facing clever marketing.

“Which model below best represents the fraction 2/5?”

“Even if students don’t truly have a deep understanding of what two-fifths means, they are likely to choose Option B over the others because it looks like a more traditional way of representing fractions. Restructuring this problem into a multipart item offers a clearer sense of how deeply a student understands the concept of two-fifths.”

The word “best” is a red flag. Test instructions often read, “Mark the best answer for each item.” It means: guess whenever you do not know; do not leave an item unmarked. Your test score is then a combination of what you know and your luck on test day. Low-ability students and test designers are well aware of this as they plan for each test.
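The “knowledge plus luck” claim is easy to put in numbers. A minimal sketch, assuming four-option items and blind guessing on every unknown item (an illustrative model, not a claim about any real test):

```python
def expected_right_count(n_items, n_known, options=4):
    """Expected score under forced guessing on a right-count test.

    Assumes a student marks every item: known items are right, and
    each unknown item is a blind guess with probability 1/options
    of being right. Illustrative model only.
    """
    return n_known + (n_items - n_known) / options

# A student who really knows 10 of 40 four-option items:
print(expected_right_count(40, 10))  # -> 17.5 expected right, 7.5 from luck
```

Nearly half of that expected score is luck, which is exactly what right-count scoring cannot separate out.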

“Best” in the item stem is also a traditional lazy way of asking a question. A better wording would be “is the simplest representation of”. There would then be just one right answer for the right reason: “the simplest representation” rather than “a more traditional way of representing”. Marketing. I agree that the item needs to be restructured or edited.

“For numbers 1a-1d, state whether or not each figure has 2/5 of its whole shaded.”

“This item is more complex because students now have to look at each part separately and decide whether two-fifths can take different forms. The total number of ways to respond to this item is 16. ‘Guessing’ the correct combination of responses is much less likely than for a traditional four-option selected-response item.”

The comment states that students must now “look at each part separately and decide” each of four yes/no answers. The item may be more complex to create with four answers but the answering is simpler for the student. Marketing.

Grouping four yes/no answers together to avoid the chance score of 50% is clever. The 2x2x2x2 (16) ways would become 3x3x3x3 (81) ways using quantity and quality scoring (if students were to mark at the lowest levels of thinking)! The catch here is that the possible ways and the probability of those ways are not the same thing. It is the functional ways, the number of response patterns that draw at least 5% of the marks, that matter. If only four ways were functional on the test, then all of the above reduces to a normal four-option item. Scoring the test for quantity and quality eliminates the entire issue, as forced guessing is not required when students have the opportunity to report what they trust accurately, honestly, and fairly. If you do not force students to guess, you do not need to protect test results from guessing.
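The counting itself is just powers: each yes/no part doubles the response patterns, and allowing an omit triples them. A minimal sketch of that arithmetic, assuming blind, independent guessing:

```python
from fractions import Fraction

parts = 4

ways_yes_no = 2 ** parts     # 2x2x2x2 = 16 possible response patterns
ways_with_omit = 3 ** parts  # 3x3x3x3 = 81 when an omit is allowed

# Chance of blindly guessing the one all-correct pattern,
# versus a traditional four-option item:
p_multipart = Fraction(1, 2) ** parts  # 1/16
p_four_option = Fraction(1, 4)         # 1/4

print(ways_yes_no, ways_with_omit)  # 16 81
print(p_multipart, p_four_option)   # 1/16 1/4
```

Note that this counts possible patterns only; as just argued, the functional patterns that actually draw marks may reduce the item right back to a four-option question.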

As I understand how this item will be scored, it condenses four items into one for a reason that is not entirely valid: guessing control. The statement that “students now have to look at each part separately” is presented in such a way that it implies they would not “have to look at each part separately” on the first example. Marketing again. Since there is no way to predict how an item will perform, we need actual test data to support the claims being made.

These two examples are not unique in striving to assess higher levels of thinking by combining two or more simple items into a more complex item. I dearly love the TAKS question that Maria Luisa Cesar included in her San Antonio Express-News article, 4 December 2011, 1B and 3B. Two simple questions along the lines of “Is this figure: A) a hexagon, B) an octagon, C) a square, D) a rectangle?” have been combined.

I was faced with this kind of question on my first day in school: “Color each of the six circles with the correct color.” I did not know my colors. I had six circles and six crayons. I lined up the crayons on the left side of my desk. After coloring a bit of each circle with a crayon, I put it on the right side of my desk. I had colored each circle with the correct color.

The same reasoning would get a correct answer here without knowing anything about hexagons or octagons: the figures are not the same. That leaves 7 sides and 5 vertices. Seven sides is not correct. So 5 vertices must be correct, whatever a “vertice” is.

The STAAR question figures are composed of vertices (4, 6, 5, 6), faces (5, 6, 5, 4), and edges (5, 9, 8, 9). A simple count of each yields a match only with option C. No knowledge of the geometric figures is required at the lowest levels of thinking.
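That elimination reduces to bookkeeping. A minimal sketch, using the counts listed above and a hypothetical stem that names the target counts (the actual stem is not reproduced here):

```python
# (vertices, faces, edges) for options A-D, as counted above
figures = {"A": (4, 5, 5), "B": (6, 6, 9), "C": (5, 5, 8), "D": (6, 4, 9)}

# Hypothetical target counts taken from the question stem
target = (5, 5, 8)

matches = [opt for opt, counts in figures.items() if counts == target]
print(matches)  # -> ['C']: a simple count, no knowledge of the figures needed
```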

The problem here is that the question author was thinking like a normal adult teacher. It took me a couple of years of using quantity and quality scoring (PUP) to make sense of the thinking students use when faced with a test question. I divided the sources of information that students used into two parts. One part is what students learned by just living: robins have red breasts and blue eggs. The other part is what they have learned in a reasoned, rational manner. These are, roughly, lower and higher levels of thinking; recall and formal reasoning; or passive and active learning.

On top of this is the human behavior of acting on what one believes rather than on what one knows. Here we are at the source of misconceptions that are very difficult to correct in most students and adults. (Teachers and teacher advocates have a pathological bias against free enterprise even though it generates the funds for their employment and solves problems the educational bureaucracy fails to solve. They also seem unable to relearn how to use a multiple-choice test to assess what students actually know rather than to just rank them.)

In summary, improving assessment by taking the old tricycle and adding dual wheels with deeper tread (multitasking and multipart items) is really not enough. It is time to move on to the bicycle, where the student is free to report what is trusted as the basis for further learning and instruction (spontaneous student judgment replaces that passive third wheel: waiting for the teacher to perform and correct).

And even more important is to create the environment in which students acquire the sense of responsibility needed to learn at higher levels of thinking. Scoring classroom tests for knowledge and judgment (PUP and the Partial Credit Rasch Model) does this: it promotes student development as well as knowledge and skill. Only when struggling students actually see, and can believe, that they are receiving credit for knowing what they know rather than for their luck on test day have I seen them change their study and test-taking habits.
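For readers curious about the second scoring method named above, here is a minimal sketch of the Partial Credit Rasch Model's category probabilities, assuming three ordered categories per item (wrong = 0, omit = 1, right = 2) and made-up step difficulties; real calibration in PUP or any Rasch package estimates these from test data.

```python
import math

def pcm_probabilities(theta, deltas):
    """Partial Credit Model category probabilities for one item.

    theta:  person ability (logits)
    deltas: step difficulties [d1, ..., dm] (logits); category x has
            numerator exp(sum of (theta - d_k) for k <= x), with
            category 0 fixed at exp(0).
    Returns probabilities for categories 0..m.
    """
    cum = [0.0]
    for d in deltas:
        cum.append(cum[-1] + (theta - d))
    numerators = [math.exp(c) for c in cum]
    total = sum(numerators)
    return [n / total for n in numerators]

# Illustrative only: wrong -> omit step at -0.5, omit -> right step at +0.5
print([round(p, 3) for p in pcm_probabilities(theta=0.0, deltas=[-0.5, 0.5])])
# -> [0.274, 0.452, 0.274]: an average student most often omits honestly
```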

Kaitlyn Steigler sums it up nicely in an article by Jane Roberts: “It used to be, I do, we do together, now you do.” “Now, the kids will take charge. The teaching will be based on what we figure they know or don’t know.” PUP scores multiple-choice tests both ways, so students can switch to reporting what they trust when they are ready. Then self-correcting students, as well as their teachers, will know what they know when they are learning, during the test, and as the basis for further learning.
