Wednesday, January 29, 2014

Test Scoring Myths for Students

The best test is a test that permits you to accurately, honestly, and fairly report what you know and can do. You know how to question, to get answers, and to verify. You know what you know and what you have yet to learn. This operates at two levels of thinking. It is a myth that a forced-choice multiple-choice test measures what you trust you know and can do.

At the beginning of any learning operation, you learn to repeat and to recall. Next you learn to relate the bits you can repeat and recall. By the end of a learning operation you have assembled a web of skills and relationships. You start at lower levels of thinking and progress to higher levels of thinking. Practice takes you from slow conscious operations to fast automatic responses (multiplication or roller skating). It is a myth that learning occurs primarily by responding to a teacher in a classroom.

Your attitude during learning and testing is important. Your maturity is indicated by your ability to get interested in new topics or activities your teacher recommends (during the course). As a rule of thumb, a positive attitude is worth about one letter grade on a test. It is a myth that you can easily learn when you have a negative attitude.

Your expectations are important. You tend to get what you expect. A nine-year study with over 3000 students indicated that students tend to get the grade they expected at the time they enrolled in the class, based on their lack of information, misinformation, and attitude. It is a myth that you cannot do better than your preconceived grade.

Learning and testing are one coordinated event when you can see the result of your practicing directly (target practice or skateboarding). This situation also occurs when you are directly tutored by a person or by a person’s software. It is a myth that you must always take a test separately from learning.

Complex learning operations go through the same sequence of learning steps. The rule of three applies here. Read or practice from one source to get the basic terms or actions. Read or practice from a second set to add any additional terms or actions. Read or practice from a third set to test your understanding, your web of knowledge and skill relationships. It is a myth that you must always have another person test your learning (but another person can be very helpful).

That other person is usually a teacher who cannot teach and test each pupil or student individually. The teacher also selects what is to be learned rather than letting you make the choice. The teacher also selects the test you will take. It is a myth that your teachers have the qualities needed to introduce you to the range of skills and knowledge required for an honest, self-supporting citizen.

Teaching usually takes place during scheduled time periods. In extreme situations, only what is learned in those scheduled time periods will be scored. This is one basis for assessing teacher effectiveness. It is a myth that the primary goal of traditional schools is student learning and development.

Traditional multiple-choice is defective. It was crippled when it was adapted from its use with animal experiments: the option of no response, “do not know”, was eliminated to make classroom scoring easier. It is a myth that you should not have this option to permit accurate, honest, and fair assessment.

Traditional multiple-choice promotes selecting the best right answer, which uses the lowest levels of thinking. The minimum requirement is making a mark for each question. It is a myth that such a score measures what you know or can do. The score ranks you on the test.

The average test score describes the test, not you. (Table 15 or Download)
Your score may rank you above or below average. It is a myth that you will always be safe with an above average score (passing).

The normal distribution of multiple-choice test scores is based on your luck on test day. The normal distribution is desired for classes in schools designed for failure. It is a myth that a class should not have an average score of 90%.

Luck on test day will distribute 2/3 of your classmates’ multiple-choice scores within the bubble in the center of a normal distribution; that is, within one standard deviation (SD) of the average. (Table 15 or Download) [SD = SQRT(Variance) and the Variance = SUM((Deviation from the Average)^2)/N = Mean Sum of Squares = MSS]
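
The bracketed formulas can be tried on a small set of numbers. This is a minimal sketch, not the calculation behind Table 15: the ten scores are made up for illustration, and the function name is hypothetical.

```python
from math import sqrt

def average_variance_sd(scores):
    """Return the average, the variance (divided by N), and the SD."""
    n = len(scores)
    average = sum(scores) / n
    # Variance = SUM((deviation from the average)^2) / N = MSS
    variance = sum((s - average) ** 2 for s in scores) / n
    return average, variance, sqrt(variance)

# Hypothetical class of ten raw scores (illustration only).
scores = [12, 15, 17, 18, 20, 20, 22, 23, 25, 28]
average, variance, sd = average_variance_sd(scores)

# Share of the class within one SD of the average.
within_one_sd = sum(1 for s in scores if abs(s - average) <= sd) / len(scores)

print(average, variance, round(sd, 2), within_one_sd)  # 20.0 20.4 4.52 0.6
```

With only ten made-up scores the share inside one SD is 6 in 10; the 2/3 figure describes a normal distribution, which a class this small only approximates.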

Your grade (cut score) is set by marking off the distribution of classmate scores in standard deviations: F (<-2); D (-2 to -1); C (-1 to +1); B (>+1); A (>+2). Your raw score grade is the sum of what you know and can do, your luck on test day, and your set of classmates.
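
Marking off grades in standard deviations can be sketched as below. The band edges are an assumption for illustration (F below -2 SD, D from -2 to -1, C from -1 to +1, B from +1 to +2, A above +2); a teacher may place them elsewhere. The function name is hypothetical.

```python
def letter_grade(score, class_average, class_sd):
    """Assign a letter grade from the distance to the class average in SDs."""
    z = (score - class_average) / class_sd  # distance from the average in SD units
    if z < -2:
        return "F"
    if z < -1:
        return "D"
    if z <= 1:
        return "C"
    if z <= 2:
        return "B"
    return "A"

# The same raw score earns a different grade in a different class.
print(letter_grade(26, 20, 4))  # B (1.5 SD above the average)
print(letter_grade(26, 24, 4))  # C (0.5 SD above the average)
```

The second call shows the point made above: the grade depends on your set of classmates as well as on your own score.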

Raw scores can be adjusted by shifting their distribution, higher or lower, and by stretching (or shrinking) the distribution to get a distribution that “looks right”. It is a myth that your teacher can only select the right mix of questions to get a raw score distribution that “looks right”.
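
Shifting and stretching is a simple linear adjustment. The sketch below assumes the adjustment is made to hit a chosen average and SD; the function name and the target values are hypothetical, and real adjustments may follow other rules.

```python
from math import sqrt

def shift_and_stretch(scores, target_average, target_sd):
    """Move the scores to a new average and stretch them to a new SD."""
    n = len(scores)
    average = sum(scores) / n
    sd = sqrt(sum((s - average) ** 2 for s in scores) / n)
    # Requires sd > 0: the raw scores must not all be identical.
    return [target_average + (s - average) * target_sd / sd for s in scores]

raw = [10, 20, 30]  # hypothetical raw scores
adjusted = shift_and_stretch(raw, target_average=75, target_sd=10)
print([round(a, 1) for a in adjusted])
```

The adjustment changes every score but never the order of the students, which is why it can make a distribution “look right” without adding any information about what anyone knows.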

Some questions perform poorly. They can be deleted and a new, more accurate, scored distribution created. It is a myth that every question must be retained.

Discriminating questions are marked right only by high scoring classmates and marked wrong by low scoring classmates. (Table 15 or Download) It is a myth that all questions should be discriminating.
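
One common way to put a number on discrimination is to compare the top-scoring half of the class with the bottom-scoring half on a single question. This upper-minus-lower index is an assumption chosen for illustration; it is not necessarily the statistic used in Table 15. The mark table (1 = right, 0 = wrong) is made up.

```python
def discrimination_index(marks, question):
    """Share right in the top half minus share right in the bottom half.

    marks: one row per student, one column per question (1 right, 0 wrong).
    Needs at least two students.
    """
    ranked = sorted(marks, key=sum, reverse=True)  # highest total score first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(row[question] for row in upper) / half
    p_lower = sum(row[question] for row in lower) / half
    return p_upper - p_lower

# Hypothetical marks: four students, three questions.
marks = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(discrimination_index(marks, 0))  # 0.0: everyone got it right (mastery)
print(discrimination_index(marks, 1))  # 1.0: only the high scorers got it right
```

A question the whole class has mastered scores 0 on this index, which is one way to see why not all questions should be discriminating.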

Discriminating questions produce your class raw score distribution. About 5 to 10 are needed to create the amount of error that yields a range of five letter grades. It is a myth that discriminating questions assess mastery.

The reliability (reproducibility, precision) of your raw score can be predicted, but not that of your final (adjusted) score. Test reliability (KR20) is based on the ratio of the variation (the variance) between student scores (external column) to the variation within question difficulty mark patterns (internal columns). (Table 15 or Download)

This makes sense: The smaller the amount of error variance within the question difficulty internal columns, with respect to the variance between student scores in the external column, the greater the test reliability. Discriminating, difficult questions spread out student scores more (yield higher variance) than they increase the error variance within the questions. If there were no error variance, a test would be totally reliable (KR20 = 1). It is a myth that a good informative test must maximize reliability.
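
The usual KR20 formula is KR20 = (k/(k-1)) x (1 - SUM(p x q)/Variance of student scores), where k is the number of questions, p is the share of students marking a question right, and q = 1 - p. The sketch below applies it to a made-up mark table, with the variance divided by N as above. It is an illustration, not the spreadsheet behind Table 15.

```python
def kr20(marks):
    """KR20 for a table of marks: one row per student, 1 right, 0 wrong."""
    n_students = len(marks)
    k = len(marks[0])  # number of questions

    # Variance between student scores (the external column).
    totals = [sum(row) for row in marks]
    average = sum(totals) / n_students
    score_variance = sum((t - average) ** 2 for t in totals) / n_students

    # Variance within each question column (the internal columns): p * q.
    sum_pq = 0.0
    for q in range(k):
        p = sum(row[q] for row in marks) / n_students
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / score_variance)

# Hypothetical marks: five students, four questions.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(marks), 2))  # 0.8
```

Here the variance between student scores is 2.0 and the question variances sum to 0.8, so the ratio 0.8/2.0 is what holds the reliability below 1.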

The test reliability can help predict the average test score your class would get if it were to take another test over the same set of skills and knowledge. The Standard Error of Measurement (SEM) of your test is the range of error (from all of the above effects) for the average test score. (Table 15 or Download) The SD of the test and the test reliability are combined to obtain the SEM. The test reliability extracts a portion of the SD. If the test reliability were 1 (totally reliable), the SEM would be 0 (no error), and the class would be expected to get the same class test score on a retest.
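
The usual way to combine the two is SEM = SD x SQRT(1 - reliability). A minimal sketch, with made-up values for the SD and the reliability:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * SQRT(1 - reliability)."""
    return sd * sqrt(1 - reliability)

print(standard_error_of_measurement(10, 0.75))  # 5.0: half of the SD is error
print(standard_error_of_measurement(10, 1.0))   # 0.0: a totally reliable test
```

Even a reliability of 0.75 leaves an SEM half as large as the SD, so a respectable-sounding reliability can still mean a wide range of error.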

And finally, what can you expect about the precision of your score and your retest score (provided you have not learned any more)? A retest is of critical importance to students needing to reach a high stakes cut score. If the SEM or CSEM ranges widely enough, you do not need to study. Just retake the test a couple of times and your luck on test day may get you a passing score. It is a myth that a probability of getting a passing grade 2/3 of the time will ensure you get the passing grade if you need a second trial.
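
The arithmetic behind that myth is short. The sketch assumes each attempt is independent and carries the same chance of passing, which is a simplification (no learning, no fatigue); the function name is hypothetical.

```python
def chance_of_at_least_one_pass(chance_per_attempt, attempts):
    """Chance of passing at least once in a number of independent attempts."""
    chance_of_failing_every_time = (1 - chance_per_attempt) ** attempts
    return 1 - chance_of_failing_every_time

print(round(chance_of_at_least_one_pass(2 / 3, 2), 3))  # 0.889
print(round(chance_of_at_least_one_pass(2 / 3, 3), 3))  # 0.963
```

Two attempts at 2/3 each still leave about one chance in nine of failing both, so a second trial improves the odds without ensuring a pass.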

The Conditional [on your raw score] Standard Error of Measurement (CSEM) extracts the variance from only your mark pattern (Table 22). [CSEM = SQRT(Variance within your marks X the number of questions)] Your CSEM will be very small if you have a very high or low score. This limits the prospects of a passing score by retaking a test without studying.
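
The bracketed CSEM formula can be sketched for right/wrong marks, where the variance within your marks is p x (1 - p) and p is your share of right marks. The sketch assumes the variance is divided by N, as in the SD formula earlier; dividing by N - 1 gives slightly larger values. The 50-question mark patterns are made up.

```python
from math import sqrt

def csem(my_marks):
    """CSEM = SQRT(variance within your marks * number of questions)."""
    n = len(my_marks)              # number of questions
    p = sum(my_marks) / n          # share marked right
    variance_within = p * (1 - p)  # variance of 1/0 marks, divided by N
    return sqrt(variance_within * n)

middling = [1] * 25 + [0] * 25  # 25 of 50 right
high = [1] * 45 + [0] * 5       # 45 of 50 right
perfect = [1] * 50              # 50 of 50 right

print(round(csem(middling), 2))  # 3.54
print(round(csem(high), 2))      # 2.12
print(csem(perfect))             # 0.0
```

The error shrinks as the score moves toward either end of the scale, which is the limit on retaking without studying described above.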

Now the choice before a retest: to study, to change testing habits, or to trust to luck on test day. Get a copy of the blueprint used in designing the test. A blueprint lists in detail what will be covered and the type of questions. Question each topic or skill. It is easier to answer questions other people have written if you have already created and answered your own questions. Use the advice in the first five paragraphs above and work up into higher levels of thinking, meaning making (build a web of relationships that makes sense to you; visualize, sketch, or draw every term).

A change in testing habits may also be in order. Many students who do not “test well” are bright, fast memorizers, but lack meaningful relationships that make sense to themselves. They are still learning for someone else (the test) and scanning each question for the “one right answer”. With meaningful relationships in mind you have the information in hand to answer a number of related questions. You are not limited to just matching what you recall to the question answers. [Mark out wrong answers and guess from the remaining answers.]

And now for the “Hail Mary” approach. First, as a rule of thumb, your score on a test written by someone other than your teacher (a standardized test for example) will be one to two letter grades below your classroom test scores. If your failing test score is within 1 SEM of the cut score, you can expect a retest score within this range 2/3 of the time. The same prediction is made with your CSEM value, which can range above and below the SEM value. If your failing test score is more than 1 SEM or 1 CSEM below the cut score, you have no option other than to study. It is a myth that students passing a few points above the cut score will also pass on a retest. [Near passes are safe. Near failures are not.]
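
The check described here is a comparison of three numbers. A minimal sketch with made-up values; the function names are hypothetical, and the error can be either the SEM or your CSEM.

```python
def retest_band(score, error):
    """Range expected to hold a retest score about 2/3 of the time."""
    return score - error, score + error

def cut_score_within_reach(score, cut_score, error):
    """True if the cut score falls inside the band around your score."""
    low, high = retest_band(score, error)
    return low <= cut_score <= high

# Hypothetical cut score of 70 and an error (SEM or CSEM) of 3 points.
print(cut_score_within_reach(68, 70, 3))  # True: the band 65 to 71 includes 70
print(cut_score_within_reach(60, 70, 3))  # False: the band 57 to 63 does not
```

The same check applied to a score of 72 also returns True: the band 69 to 75 reaches below the cut score, which is why a pass by a few points does not guarantee a pass on a retest.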

Also please keep in mind that all of the math dealing with the variation between and within columns and rows (the variance) can be done on the student and question mark patterns with no knowledge of the test questions or the students. It is a myth that good statistical procedures can improve poor question or student performance. Teacher and psychometrician judgment on the other hand can do wonders!

The standardized test paradox: A good blueprint to guide calibrated question selection for the test is the basis for low scores and a statistically reliable test. Good student preparation is the basis for high scores (mastery) and a statistically unreliable test (it cannot spread student scores out enough for the distribution to “look right”).

The sciences, engineering, and manufacturing use statistics to reduce error to a minimum (low maintenance cars, aircraft, computers, and telephones). Only in traditional institutionalized education (schools designed for failure) is error intentionally introduced to create a score range that “looks right” for setting grades and ranking schools. This is all nonsense for schools designed for mastery (which advance students after they are prepared for the next steps). It is a myth (and an entrenched excuse for failure by the school) that student score distributions must fit a normal, bell-shaped, curve of error.

Mastery schools are now being promoted as the burden of record keeping is easily computerized. The Internet makes mastery schools available everywhere and at any time. This will bring a marked change to traditional schooling in the next few years. This change can be seen in the “flipped” classroom (a modern version of assigned [deep] reading before class discussion). It is a myth that the “flipped” classroom is something new.

Current educational software removes the time lag in the question-answer-and-verify learning cycle that was introduced by grouping students in classes and then extended with standardized tests. Learning and assessment are again joined to promote mastery of assigned skills and knowledge. Students advance when they are ready to succeed at the next levels. It is a myth that “formative assessments” are actually functional when test results are not available in an operational time frame (seconds to a few days).

Standardized tests will continue to rank students and schools, as the tests mature to certifying mastery for students who learn and excel anywhere and at any time. It is a myth that current substantive standardized tests (that do not let students report what they trust they know or can do) can “pin point exactly what a student knows and needs to learn”.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


