Multiple-Choice Reborn

Student-Centered Learning

2018-05-11T07:03:00.000-07:00

I have spent over 40 years preaching the need for schools designed for success rather than for failure. Yesterday I happened upon an article by Nicholas Donohue that presents convincing evidence that this is being done by transforming high schools in the New England states. It is called student-centered learning.

My attempt in 1981-1989 used a campus computer system at NWMSU, textbook, lecture, laboratory, AND voluntary student presentations, research, and projects. This work has been further developed in Multiple-Choice Reborn and summarized in Knowledge and Judgment Scoring - 2016. In 1995, Knowledge Factor patented an online confidence-based learning system (now Amplifire). Masters, 1982, developed Rasch partial credit scoring (PCS).

All three put the student in the position of being in charge of learning and reporting; at all levels of thinking. They approached evaluating an apple from the skin, as traditional multiple-choice (guess) testing is done.

PCS just polished the apple skin. The emphasis was still on the surface, the score, at that time. Knowledge Factor made the transition from the concrete level of thinking to understanding (skin to core), and provided the meat between in Amplifire. Nuclear power plant operators and doctors were held to a much higher responsibility (self-judgment) standard (far over 75%, over 90% mastery) than is customary in a traditional high school classroom (60% for passing).

My students voted to give knowledge and judgment equal value (1:1 or 50%:50%). Voluntary activities replaced one letter grade (10% each). The students were then responsible for reporting what they knew or could do. They could mix several ways of learning and reporting.

A student with a knowledge score of 50% and a quality score of 100% would end up with about the same test score as a student who marked every question (guessed) for a quality, quantity, and test score of 75% (with no judgment).
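The comparison above can be sketched in a few lines of Python. This is a minimal sketch, not the Power Up Plus implementation: it assumes the 1:1 weighting the students voted for, expressed as right = 1 point, omit = 0.5, and wrong = 0 per question. The function name `kjs_scores` and the 100-question test are hypothetical.

```python
def kjs_scores(right, wrong, omit):
    """Return (quantity, quality, test_score) as fractions.

    Assumed point values (an illustration, not the published scoring code):
    right = 1, omit = 0.5 (credit for judgment), wrong = 0.
    """
    total = right + wrong + omit
    marked = right + wrong
    quantity = right / total                       # share of the test reported right
    quality = right / marked if marked else 0.0    # share of marks that are right
    test_score = (right + 0.5 * omit) / total
    return quantity, quality, test_score


# Hypothetical 100-question test.
scholar = kjs_scores(right=50, wrong=0, omit=50)   # marks only what is known
tourist = kjs_scores(right=75, wrong=25, omit=0)   # marks every question

print(scholar)  # (0.5, 1.0, 0.75)
print(tourist)  # (0.75, 0.75, 0.75)
```

Both students earn the same test score under these assumptions; only the quality score separates them.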

These two students are very different. One is at the core of being educated (scholar). The other is only viewing the skin (tourist). The first one has a solid basis for self-instruction and further learning; is ready for independent scholarship. The apple seeds germinate (raise new questions) and produce more fruit (without the tree).

We know much less about the second student, and about what must be “re-taught”. The apple may just be left on the tree in what is often a vain effort to ripen it. Such is the fate of students in schools designed for failure (grades A to F).

In extreme cases, courses are classified by difficulty or assigned PASS/FAIL grades. My General Biology students were even “protected” so I could not know which student was in the course for a grade or pass/fail.

Students assess the level of thinking required in a course by asking on the first day, “Are your tests cumulative?” If so, they leave. This is a voluntary choice to stay at the lowest levels of thinking. Memory care residents do not have that choice.

There is a frightening parallel between creating a happy environment for memory care residents here at Provision Living at Columbia, and creating an academic environment (national, state, school, and classroom) that yields a happy student course grade. Both end up at the end of the day pretty much where they started, at the lowest levels of thinking.

Many students made the transition from memorizing nonsense for the next test to questioning, answering, and verifying; learning for themselves and knowing they were “right”. This is self-empowering. They started getting better grades in all of their courses. They had experienced the joy of scholarship, an intrinsic reward. “I do know what I know.” The independent quality score in knowledge and judgment scoring directed their path.

Student-centered learning is not new. The title is. This is important in marketing to institutionalized education. What is new is that at last entire high schools are now being transformed for the right reason: student development rather than standardized test scores based on lower levels of thinking instruction and testing.

These students should be ready for college or other post high school programs. They should not be the under-prepared college students we worked with. The General Biology course was to last for only a few years; until the high schools did all of this work. In practice, the course became permanent. Biology did not become a required course in all high schools.

My interest in this project was to find a way to know what each student really knew, believed, could do, and was interested in, when a new science building was constructed in 1980 with 120 seat lecture halls. The unexpected consequence of promoting student development, based on the independent quality and quantity scores, was not only a bonus but appropriately needed for under-prepared college students. Over 90% of students voluntarily switched from guessing right answers to reporting what they actually knew and could do.

Information and Reliability

2015-05-13T03:00:00.000-07:00

#15

How does IRT information replace CTT reliability? Can this be found on the audit tool (Table 45)?

This post relates my audit tool, Table 45, Comparison of Conditional Error of Measurement between Normal [CTT] Classroom Calculation and the IRT Model to a quote from Wikipedia (Information). I am confident that the math is correct. I need to clarify the concepts for which the math is making estimates.

Table 45

“One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score variance.”

See test reliability (a ratio), KR20, True/Total Variance, 0.29 (Table 45a).

“This index is helpful in characterizing a test’s average reliability, for example in order to compare two tests.”

The test reliabilities for CTT and IRT are also comparable in Tables 45a and 45c: 0.29 and 0.27.

“But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test’s range, for example, generally have more error associated with them than scores closer to the middle of the range.”

Table 46

Chart 82

See Table 45c (classroom data) and Table 46, col 9-10 (dummy data). For CTT the values are inverted (Chart 82, classroom data and Chart 89, dummy data).

Chart 89

“Item response theory advances the concept of item and test information. Information is also a function of the model parameters. For example, according to Fisher information theory, the item information supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or, . . .” [I = pq].

See Table 45c, p*q CELL INFORMATION (classroom data). Also on Chart 89, the cell variance (CTT) and cell information (IRT) have identical values (0.15) from Excel =VAR.P and from pq (Table 46, col 7, dummy data).
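The identity between the cell variance and the cell information can be checked directly. A minimal sketch, assuming the dummy case of 17 right marks out of 21; `statistics.pvariance` is the Python standard-library counterpart of Excel's =VAR.P, and the variable names are mine.

```python
from statistics import pvariance

# One student's marks on a 21-item test: 17 right (1) and 4 wrong (0).
marks = [1] * 17 + [0] * 4

p = sum(marks) / len(marks)   # proportion right
q = 1 - p                     # proportion wrong

cell_variance = pvariance(marks)   # CTT view: population variance of the marks
cell_information = p * q           # IRT view: p*q

print(round(cell_variance, 2), round(cell_information, 2))  # 0.15 0.15
```

For right/wrong (1/0) marks the population variance is always p*q, which is why the two columns agree.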

“The standard error of estimation (SE) is the reciprocal of the test information of at a given trait level, is the . . .” [1/SQRT(pq)].

Is the “test information … at a given trait level” the Score Information (3.24, red, Chart 89, dummy data) for 17 right out of 21 items? Then the reciprocal of 3.24 is 0.31, the error variance (green, Chart 89 and Table 46, col 9) in measures on a logit scale. And the IRT conditional error of estimation (SE) would be the square root: SQRT(0.31) = 0.56 in measures. And this inverted would yield the CTT CSEM: 1/0.56 = 1.80 in counts.

[[Or the SQRT(SUM(p*q)) = SQRT((0.15) * 21) = SQRT(3.24) = 1.80 (in counts) and the reciprocal is 1/1.80 = 0.56 in measures.]]
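The arithmetic in the two paragraphs above can be followed step by step. This is a sketch of the dummy-data case only (17 right out of 21 items); the variable names are mine, not Winsteps output labels.

```python
import math

n_items, n_right = 21, 17
p = n_right / n_items
q = 1 - p

score_information = n_items * p * q      # sum of p*q over the 21 items
error_variance = 1 / score_information   # first inversion (logit scale)
se_measures = math.sqrt(error_variance)  # IRT SE, in measures
csem_counts = 1 / se_measures            # second inversion, CTT CSEM in counts

print(f"{score_information:.2f} {error_variance:.2f} "
      f"{se_measures:.2f} {csem_counts:.2f}")  # 3.24 0.31 0.56 1.80

# The shortcut in the bracketed note gives the same CSEM.
print(f"{math.sqrt(score_information):.2f}")   # 1.80
```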

The IRT (CSEM) in Chart 89 is really the IRT standard error of estimation (SE or SEE). On Table 45c, the CSEM (SQRT) is also the SE (conditional error of estimation) obtained from the square root of the error variance for that ability level (17 right, 1.73 measures, or 0.81 or 81%).

“Thus more information implies less error of measurement.”

See Table 45c, CSEM, green, and Table 46, col 9-10.

“In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range.”

Chart 92

Table 47

The same generality applies to the item information functions (IIF)s in Chart 92 but it is not very evident. The item with a difficulty of 10 (IIF = 1.80, Table 47) is also highly discriminating. The two easiest items had negative discrimination; they show an increase in information as student ability decreases toward zero measure. The generality applies best near the average test raw score of 50% or zero measure; which is not on the chart (no student got a score of 50% on this test).

This test had an average test score of 80%. This has spread the item information function curves out (Chart 92). They are not centered on the raw score of 50% or the measures zero location. However, each peaks near the point where item difficulty in measures is close to student ability in measures. This observation is critical in establishing the value of IRT item analysis and how it is used. This makes sense in measures (a scale based on the natural log of the ratio of right to wrong marks) but not in raw scores (normal linear scale) as I first posted in Chart 75 with only count and percent scales.

“Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range.”

This is very evident in Table 47 and Chart 92.

“Because of local independence, item information functions are additive.”

See Test SEM (in measures) = 0.64, identical to the Winsteps Table 17.1 MODEL S.E. MEAN (Table 45c).

“Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.”

“Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification.”

At this point my audit tool, Table 45, falls silent. These two mathematical models are a means for only estimating theoretical values; they are not the theoretical values nor are they the reasoning behind them. CTT starts from observed values and projects into the general environment. IRT can start with the perfect Rasch model and select observations that fit the model. The two models are looking in opposite directions. CTT uses a linear scale with the origin at zero counts. IRT sets its log ratio point-of-origin (zero) at the 50% CTT point. I must accept the concept that CTT is a simplification of IRT on the basis of authority at this point.

“In the place of reliability, IRT offers the test information function which shows the degree of precision at different values of theta, [student ability].”

I would word this, “In ADDITION to reliability,” (Table 45a, CTT = 0.29 and 45c, IRT = 0.27). Also the “IRT offers the ITEM information function which shows the degree of precision at different values . . .”

“These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chosen items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single “cutscore,” and where the actual passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cutscore. These items generally correspond to items whose difficulty is about the same as that of the cutscore.”

The eleven items in Table 47 and Chart 92 each peak near the point where item difficulty in measures is close to student ability in measures. The discovery or invention of this relationship is the key advantage of IRT over CTT.

These data show that a test item need not have (a commonly recommended) average score near 50% for useable results. Any cutscore from 50% to 80% would produce useable results on this test with an average score of 80% and cutscore (passing) of 70%.

"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."

My understanding is that with CTT an item may be 50% difficult for the class without revealing how difficult it is for each student (no location). With IRT every item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).

I do not know what part of IRT is invention and what part is discovery on the part of some ingenious people. Two basic parts had to be fit together: information and measures by way of an inversion. Then a story had to be created to market the finished product; the Rasch model and Winsteps (full and partial credit) are the limit of my experience. The unfortunate name choice of “partial credit” rather than knowledge or skill and judgment may have been a factor in the Rasch partial credit model not becoming popular. The name, partial credit, falls into the realm of psychometrician tools. The name, Knowledge and Judgment, falls into the realm of classroom tools needed to guide the development of scholars as well as obtain maximum information from paper standardized tests; where students individually customized their tests (accurately, honestly, and fairly) rather than CAT where the test is tailored to fit the student; using best-guess, dated, and questionable second hand information.

IRT makes CAT possible. Please see "Adaptive Testing Evolves to Assess Common-Core Skills" for current marketing, use, and a list of comments, including two of mine. The exaggerated claims of test makers to assess and promote developing students by the continued use of forced-choice lower level of thinking tests continue to be ignored in the marketing of these tests to assess Common Core skills. Increased precision of nonsense still takes precedence over an assessment that is compatible with and supports the classroom and scholarship.

Serious mastery: Knowledge Factor.
Student development: Knowledge and Judgment Scoring (Free Power Up Plus) and IRT Rasch partial credit (Free Ministep).
Ranking: Forced-choice on paper or CAT.

CTT and Rasch IRT Item Analysis Paradox

2015-04-08T03:00:00.000-07:00

[The solution is in Chart 89, Item Analysis flow sheet.]

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision” appears in “Reliability and separation of measures.” A more complete discussion is given under the title, “Standard Errors and Reliabilities: Rasch and Raw Score”.

Chart 82

The apparent paradox is graphed in Chart 82. Precision on one scale is the inverse or reciprocal of the other: 1/0.44 = 2.27 and 1/2.27 = 0.44.

Table 45

I edited Table 32 to disclose a full development of a comparison between CTT and IRT using real classroom data (Table 45). This first view is too complicated.

Chart 83

Chart 83 (CTT) and Chart 84 (IRT) summarize the statistics behind Table 45.

Chart 84

Table 45 includes the process of combining student scores and item difficulties onto one logit scale.

Table 46

I then isolated the item analysis from the complete development above by skipping the formation of a single scale from real classroom data. Instead, I fed the IRT item analysis a percent (dummy) data set (Table 46) with the same number of items as in the classroom test (21 items). I then graphed the data strings in Table 46 as a second, simpler, view of IRT item analysis.

Chart 85

Turning right counts (Chart 85, blue) into a right/wrong ratio string (red) yields a very different shape than a straight line right mark count. We now have the rate at which each mark completes a perfect score of 21 or 100%. It starts slow (1/20), with the last mark racing 20 times (20/1) the average rate (10/11 or 11/10, near 1, in Table 46, col 2).

Taking the natural log of the ratio (a logit, Table 46, col 3) creates the Rasch model IRT characteristic curve (Chart 85, purple) with the zero logit point of origin positioned at the 50% normal value. [Ratios and log ratios have no dimensions.]
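The two transformations just described (count to right/wrong ratio, ratio to logit) can be sketched for the 21-item dummy test. The function name `ratio_and_logit` is hypothetical, and the column references follow my reading of Table 46.

```python
import math


def ratio_and_logit(right, n_items=21):
    """Return the right/wrong ratio (col 2) and its natural log, the logit (col 3)."""
    wrong = n_items - right
    ratio = right / wrong
    return ratio, math.log(ratio)


for right in (1, 10, 11, 20):
    ratio, logit = ratio_and_logit(right)
    print(f"{right:2d} right: ratio {ratio:5.2f}, logit {logit:+.2f}")
```

With an odd number of items no score sits exactly at 50%, so the counts of 10 and 11 straddle the zero logit point with logits of equal size and opposite sign.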

Chart 86

Winsteps, at this point, has reduced student raw scores and item difficulties (in counts) into one logit scale of student ability and item difficulty with the dimension of a measure. These are then combined into the probability of a right answer to start the item analysis. The percent (dummy) input (Table 46, col 6) replaces this operation (Chart 86). This simplifies the current discussion to just item analysis and precision.

Chart 87

Percent input and Information for one central cell are plotted in Chart 87. Cell information is limited to a maximum of 0.25 at a student raw score of 50% (Table 46, col 7), when combining p*q (0.50 * 0.50 = 0.25 ). The next step is to adjust the cell information for 21 items on the test (Column 8).

Chart 88

Chart 88 completes the comparison of CTT and IRT calculations on Table 46. The inversion of Information (col 9) yields the error variance that aligns with student score measures such that the greatest precision (smallest error variance) is at the point of origin of the logit scale. The square root of the error variance (col 10) yields the CSEM equivalent for IRT measures. And then, by a second inversion these measure values are transformed into the identical normal CSEM values (col 11 - 12) for a CTT item analysis. The total view in Table 45 was too complicated. Charts 85 – 88 are also.
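The column-by-column chain can be run for every raw score to show where each scale places its greatest precision. This is a sketch of the dummy-data calculation only; the column numbers in the comments are my reading of the description above, and `table46_row` is a hypothetical name.

```python
import math


def table46_row(right, n_items=21):
    """Rebuild the item-analysis columns for one raw score (dummy data)."""
    p = right / n_items
    cell_info = p * (1 - p)           # col 7: p*q, at most 0.25
    score_info = n_items * cell_info  # col 8: adjusted for 21 items
    error_var = 1 / score_info        # col 9: first inversion
    se_measures = math.sqrt(error_var)  # col 10: IRT error, in measures
    csem_counts = 1 / se_measures     # col 11-12: second inversion, in counts
    return cell_info, score_info, error_var, se_measures, csem_counts


rows = {right: table46_row(right) for right in range(1, 21)}

for right in (1, 10, 17, 20):
    *_, se_measures, csem_counts = rows[right]
    print(f"{right:2d} right: SE {se_measures:.2f} measures, "
          f"CSEM {csem_counts:.2f} counts")
```

Moving from the middle of the scale toward either extreme, the error in measures grows while the CSEM in counts shrinks, which is the inversion graphed in Chart 88.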

Chart 89

My third, simple, and last view is a flowchart (Chart 89) constructed from the above charts and tables.

The percent (dummy) data produce identical (1.80) standard error of measurement (CSEM) results with CTT and IRT item analysis (Table 46, col 11 - 12 and Chart 89) even though CTT starts with a raw score count (17), and skips the score mean (0.81), and the IRT item analysis starts with a score mean (0.81).

CTT captures the variation (in marks) within a student score in the variance (0.15); IRT captures the variation (in probabilities) as information (0.15). In all cases the score variance and score information are treated with the square root (SQRT, pink) to yield standard errors (estimates of precision): CTT CSEM, on a normal scale in counts, and IRT (CSEM), on a logit scale in measures.

In summary, as CTT score variance and IRT score information (red) increase, CSEM increases on a normal scale (Chart 89). Precision decreases. At the same time IRT error variance (green) and IRT (CSEM) decrease on a logit scale. Precision increases with respect to the Rasch model point of origin zero (50% on a normal scale). This inversion aligns the IRT (CSEM) to student scores in measures on a logit scale.

It appears that the meaning of this depends upon what is being measured and how well it is being measured. CTT measures in counts and sets error (based on the score variance, Chart 89, red) about the student score count on a normal scale (CSEM). IRT converts counts to “measures”. IRT then measures in “measures” and sets error (based on the error variance, Chart 89, green) about the point of origin (zero) on a logit scale that corresponds to 50% on a normal scale.

Chart 90

The two methods of feeding an item analysis are using two different reference points. This was easier to see when I took the core out of Chart 88 and plotted it in a more common form in Chart 90. Precision on both scales is shown in solid black. This line intersects the Rasch model IRT characteristic curve where normal is 50% and IRT is zero. At a count of 17 right, the normal scale shows higher precision; the logit scale shows lower precision with respect to the perfect Rasch model.

The characteristic curve is a collection of points where student ability and item difficulties match resulting in students with this ability getting 50% right answers with items with matching difficulties. This situation exists for CTT only at the average test score (mean).

[The slope of the test characteristic curve is given as the inverse of the raw score error variance (3.24, red, Chart 88 - 89, and Table 46).]

Chart 91

Chart 91 applies the above thinking to real classroom data (Table 45c). This time the average score was not at 50% but at 81%. The lowest student score on Table 45c was 12 (57%).

In a lost reference, I have read that at the 50% point students do not know anything; it is all chance. I can see that for true-false. That could put CTT and IRT in conflict. A student must know something to earn a score of 50% when there are four options to each item. There is a free 25%. The student must supply the remaining 25%. Also few CTT tests are filled with items that have maximum discrimination and precision. A high quality CTT test can look very much like a high quality IRT test. The difference is that the IRT test item analysis takes more into the calculations than the CTT test when offered as forced-choice (a cheap way to rank students) or as with knowledge and judgment scoring (where students report what they actually know and find meaningful and useful; the basis for effective teaching).

Historically, test reliability was the chief marketing point of standardized tests. In the past decade the precision of individual student scores has replaced test reliability. IRT (CSEM) provides a more marketable product along with promoting the sale of equipment and related CAT services. Again psychometricians on the backside are continuing to support and lend credibility to the claims from the sales office on the front end.

Modernizing Standardized Test Scores

2015-03-11T03:00:00.000-07:00

#13

A single standardized right-count score (RCS) has little meaning beyond a ranking. A knowledge and judgment score (KJS) from the same set of questions not only tells us how much the student may know or can do but also the judgment to make use of that knowledge and skill. A student with an RCS must be told what he/she knows or can do. A student with a KJS tells the teacher or test maker what he/she knows. An RCS becomes a token in a federally sponsored political game. A KJS is a base onto which students build further learning and teachers build further instruction.

Table 40. RCS

Table 41. KJS

The previous two posts dealt with student ability during the test. This one looks at the score after the test. I developed four runs of the Visual Education Statistics Engine: Table 40. RCS, Table 41. KJS (simulated), and after maximizing item discrimination, Table 42. RCSmax, and Table 43. KJSmax.

Table 42. RCSmax

Table 43. KJSmax

Test reliability and the standard error of measurement (SEM) with some related statistics are gathered into Table 44. The reliability and SEM values are plotted on Chart 81 below.

Table 44

Students, on average, can reduce their wrong marks by about one half when they at first switch to knowledge and judgment scoring. The most obvious effect of changing 24 of 48 zeros to a value of 0.5 to simulate Knowledge and Judgment Scoring (KJS) was to reduce test reliability (0.36, red). Scoring both quantity and quality also increased the average test score from 64% to 73%.

Psychometricians do not like the reduction in test reliability. Standardized paper tests were marketed as “the higher the reliability the better the test”. Marketing has now moved to “the lower the standard error of measurement (SEM), the better the test”, using computers, CAT and online testing (green). The simulated KJS shows a better SEM (10%) in relation to 12% for RCS. By switching current emphasis from test reliability to precision (SEM) KJS now shows a slight advantage to test makers over RCS.

Chart 80

Chart 80 shows the general relationships between a right-count score and a KJS. This is Chart 4/4 from the previous post tipped on its side with the 60% passing performance replaced with the average scores of 64% RCS and 73% KJS. Again, KJS is not a giveaway. There is an increase in the score, if the student elects to use his/her judgment. There is also an increase in the ability to know what a student actually knows because the student is given the opportunity to report what is known, not just to mark an answer to every question (even before looking at the test).

Chart 81

Chart 81 expands Chart 80 using the statistics in Table 44. In general there is little difference between a right-count score and a KJS, statistically. What is different is what is known about the student; the full meaning of the score. Right-count scoring delivers a score on a test carefully crafted to deliver a desired on-average test score distribution and cut score. THE TEST IS DESIGNED TO PRODUCE THE DESIRED SCORE DISTRIBUTION. The KJS adds to this the ability to assess what students actually know and can do that is of value to them. The knowledge and judgment score assesses the complete student (quantity and quality).

Knowledge and Judgment Scoring requires appropriate implementation for the maximum effect on student development. In my experience, the switch from RCS must be voluntary to promote student development. It must result in a change in the level of thinking and related study habits where the student assumes responsibility for learning and reporting. At that time students feel comfortable changing scoring methods. They like the quality score. It reassures them that they really can learn and understand.

KJS no longer has a totally negative effect on current psychometrician attempts to sharpen their data reduction tools. But there are still the effects of tradition and project size. The NCLB movement demonstrated this; it failed in part because low performing schools mimicked the standardized tests rather than tending to teaching and learning. Their attempt to succeed was counterproductive. Doing more of the same does not produce different results. These schools could also be expected to mimic standardized tests offering KJS.

The current CCSS movement is based on the need for one test for all in an attempt to get valid comparisons between students, teachers, schools and states. The effect has been gigantic contracts that only a few companies have the capacity to bid on and little competition to modernize their test scoring.

KJS is then a supplement to RCS. It can be offered on standardized tests. As such, it updates the multiple-choice test to its maximum potential, IMHO. KJS can be implemented in the classroom, by testing companies and entrepreneurs who see the mismatch between instruction and assessment.

Knowledge Factor has already done this with their patented learning/assessment system, Amplifire. It can prepare students online for current standardized tests. Power Up Plus is free for paper classroom tests. (Please see the two preceding posts for more details related to student ability during the test).

Learning Assessment Responsibilities

2015-02-11T03:30:00.000-08:00

Students, teachers, and test makers each have responsibilities that contribute to the meaning of a multiple-choice test score. This post extracts the responsibilities from the four charts in the prior post, Meaningful Multiple-Choice Test Scores, that compares short answer, right-count traditional multiple-choice, and knowledge and judgment scoring (KJS) of both.

Testing looks simple: learn, test, and evaluate. Short answer, multiple-choice, or both with student judgment. Lower levels of thinking, higher levels of thinking, or both as needed. Student ability below, on level, or above grade level. There are many more variables for standardized test makers to worry about in a nearly impossible situation. By the time these have been sanitized from their standardized tests all that remains is a ranking on the test that is of little if any instructional value (unless student judgment is added to the scoring).

Chart 1/4 compares a short answer and a right-count traditional multiple-choice test. The teacher has the most responsibility for the test score when working with pupils at lower levels of thinking (60%). A high quality student functioning at higher levels of thinking could take the responsibility to report what is known or can be done in one pass and then just mark the remainder for the same score (60%). The teacher’s score is based on the subjective interpretation of the student’s work. The student’s score is based on a matching of the subjective interpretation of the test questions with test preparation. [The judgment needed to do this is not recorded in traditional multiple-choice scores.]

Chart 2/4 compares what students are told about multiple-choice tests and what actually takes place. Students are told the starting score is zero. One point is added for each right mark. Wrong or blank answers add nothing. There is no penalty. Mark an answer to every question. As a classroom test, this makes sense if the results are returned in a functional formative assessment environment. Teachers have the responsibility to sum several scores when ranking students for grades.

As a standardized test, the single score is very unfair. Test makers place great emphasis on the right-mark after-test score and the precision of their data reduction tools (for individual questions and for groups of students). They have a responsibility of pointing out that the student on either side of you has an unknowable, different, starting score from chance, let alone your luck on test day. The forced-choice test actually functions as a lottery. Lower scoring students are well aware of this and adjust their sense of responsibility accordingly (in the absence of a judgment or quality score to guide them).

Chart 3/4 compares student performance by quality. Only a student with a well-developed sense of responsibility, or a comparable innate ability, can be expected to function as a high quality, high scoring, student (100% but reported as 60%). A less self-motivated student, or one with less ability, can perform two passes at 100% and 80% to also yield 60%. The typical student, facing a multiple-choice test, will make one pass; marking every question as it comes to earn a quantity, quality, and test score of 60%; a rank of 60%. No one knows which right mark is a right answer.

Teachers and test makers have a responsibility to assess and report individual student quality on multiple-choice tests just as is done on short-answer, essay, project, research, and performance tests. These notes of encouragement and direction provide the same “feel good” effect found in a knowledge and judgment scored quality score when accompanied with a list of what was known or could be done (the right-marked questions).

Chart 4/4 shows knowledge and judgment scoring (KJS) with a five-option question made from a regular four-option question plus omit. Omit replaces “just marking”. A short answer question scored with KJS earns one point for judgment and +/-1 point for right or wrong. An essay question expecting four bits of information (short sentence, relationship, sketch, or chart) earns 4 points for judgment and +/-4 points for an acceptable or not acceptable report. (All fluff, filler, and snow are ignored. Students quickly learn to not waste time on these unless the test is scored at the lowest level of thinking by a “positive” scorer.)

Each student starts with the same multiple-choice score: 50%. Each student stops when each student has customized the test to that student’s preparation. This produces an accurate, honest and fair test score. The quality score provides judgment guidance for students at all levels. It is the best that I know of when operating with paper and pencil. Power Up Plus is a free example. Amplifire refines judgment into confidence using a computer, and now on the Internet. It is just easier to teach a high quality student who knows what he/she knows.

Most teachers I have met question the score of 60% from KJS. How can a student get a score of 60% and only mark 10% of the questions right? Easy. Sum 50% for perfect judgment, 10% for right answers, and NO wrong. Or sum 10% right, 10% right and 10% wrong, and omit 20%. If the student in the example chose to mark 10% right (a few well mastered facts) and then just marked the rest (had no idea how to answer) the resulting score falls below 40% (about 25% wrong). With no judgment, the two methods of scoring (smart and dumb) produce identical test scores. KJS is not a give-away. It is a simple, easy way to update currently used multiple-choice questions to produce an accurate, honest, and fair test score. KJS records what right-count traditional multiple-choice misses (judgment) and what the CCSS movement tries to promote.

Meaningful Multiple-Choice Test Scores

2015-01-14T03:00:00.000-08:00

The meaning of a multiple-choice test score is determined by several factors in the testing cycle including test creation, test instructions, and the shift from teacher to student being responsible for learning and reporting. Luck-on-test-day, in this discussion, is considered to have similar effects on the following scoring methods.

[Luck-on-test-day includes but is not limited to: test blueprint, question author, item calibration, test creator, teacher, curriculum, standards; classroom, home, and in between, environment; and a little bit of random chance (act of God that psychometricians need to smooth their data).]

Three ways of obtaining test scores are compared: open ended short answer, closed ended right-count four-option multiple-choice, and knowledge and judgment scoring (KJS) for both short answer and multiple-choice. These range from familiar manual scoring to what is now easily done with KJS computer software. Each method of scoring has a different starting score with a different meaning. The average customary classroom score of 75% is assumed (60% passing).

Chart 1/4

Open ended short answer scores start with zero and increase with each acceptable answer. There may be several acceptable answers for a single short answer question. The level of thinking required depends upon the stem of the question. There may be an acceptable answer for a question both at lower and at higher levels of thinking. These properties carry over into KJS below.

The teacher or test maker is responsible for scoring the test (Mastery = 60%; + Wrong = 0%; = 60% passing for quantity in Chart 1/4). The quality of the answers can be judged by the scorer and may influence which ones are considered right answers.

The open ended short answer question is flexible (multiple right answers) and with some subjectivity; different scorers are expected to produce similar scores. The average test score is controlled by selecting a set of items that is expected to yield an average test score of 75%. The student test score is a rank based on items included in the test to survey what students were expected to master, to group students who know from those who do not know each item, and items that fail to show mastery or discrimination (unfinished items for a host of reasons including luck-on-test-day above).

The open ended short answer question can also be scored as a multiple-choice item. First tabulate the answers. Sort the answers from high to low count. The most frequent answer, on a normal question, will be the right answer option. The next three ranking answers will be real student supplied wrong answer options (rather than test writer created wrong answer options). This pseudo-multiple-choice item can now be printed as a real question on your next multiple-choice test (with answers scrambled).
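The tabulate-and-sort step can be sketched in a few lines of Python. This is only an illustration: the function name `build_choice_item` and the sample responses are made up, and the most frequent answer still needs a teacher's check before it is keyed as right.

```python
from collections import Counter


def build_choice_item(responses, n_distractors=3):
    """Turn tabulated short-answer responses into a pseudo-multiple-choice item.

    The most frequent answer is taken as the keyed option and the next most
    frequent answers become student-supplied wrong options.
    """
    # Light normalization so "Mitosis" and " mitosis " count as one answer.
    counts = Counter(r.strip().lower() for r in responses if r.strip())
    ranked = [answer for answer, _ in counts.most_common(1 + n_distractors)]
    if not ranked:
        raise ValueError("no responses to tabulate")
    return {"key": ranked[0], "distractors": ranked[1:]}


# Hypothetical class responses to one short-answer question.
responses = (["mitosis"] * 12 + ["meiosis"] * 5 + ["osmosis"] * 3
             + ["fission"] * 2 + ["budding"])
item = build_choice_item(responses)
print(item)
```

The options come back ranked by count, so they still need to be scrambled before printing the question.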

A high quality student could also mark only right answers on the first pass using the above test (Chart 1/4) and then finish by just marking on the second pass to earn a score of 60%. A lower quality student could just mark each item in order, as is usually done on multiple-choice tests, mixing right and wrong marks, to earn the same score of 60%. Using only a score after the test we cannot see what is taking place during the test. Turning a short answer test into traditional multiple-choice hides student quality, the very thing that the CCSS movement is now promoting.

Chart 2/4

Closed ended right-count four-option multiple-choice scores start with zero and increase with each right mark. Not really!! This is only how this method of scoring has been marketed for a century by only considering a score based on right-counts after the test is completed. In the first place traditional multiple-choice is not multiple-choice, but forced-choice (it lacks one option discussed below). This injects a 25% bonus (on average) at the start of the test (Chart 2/4). This evil flaw in test design was countered, over 50 years ago, by a now defunct “formula scoring”. After forcing students to guess, psychometricians wanted to remove the effect of just marking! It took the SAT until March of this year, 2014, to drop this “score correction”.

[Since there was no way to tell which right answer must be changed for the correction, it made no sense to anyone other than psychometricians wanting to optimize their data reduction tools, with disregard for the effect of the correction on the students taking such a test. Now that 4-option questions have become popular on standardized tests, a student who can eliminate one option can guess from the remaining three for better odds on getting a right mark (which is not necessarily a right answer that reflects recall, understanding, or skill).]
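The post does not spell out the correction. The textbook form of formula scoring subtracts a fraction of the wrong marks from the right count, R - W/(k - 1) for k answer options. A minimal sketch, assuming that textbook form (the function name and the 40-question example are illustrative):

```python
def formula_score(right, wrong, options=4):
    """Classic correction for guessing: R - W / (k - 1).

    Omitted questions are neither rewarded nor penalized.
    """
    if options < 2:
        raise ValueError("need at least two answer options")
    return right - wrong / (options - 1)


# A student who blindly marks all 40 four-option questions expects
# 10 right and 30 wrong, so the corrected score is zero.
print(formula_score(right=10, wrong=30, options=4))  # 0.0
```

The correction works on totals only. It cannot say which of the right marks were lucky, which is the complaint made above.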

The closed ended right-count four-option multiple-choice question is inflexible (one right answer) and with no scoring subjectivity; all scorers yield the same count of right marks. Again, the average test score is controlled by selecting a set of items expected to yield 75% on-average (60% passing). However, this 75% is not the same as that for the open ended short answer test. As a forced-choice test, the multiple-choice test will be easier; it starts with a 25% on-average advantage. (That means one student may start with 15% and a classmate with 35%.) To further confound things, the level of thinking used by students can also vary. A forced-choice test can be marked entirely at lower levels of thinking.
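The size of the forced-choice head start, and how much it varies from student to student, can be estimated with a simple binomial model. The sketch below assumes pure blind marking on every item, and the test lengths are hypothetical:

```python
import math


def chance_baseline(n_items, options=4):
    """Mean and standard deviation of the percent score from blind marking."""
    p = 1.0 / options
    mean = p
    sd = math.sqrt(p * (1.0 - p) / n_items)
    return mean, sd


for n in (20, 50, 100):
    mean, sd = chance_baseline(n)
    print(f"{n:3d} items: mean {mean:.0%}, SD {sd:.1%}")
```

Shorter tests spread the lucky and unlucky starts further apart, so the same 25% on-average bonus is less even-handed on a short classroom quiz than on a long test.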

[Standardized tests control part of the above problems by eliminating almost all mastery and unfinished items. The game is to use the fewest items that will produce a desired score distribution with an acceptable reliability. A traditional multiple-choice scored standardized test score of 60% is a much more difficult accomplishment than the same score on a classroom test.]

A forced-choice test score is a rank of how well a student did on a test. It is not a report of what a student actually knows or can do that will serve as the basis for further instruction and learning. The reasoning is rather simple: the forced-choice score is counted up AFTER the test is finished; this is the final game score. How the game started (25% on-average) and was played is not observed (but this is what sports fans pay for). This is what students and teachers need to know so students can take responsibility for self-corrective learning.

Chart 3/4

[Three student performances that all end up with a traditional multiple-choice score of 60% are shown in Chart 3/4. The highest quality student used two passes, “I know or can do this or I can eliminate all the wrong options” and “I don’t have a clue”. The next lower quality student used three passes, “I know or can do this”; “I can eliminate one or more answer options before marking” and “I am just marking.” The lowest level of thinking student just marks answers in one pass, right and wrong, as most low quality, lower level of thinking students do. But what takes place during the test is not seen in the score made after the test. The lowest quality student must review all past work (if tests are cumulative) or continue on with an additional burden as a low quality student. A high quality student needs only to check on what has not been learned.]

Chart 4/4

Knowledge and Judgment scores start at 50% for every student plus one point for acceptable and minus one point for not acceptable (right/wrong on traditional multiple-choice). (Lower level of thinking students prefer: Wrong = 0, Omit = 1, and Right = 2) Omitting an answer is good judgment to report what has yet to be learned or to be done (understood). Omitting keeps the one point for good judgment. An unacceptable or wrong mark is poor judgment. You lose one point for bad judgment.
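The point scheme above (Wrong = 0, Omit = 1, Right = 2) can be sketched as a small scoring function. The normalization is an assumption: the test score is taken as points earned over the two points available per question, which is the same as starting at 50% and moving up or down with each right or wrong mark. The function name `kjs_scores` is illustrative and is not taken from Power Up Plus.

```python
def kjs_scores(right, wrong, omit):
    """Return (quantity, quality, test score) as fractions of 1.

    Points per question: Wrong = 0, Omit = 1, Right = 2.
    """
    n = right + wrong + omit
    if n == 0:
        raise ValueError("empty test")
    marked = right + wrong
    quantity = right / n                        # share of all questions marked right
    quality = right / marked if marked else None  # undefined when nothing is marked
    test_score = (2 * right + omit) / (2 * n)   # same as 0.5 + (right - wrong) / (2 * n)
    return quantity, quality, test_score


# Marks half the test, all right, and omits the rest.
print(kjs_scores(right=50, wrong=0, omit=50))  # (0.5, 1.0, 0.75)
# Marks every question, 75% right: no judgment is exercised.
print(kjs_scores(right=75, wrong=25, omit=0))  # (0.75, 0.75, 0.75)
```

When nothing is omitted the test score equals the right-count score, so the two scoring methods agree exactly when no judgment is exercised.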

Now what is hidden with forced-choice scoring is visible with Knowledge and Judgment Scoring (KJS). Each student can show how the game is played. There is a separate student score for quantity and for quality. A starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge Factor sets the starting score near 75%. Judgment is far more important than knowledge in high risk occupations.]

KJS includes a fifth answer option: omit (good judgment to report what has yet to be learned or understood). When this option is not used, the test reverts to forced-choice scoring (marking one of the four answer options for every question).

A high quality student marked 10 right out of 10 marked and then omitted the remainder (in two passes through the test) or managed to do a few of one right and one wrong (three passes) for a passing score of 60% in Chart 4/4. A student of less quality did not omit but just marked for a score of less than 50%. A lower level of thinking, low quality student marked 10 right and just marked the rest (two passes) for a score of less than 40%. KJS yields a score based on student judgment (60%) or on the lack of that judgment (less than 50%).

In summary, the current assessment fad is still oriented on right marks rather than on student judgment (and development). Students with a practiced good judgment develop the sense of responsibility needed to learn at all levels of thinking. They do not have to wait for the teacher to tell them they are right. Learning is stimulated and exhilarating. It is fun to learn when you can question, get answers, and verify a right answer or a new level of understanding; when you can build on your own trusted foundation.

Low quality students learn by repeating the teacher. High quality students learn by making sense of an assignment. Traditional multiple-choice (TMC) assesses and rewards lower-levels-of-thinking. KJS assesses and rewards all-levels-of-thinking. TMC requires little sense of responsibility. KJS rewards (encourages) the sense of responsibility needed to learn at all levels of thinking.

1. A short answer, hand scored, test score is an indicator of student ability and class ranking based on the scorer’s judgment. The scorer can make a subjective estimate of student quality.

2. A TMC score is only a rank on a completed test with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.

3. A KJS test score is based on a student, self-reporting, estimate of what the student knows and can do on a completed test (quantity) and an estimate of the student’s ability to make use of that knowledge (judgment) during the test (quality). The score has student judgment and quality, not scorer judgment and quality.

In short, students who know that they can learn (get rapid feedback on quantity and quality), and who want to learn, enjoy learning (see Amplifire below). All testing methods fail to promote these student development characteristics unless the test results are meaningful, easy to use by students and teachers, and timely. Student development requires student performance, not just talking about it or labeling something formative assessment.

Power Up Plus (PUP or PowerUP) scores both TMC and KJS. Students have the option of selecting the method of scoring they are comfortable with. Such standardized tests have the ability to estimate the level of thinking used in the classroom and by each student. Lack of information, misinformation, misconceptions and cheating can be detected by school, teacher, classroom, and student.

Power Up Plus is hosted at TeachersPayTeachers to share what was learned in a nine year period with 3000 students at NWMSU. The free download below supports individual teachers who want to upgrade their multiple-choice tests for formative, cumulative, and exit ticket assessment. Good teachers, working within the bounds of accepted standards, do not need to rely on expensive assessments. They (and their students) do need fast, easy to use, test results to develop successful high quality students.

I hope your students respond with the same positive enthusiasm that over 90% of mine did. We need to assess students to promote their abilities. We do not need to primarily assess students to promote the development of psychometric tools that yield far less than what is marketed.

A Brief History:

Geoff Masters (1950- ) A modification of traditional multiple-choice test performance.

Created partial credit scoring for the Rasch model (1982) as a scoring refinement for traditional right-count multiple-choice. It gives partial credit for near right marks. It does not change the meaning of the right-count score (as quantity and quality have the same value by default [both wrong marks and blanks are counted as zeros], only quantity is scored). The routine is free in Ministep software.

Richard A. Hart (1930- ) Promotes student development by student self-assessment of what each student actually knows and can do, AFTER learning, with “next class period” feedback.

Knowledge and Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to reduce the time needed for students to write, and for me to score, short answer and essay questions. I created software (1981) to score multiple-choice, both right-count, and knowledge and judgment, to encourage students to take responsibility for what they were learning at all levels of thinking in any subject area. Students voted to give knowledge and judgment equal value. The right-count score retains the same meaning (quantity of right marks) as above. The knowledge and judgment score is a composite of the judgment score (quality, the “feel good” score AFTER learning) and the right-count score (quantity). Power Up Plus (2006) is classroom friendly (for students and teachers) and a free download: Smarter Test Scoring and Item Analysis.

Knowledge Factor (1995- ) Promotes student learning and retention by assessing student knowledge and confidence, DURING learning, with “instant” feedback to develop “feeling good” during learning.

Knowledge Factor was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment program, Amplifire, places much more weight on confidence than on knowledge (a wrong mark may reduce the score by three times as much as a right mark adds). The software leads students through the steps needed to learn easily, quickly and to a depth that is easily retained for more than a year. Students do not have to master the study skills and the sense of responsibility needed to learn at all levels of thinking that are needed for mastery with KJS. Amplifire is student friendly, online, and so very commercially successful in developed topics that it is not free.

[Judgment and confidence are not the same thing. Judgment is measured by performance (percent of right marks), AFTER learning, at any level of student score. Confidence is a good feeling that Amplifire skillfully uses to promote rapid learning, DURING learning and self-assessment, into a mastery level. Students can take confidence in their practiced and applied self-judgment. The KJS and Amplifire test scores reflect the complete student. IMHO standardized tests should do this also, considering their cost in time and money.]

Information Functions - Adding Unbalanced Items

2014-12-10T03:00:00.000-08:00

Adding 22 balanced items to Table 33 of 21 items, in the prior post, resulted in a similar average test score (Table 36) and the same item information functions (the added items were duplicates of those in the first Nurse124 data set of 21 items). What happens if an unbalanced set of 6 items is added? I just deleted the 16 high scoring additions from Table 36. Both balanced additions (Table 36) and unbalanced additions (Table 39) had the same extended range of item difficulties (5 to 21 right marks, or 23% to 95% difficulty).

Table 33

Table 36

Table 39

Adding a balanced set of items to the Nurse124 data set kept the average score the same: 80% and 79% (Table 36). Adding a set of more difficult items to the Nurse124 data decreased the average score to 70% (Table 39) and decreased student scores. Traditionally, a student’s overall score is then the average of the three test scores: 80%, 79% and 70% or 76% for an average student (Tables 33, 36, and 39). An estimate of a student’s “ability” is thus directly dependent upon his test scores which are dependent upon the difficulty of the items on each test. This score is accepted as a best estimate of the student’s true score. This value is a best guess of future test scores. This makes common sense, that past is a predictor of future performance.

[Again a distinction must be made between what is being measured by right mark scoring (0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a test the student may not be able to read or understand. The other also indicates the quality of each student’s knowledge; the ability to make meaningful use of knowledge and skills. Both methods of analysis can use the exact same tests. I continue to wonder why people are still paying full price but harvesting only a portion of the results.]

The Rasch model IRT takes a very different route to “ability”. The very same student mark data sets can be used. Expected IRT student scores are based on the probability that half of all students with a given ability location will correctly mark a question with a comparable difficulty location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.) [The location starts from the natural log of a ratio of right/wrong score and wrong/right difficulty. A convergence of score and difficulty yields the final location. The 50% test score becomes the zero logit location, the only point right mark scoring and IRT scores are in full agreement.]
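The two ingredients named above, a log-ratio starting location and a probability driven by the gap between ability and difficulty, can be sketched as follows. This shows only the starting values and the Rasch probability formula, not the convergence procedure that Winsteps runs, and the counts are hypothetical.

```python
import math


def logit(right, wrong):
    """Natural log of the right/wrong ratio (a starting location, not a final estimate)."""
    return math.log(right / wrong)


def rasch_probability(ability, difficulty):
    """Probability of a right mark for ability and difficulty on one logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical counts: a student with 17 right and 4 wrong on 21 items,
# and an item that 18 of 22 students marked right (4 wrong).
ability = logit(17, 4)          # student: ln(right / wrong)
difficulty = math.log(4 / 18)   # item: ln(wrong / right)
print(round(rasch_probability(ability, difficulty), 2))
print(rasch_probability(0.0, 0.0))  # matching locations give 0.5
```

A student and an item at the same location give a probability of exactly one half, which is the 50% point the paragraph refers to.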

The Rasch model IRT converts student scores and item difficulties [in the marginal cells of student data] into the probabilities of a right answer (Table 33b). [The probabilities replace the marks in the central cell field of student data.] It also yields raw student scores, and their conditional standard error of measurements (CSEM)s (Table 33c, 34c, and 39c) based on the probabilities of a right answer rather than the count of right marks. (For more see my Rasch Model Audit blog.)

Student ability becomes fixed and separated from the student test score; a student with a given ability can obtain a range of scores on future tests without affecting his ability location. A calibrated item can yield a range of difficulties on future tests without affecting its difficulty calibrated location. This makes sense only in relation to the trust you can have in the person interpreting IRT results; that person’s skill, knowledge, and (most important) experience at all levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well, or do not “look right”, can be eliminated from the data set. Also the same data set (Table 33, Table 36, and Table 39) can be treated differently if it is classified as field test, operational test, benchmark test, or current test.

At this point states recalibrated and creatively equilibrated test results to optimize federal dollars during the NCLB era by showing gradual continuing improvement. It is time to end the ranking of students by right mark scoring (0,1 scoring) and include KJS, or PCM (0,1,2 scoring) [that about every state education department has: Winsteps], so that standardized testing yields the results needed to guide student development: the main goal of the CCSS movement.

The need to equilibrate a test is an admission of failure. The practice has become “normal” because failure is so common. It opened the door to cheating at state and national levels. [To my knowledge no one has been charged and convicted of a crime for this cheating.] Current computer adaptive testing (CAT) hovers about the 50% level of difficulty. This optimizes psychometric tools. Having a disinterested party outside of the educational community doing the assessment analysis and online CAT reduce the opportunity to cheat. They do not IMHO optimize the usefulness of the test results. End-of-course tests are now molding standardized testing into an instrument to evaluate teacher effectiveness rather than assess student knowledge and judgment (student development).

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Information Functions - Adding Balanced Items

2014-11-12T03:30:00.000-08:00

I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information functions (IIF). This post makes use of that observation to improve the Nurse124 data set that generated the set of IIFs in Chart 75.

I observed that Tables 33 and 34, in the prior post, contained no items with difficulties below 45%. The item information functions (IIF) were also skewed (Chart 75). This is not the symmetrical display associated with the Rasch IRT model. I reasoned that adding a balanced set of items would increase the number of IIFs without changing the average item difficulty.

Table 36a shows the addition of a balanced set of 22 items to the Nurse124 data set of 21 items. As each lower ranking item was added, one or more high ranking items were added to keep the average test score near 80%. This table added six lower ranking items and 16 higher scoring items resulting in an average score of 79% and 43 items total.

Table 36

The average item difficulty for the Nurse124 data set was 17.57 and the expanded set was 17.28. The average test score of 80% came in as 79%. Student scores (ability) also remained about the same. [I did not take the time to tweak the additions for a better fit.] Both item difficulty and student score (ability) remained about the same.

The conditional standard error of measurement (CSEM) did change with the addition of more items (Chart 79 below). The number of cells containing information expanded from 99 to 204 cells. The average right count student score increased from 17 to 34.

Table 36c shows the resulting item information functions (IIF). The original set of 11 IIFs now contains 17 IIFs (orange). The original set of 9 different student scores now contains 12 different scores, however the range of student scores is comparable between the two sets. This makes sense as the average test scores are similar and the student scores are also about the same.

Table 37

Chart 77

Chart 77 (Table 37) shows the 17 IIFs as they spread across the student ability range of 12 rankings (student score right count/% right). The trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near the average test score of 79%. This was expected as the maximum information value within an IIF occurs when the item difficulty and student ability score match. [The three bottom traces on Chart 77 (blue, red, and green) have been colored in Table 37 as an aid in relating the table and chart (rotate Table 37 counter-clockwise 90 degrees).]

Even more important is the way the traces are increasingly skewed the further the IIFs are away from this maximum, 11/50%, trace (blue square, Chart 77). Also the IIF with a difficulty of 18/82%, near the average test score, produced the identical total information (1.41) from both the Nurse124 and the supplemented data sets. But these values also drifted apart for the two data sets for IIFs of higher and lower difficulty.

Two IIFs near the 50% difficulty point delivered the maximum information (2.17). Here again is evidence that prompts psychometricians to work closely to the 50% or zero logit point to optimize their tools when working on low quality data (limiting scoring only to right counts rather than also offering students the option to assess their judgment to report what is actually meaningful and useful; to assess their development toward being a successful, independent, high quality achiever). [Students that only need some guidance rather than endless “re-teaching”; that, for the most part, consider right count standardized tests a joke and a waste of time.]

Chart 78

Table 38

The test information function for the supplemented data set is the sum of the information in all 17 item information functions (Table 38 and Chart 78). It took 16 easy items to balance 6 difficult items. The result was a marked increase in precision at the student score levels between 30/70% and 32/74%. [More at Rasch Model Audit blog.]

Chart 79

Chart 79 summarizes the relationships between the Nurse124 data, the supplemented data (adding a balanced set of items that keeps student ability and item difficulty unchanged), and the CTT and IRT data reduction methods. The IRT logit values (green) were plotted directly and inverted (1/CSEM) for comparison. In general, both CTT (blue) and IRT inverted (red) produced comparable CSEM values.

Adding 22 items increased the CTT Test SEM from 1.75 to 2.54. The standard deviation (SD) between student test scores increased from 2.07 to 4.46. Relative to the SD, the SEM is 1.75/2.07 and 2.54/4.46, or 84% and 57%: a difference of 27 points, or an improvement in precision of 27/84 or 32%.

Chart 79 also makes it very obvious that the higher the student test score the lower the CTT CSEM, the more precise the student score measurement, the less error. That makes sense.

The above statement about a CTT CSEM must be related to a second statement that the more item information, the greater the precision of measurement by the item at this student score rank. The first statement harvests variance from the central cell field from within rows of student (right) marks (Table 36a) and from rows of probabilities (of right marks) in Table 36c.

The binomial variance CTT CSEM view is then comparable to the reciprocal or inverted (1/CSEM) view of the test information function CSEM view (Chart 79). CTT (blue, CTT Nurse124, Chart 79) and IRT inverted (red, IRT N124 Inverted) produced similar results even with an average test score of 79% that is 29 percentage points away from the 50%, zero logit, IRT optimum performance point.
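One common binomial form of the CTT CSEM is Lord's formula, sqrt(x(n - x)/(n - 1)) for x right marks on n items. Whether the VESEngine uses exactly this form is an assumption. The sketch only illustrates why the CSEM shrinks as scores rise above the middle of the scale.

```python
import math


def binomial_csem(right_count, n_items):
    """Lord's binomial CSEM in raw-score units: sqrt(x * (n - x) / (n - 1))."""
    x = right_count
    return math.sqrt(x * (n_items - x) / (n_items - 1))


n_items = 21  # length of the original Nurse124 test as described in these posts
for x in (11, 16, 19, 21):
    print(x, round(binomial_csem(x, n_items), 2))
```

The value is largest near half right and falls to zero at a perfect score, which matches the trend described for the CTT traces in Chart 79.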

The second statement harvests variance, item information functions, in Table 36c from columns of probabilities (of right marks). Layering one IIF on top of another across the student score distribution yields the test information function (Chart 78).

The Rasch IRT model harvests the variance from rows and from columns of probabilities of getting a right answer that were generated from the marginal student scores and item difficulties. CTT harvests from the variance of the marks students actually made. Yet, at the count only right mark level, they deliver very similar results, with the exception of the IIF from IRT analysis, which the CTT analysis does not do.

Customizing Test Precision - Information Functions

2014-10-08T03:00:00.000-07:00

(Continued from the prior two posts.)

The past two posts have established that there is little difference between classical test theory (CTT) and item response theory (IRT) in respect to test reliability and conditional standard error of measurement (CSEM) estimates (other than the change in scales). IRT now is the analysis of choice for standardized tests. The Rasch model IRT is the easiest to use and also works well with small data sets including classroom tests. How two normal scales for student scores and item difficulties are combined onto one IRT logit scale is no longer a concern to me, other than the same method must be used throughout the duration of an assessment program.

Table 33

What is new and different from CTT is an additional insight from the IRT data in Table 32c (information p*q values). I copied Table 32 into Table 33 with some editing. I colored the cells holding the maximum amount of information (0.25) yellow in Table 33c. This color was then carried back to Table 33a, Right and Wrong Marks. [Item Information is related to the marginal cells in Table 33a (as probabilities), and not to the central cell field (as mark counts).] The eleven item information functions (in columns) were re-tabled into Table 34 and graphed in Chart 75. [Adding the information in rows yields the student score CSEM in Table 33c.]

Table 34

Chart 75

The Nurse124 data yielded an average test score of 16.8 marks or 80%. This skewed the item information functions away from the 50% or zero logit difficulty point (Chart 75). The more difficult the item, the more information it developed: from 0.49 for the item with a 95% right count up to a maximum of 1.87 for the items at 54% and 45% right count. [No item on the test had a difficulty of 50%.]

Table 35

Chart 76

The sum of information (59.96) by item difficulty level and student score level is tabled in Table 35 and plotted as the test information function in Chart 76. This test does not do a precise job of assessing student ability. The test was most precise (19.32) at the 16 right count/76% right location. [Location can be designated by measure (logit), input raw score (red) or output expected score (Table 33b).]

The item with an 18 right count/82% right difficulty (Table 35) did not contribute the most information individually but did as a group of three items (9.17). The three highest scoring, easiest, items (counts of 19, 20, and 21) are just too easy for a standardized test but may be important survey items needed to verify knowledge and skills for this class of high performing students. None of these three items reached an information level maximum of 1/4. [It now becomes apparent how items can be selected to produce a desired test information function.]

The more information available is interpreted as greater precision or less error (smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q)) on Table 33c. p*q is at a maximum when p = q; when right = wrong: (RT x WG)/(RT + WG)^2 or (3 x 3)/36 = 1/4].
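The bracketed arithmetic can be checked with a short sketch. The function names and the row of probabilities are illustrative. The probabilities are not copied from Table 33c.

```python
import math


def information(right, wrong):
    """p * q written with counts: (RT x WG) / (RT + WG)^2."""
    total = right + wrong
    return (right * wrong) / total ** 2


def csem_from_information(p_values):
    """CSEM = 1 / sqrt(sum(p * q)) for one row of right-mark probabilities."""
    total_info = sum(p * (1.0 - p) for p in p_values)
    return 1.0 / math.sqrt(total_info)


print(information(3, 3))  # 0.25, the maximum, reached when right = wrong
# Hypothetical row of probabilities for one student score.
print(round(csem_from_information([0.5, 0.6, 0.7, 0.8, 0.9]), 2))
```

Each probability that moves away from 0.5 adds less information, so a row of very easy items gives a larger CSEM than the same number of well-matched items.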

Each item information function spans the range of student scores on the test (Chart 76). Each item information function measures student ability most precisely near the point that item difficulty and student ability match (50% right) along the IRT S-curve. [The more difficult an item, the more ability students must have to mark correctly 50% of the time. Student ability is the number correct on the S-curve. Item difficulty is the number wrong on the S-curve (see more at Rasch Model Audit).]

Extracting item information functions from a data table provides a powerful tool (a test information function) for psychometricians to customize a test (page 127, Maryland 2010). A test can be adjusted for maximum precision (minimum CSEM) at a desired cut point.

The bright side of this is that the concept of “information” (not applicable to CTT), and the ability to put student ability and item difficulty on one scale, gives psychometricians powerful tools. The dark side is that the form in which the test data is obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the NCLB era, as psychometrics has made marked improvements, the student mark data it is being supplied has remained in the casino arena: Mark an answer to each question (even if you cannot read or understand the question), do not guess, and hope for good luck on test day.

The concepts of information, item discrimination and CAT all demand values hovering about the 50% point for peak psychometric performance. Standardized testing has migrated away from letting students report what they know and can do to a lottery that compares their performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day.

The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the lower the number of students who fall below that point on the score distribution but might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without even showing the actual test questions or the test blueprint.]

It does not have to be that way. Knowledge and Judgment Scoring (classroom friendly) and the partial credit Rasch model (that is included in the software states use) can both update traditional multiple-choice to the levels of thinking required by the common core state standards (CCSS) movement. We need an accurate, honest and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT.

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

Conditional Standard Error of Measurement - Precision

2014-09-10T03:00:00.000-07:00

(Continued from prior post.)

Table 32a contains two estimates (red) of the test standard error of measurement (SEM) that are in full agreement. One estimate, 1.75, is the average of the conditional standard errors of measurement (CSEM, green) for each student raw score. The traditional estimate, 1.74, uses the traditional test reliability, KR20. No problem here.

The third estimate of the test SEM in Table 32c is different. It is based on CSEM values expressed in logits (natural logs, base e = 2.718) rather than on the normal scale. The values are also inverted in relation to the traditional values in Table 32 (Chart 74). There is a small but important difference. The IRT CSEM values are much more linear than the CTT CSEM values. Also the center of this plot is the mean of the number of items (Chart 30, prior post), not the mean of the item difficulties or student scores. [Most of this chart was calculated, as most of these relationships do not require actual data to be charted. Only nine score levels came from the Nurse124 data.]

Chart 74 shows the binomial CSEM values for CTT (normal) and IRT (logit) values obtained by inverting the CTT values: “SEM(Rasch Measure in logits) = 1/(SEM(Raw Score))”, 2007. I then adjusted each of these so the corresponding curves, on the same scale, crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The extreme values for no right and all right were not included. CSEM values for extreme values go to zero or to infinity with the following result:

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm

Precision is then not a constant across the range of student scores for either method of analysis. The test SEM of 0.64 logits is comparable to 1.74 counts on the normal scale.

The estimate of precision, CSEM, serves three different purposes. For CTT and IRT it narrows down the range in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values estimates the test SEM as 1.75 counts out of a range of 21 items. This is less than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and useful.

For IRT analysis, the CSEM indicates the degree that the data fit the perfect Rasch model (3). A better fit also results in more believable and useful results.

“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation of an imagined error distribution representing the possible distribution of observed values around their “true” theoretical value. This precision is based on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the measures or estimates correspond to a reference standard outside the data, in this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm

Precision also has some very practical limitations when delivering tests by computer adaptive testing (CAT). Linacre, 2006, has prepared two very neat tables showing the number of items that must be on a test to obtain a desired degree of precision expressed in logits and in confidence limits. The closer the test “targets” an average score of 50%, the fewer items needed for a desired precision.

The two top students, with the same score of 20, missed items with different difficulties. They both yield the same CSEM. The CSEM ignores the pattern of marks and the difficulty of items. A CSEM value obtained in this manner is related only to the raw score. Absolute values for the CSEM are sensitive to item difficulty (Table 23a and 23b).

The precision of a cut score has received increasing attention during the NCLB era. In part, court actions have made the work of psychometricians more transparent. The technical report for a standardized test can now exceed 100 pages. There has been a shift of emphasis from test SEM, to individual score CSEM, to IRT information as an explanation of test precision.

“(Note that the test information function and the raw score error variance at a given level of proficiency [student score], are analogous for the Rasch model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The conditional standard error of measurement is the inverse of the information function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.

I cannot end this without repeating that this discussion of precision is based on traditional multiple-choice (TMC) that only ranks students, a casino operation. Students are not given the opportunity to include their judgment of what they know or can do that is of value to themselves, and their teachers, in future learning and instruction, as is done with essays, problem solving, and projects. This is easily done with knowledge and judgment scoring (KJS) of multiple-choice tests.

(Continued)

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.

The Best of the Blog - FREE

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Test Score Reliability - TMC and IRT

2014-08-13T03:00:00.000-07:00

The main purpose of this post is to investigate the similarities between traditional multiple-choice (TMC), or classical test theory (CTT), and item response theory (IRT). The discussion is based on TMC and IRT as the math is simpler than when using knowledge and judgment scoring (KJS) and the IRT partial credit model (PCM). The difference is that TMC and IRT input marks at the lowest levels of thinking, resulting in a traditional ranking. KJS and PCM input the same marks at all levels of thinking, resulting in a ranking plus a quality indication of what a student actually knows and understands that is of value to that student (and teacher) in further instruction and learning.

I applied the instructions in the Winsteps Manual, page 576, for checking out the Winsteps reliability estimate computation, to the Nursing124 data used in the past several posts (22 students and 21 items). Table 32 is a busy table that is discussed in the next several posts. The two estimates for test reliability (0.29 and 0.28, orange) are identical based on TMC and IRT (considering rounding errors).

Table 32a shows the TMC test reliability estimated from the ratio of true variance to total variance. The total variance between scores, 4.08, minus the error variance within items, 2.95, yields the true variance, 1.13. The KR20 then completes the reliability calculation to yield 0.29 using normal values.

For an IRT estimate of test reliability, the values on a normal scale are converted to the logit scale (ln ratio w/r). In this case, the mean of the item difficulty logits, ln ratio w/r, was -1.62 (Table 32b). This value is subtracted from each item difficulty logit value to shift the mean of the item distribution to the zero logit point (Rasch Adjust, Table 32b). Winsteps then optimizes the fit of the data (blue) to the perfect Rasch Model. Now comparable student ability and item difficulty values are in register at the same locations on a single logit scale. The 50% point on the normal scale is now at the zero location for both student ability and item difficulty.
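The conversion and centering step can be sketched as follows. This is only an illustration of the ln ratio w/r and Rasch Adjust arithmetic; the right counts are hypothetical, the function names are mine, and the Winsteps fitting step that follows is not reproduced.

```python
import math

def item_logit(right, wrong):
    """Item difficulty in logits: the natural log of the wrong/right ratio."""
    return math.log(wrong / right)

def rasch_adjust(logits):
    """Shift the logits so that their mean sits at the zero logit point."""
    mean_logit = sum(logits) / len(logits)
    return [x - mean_logit for x in logits]

# Hypothetical right counts for five items marked by 22 students.
n_students = 22
right_counts = [20, 18, 16, 12, 11]

raw = [item_logit(r, n_students - r) for r in right_counts]
centered = rasch_adjust(raw)
```

The shift moves every item by the same amount, so the spacing between items is unchanged; only the location of zero moves.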

The probability for each right mark (expected score) in the central cells is the product of the respective marginal cells (blue) for item difficulty (Winsteps Table 13.1) and student ability (Winsteps Table 17.1). The sum of these probabilities (Table 32b, pink) is identical to the normal Score Mean (Table 32a, pink).

The “information” in each central cell, in Table 32c, was obtained by p*q or p * (1 - p) from Table 32b. Adding up the internal cells for each score yields the sum of information for that score.

The next column shows the square root of the sum of information. This value inverted yields the conditional standard error of measurement (CSEM). The conditional variance (CVar) within each student ability measure is then obtained by reversing the equation for normal values in Table 32a: The CVar is obtained as the square of the CSEM instead of the CSEM being obtained as the square root of the CVar. The average of these values is the test model error variance (EV) in measures: 0.43.

The observed variance (OV) between measures is estimated in the exact same way as is done for normal scores: the variance between measures from Excel =VAR.P (0.61) or the square of the SD: 0.78 squared = 0.61.

The test reliability in measures, (OV – EV)/OV = (0.61 – 0.45)/0.61 = 0.28, is then obtained from the same equation used for normal values: (total variance – error variance)/total variance = (4.08 – 2.96)/4.08 = 0.29, in Table 32a. Normal and measure dimensions for the same value differ, but ratios do not, as a ratio has no dimension. TMC and IRT produced the same values for test reliability, as will KJS and the PCM.
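The chain from information to reliability in measures can be sketched as below. It assumes Rasch probabilities computed directly from ability and difficulty values; the abilities and difficulties are hypothetical, not the Winsteps output behind Table 32, and the helper names are mine.

```python
import math
from statistics import mean, pvariance

def rasch_p(ability, difficulty):
    """Expected score (probability of a right mark) in one central cell."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def csem(ability, difficulties):
    """Invert the square root of the summed information (p * q) for one measure."""
    info = sum(rasch_p(ability, d) * (1.0 - rasch_p(ability, d)) for d in difficulties)
    return 1.0 / math.sqrt(info)

def reliability(ov, ev):
    """(OV - EV) / OV: the same ratio used for normal values."""
    return (ov - ev) / ov

# Hypothetical student measures and item difficulties in logits.
abilities = [-0.5, 0.2, 0.9, 1.4, 2.1, 2.8]
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

# CVar is the square of the CSEM; EV is the average CVar across students.
ev = mean(csem(a, difficulties) ** 2 for a in abilities)
# OV is the variance between measures (Excel VAR.P).
ov = pvariance(abilities)
rel = reliability(ov, ev)
```

Because the result is a ratio of two variances on the same scale, it carries no unit, whether the inputs are counts or logits.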

(Continued)

- - - - - - - - - - - - - - - - - - - - -

Small Sample Math Model - SEMs

2014-07-09T03:00:00.000-07:00

The test standard error of measurement (SEM) can be calculated in two ways: The traditional way is by relating the variance between student scores and within item difficulties; between an external column and the internal cell columns.

The second way harvests the variance conditioned on each student score and then averages the CSEM (SQRT(conditional student score error variance)) values for the test. The first method links two properties: student ability and item difficulty. The second only uses one property: student ability.

I set up a model with 12 students and 11 items (see previous post and Table26.xlsm below). Extreme values of zero and 100% were excluded. Four samples with average test scores of 5, 6, 7 (Table 29), and 8 were created with the standard deviation (1.83) and the variance within item difficulties (1.83) held constant. This allowed the SEM to vary between methods.

The calculation of the test SEM (1.36) by way of reliability (KR20) is reviewed on the top level of Chart 73. The test SEM remained the same for all four tests.

My first calculation of the test SEM by way of conditional standard error of measurement (CSEM) began with the deviation of each mark from the student score (Table 29 center). I squared the deviations and summed to get the conditional variance for each score. The individual student CSEM is given as the square root of the conditional variance (the conditional SD). The test SEM (1.48) is then the average of the student CSEM values.

[My second calculation was based on the binomial standard error of measurement given in Crocker, Linda, and James Algina, 1986, Introduction to Classical & Modern Test Theory, Wadsworth Group, pages 124-127.

By including the “correction for obtaining unbiased estimates of population variance”, n/(n – 1), the SEM value increased from 1.48 to 1.55 (Table 29). This is a perfect match to the binomial SEM.]

The two SEMs are then based on different sample sizes and different assumptions. The traditional SEM (1.36) is based on the raggedly distributed small sample size in hand. The binomial SEM (1.55) assumes a perfectly normally distributed large theoretical population.

[Variance calculations (variance is additive):

Test variance: Score deviations from the test mean (as counts), squared, and summed = a sum of squares (SS). SS/N = MSS or variance: 3.33. {Test SD = SQRT(Var) = 1.83. Test SEM = 1.36.}
Conditional error variance: Deviations from the student score (as a percent), squared, and summed = the conditional error variance (CVar) for that student score. {Test SEM = Average SQRT(CVar) = 1.48 (n) and 1.55 (n-1)}
Conditional error variance: Variance within the score row (Excel, VAR) x (n or n - 1) = the CVar for that student score. {Test SEM: VAR.P = 1.48 and VAR.S = 1.55.}]
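The row-based conditional variance in the notes above can be sketched with Python's `statistics` module, where `pvariance` plays the role of Excel VAR.P and `variance` the role of VAR.S. The sketch assumes the row variance is multiplied by the number of items in both cases, which reproduces the binomial form X(n – X)/(n – 1) for VAR.S. The mark pattern and function names are hypothetical.

```python
from math import sqrt
from statistics import pvariance, variance

def cvar_from_row(marks, unbiased=False):
    """Conditional error variance for one student mark pattern (1 = right, 0 = wrong).

    Multiplies the variance within the row by the number of items, using
    VAR.P (divide by n) or VAR.S (divide by n - 1).
    """
    n = len(marks)
    within = variance(marks) if unbiased else pvariance(marks)
    return n * within

def csem_from_row(marks, unbiased=False):
    """CSEM for one student score: the square root of the conditional variance."""
    return sqrt(cvar_from_row(marks, unbiased))

# Hypothetical mark pattern: 7 right out of 11 items.
row = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

cvar_n = cvar_from_row(row)                      # 11 * p * q = 28/11
cvar_n_minus_1 = cvar_from_row(row, True)        # 7 * 4 / 10 = 2.8
```

Sorting or shuffling the row leaves both values unchanged, since only the count of right marks enters the calculation.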

Squaring values produces curved distributions (Chart 73). The curves represent the possible values. They do not represent the number of items or student scores having those values.

The True MSS = Total MSS – Error MSS = 3.33 -1.83 = 1.50, involves subtracting a convex distribution centered on the average test score from a concave distribution centered on the maximum value of 0.25 (not on the average item difficulty).

The student score MSS is at a maximum when the item error SS is at a minimum. The error MSS is at a maximum (0.25) when the student score MSS is at a minimum (0.00). This makes sense. Such an item is perfectly aligned with the student score distribution at the point where scores do not differ from the average test score.

The KR20 is then based on the ratio of the True MSS/Total MSS, 1.50/3.33 = 0.45, which the KR20 item-count adjustment (x 11/10) raises to 0.50. [KR20 ranges from 0 to 1, not reproducible to fully reproducible]. The test SEM is then a portion, SQRT(1 – KR20) of the SD [also 1.83 in this example, SQRT(3.33)] = SQRT(1 – 0.50) * 1.83 = 1.36.

I was able to set the test SEM estimates using KR20 all to 1.36 for all four tests by setting the SD of student scores and the item error MSS to constant values by switching a 0 and 1 pair in student mark patterns. [The SD and the item error MSS do not have to be the same values.]

All possible individual student score binomial CSEM values for a test with 11 items are listed in Table 30. The CSEM is given as the SQRT(conditional variance). The conditional variance is: (X * (n – X))/(n – 1) or n * (p * q) * (n/(n - 1)), where X is the number of right marks, n is the number of items, p = X/n, and q = 1 – p. There is then no need to administer a test to calculate a student score binomial CSEM value. There is a need to administer a test to find the test SEM. The test SEM (Table 29) is the average of these values over the observed scores, 1.55.
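A table like Table 30 can be generated without any test data, as the paragraph notes. The sketch below uses the binomial form X(n – X)/(n – 1) and takes the test SEM as the average of the CSEM values for the observed scores; the function names and the list of observed scores are hypothetical.

```python
from math import sqrt

def binomial_csem(x, n):
    """Binomial CSEM for a raw score of x right marks on an n-item test."""
    return sqrt(x * (n - x) / (n - 1))

def csem_table(n):
    """CSEM for every non-extreme score (1 to n - 1 right marks)."""
    return {x: binomial_csem(x, n) for x in range(1, n)}

def average_sem(scores, n):
    """Test SEM taken as the average CSEM of the observed student scores."""
    return sum(binomial_csem(x, n) for x in scores) / len(scores)

table_11 = csem_table(11)

# Hypothetical score distribution for 12 students on 11 items.
observed_scores = [3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10]
sem_11 = average_sem(observed_scores, 11)
```

The table is symmetric about the middle of the scale (5 right and 6 right give the same value), and the values shrink toward the extremes, so a class with higher scores produces a smaller average.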

The student CSEM and thus the test SEM values are derived only from student mark patterns. They differ from the test SEM values derived from the KR20 (Table 31). With KR20 derived values held constant, the binomial CSEM derived values for SEM decreased with higher test scores. This makes sense. There is less room for chance events. Precision increases with higher test scores.

Given a choice, a testing company would select the KR20 method using CTT analysis to report test SEM results.

[The same SEM values for tests with 5 right and 6 right resulted from the fact that the midpoint of the 11-item scale is 5.5. The values for 5 right and 6 right fall an equal distance from this midpoint on either side: 5 right with 6 wrong and 6 right with 5 wrong both add up to 11 and give the same conditional variance.]

I positioned the green curve on Chart 73 using the above information.

A CSEM value is independent from the average test score and item difficulties. (Swapping paired 0s and 1s in student mark patterns to adjust the item error variance made no difference in the CSEM value.) The average of the CSEM values, the test SEM, is dependent on the number of items on the test with each value. If all scores are the same, the CSEMs and the SEM will be the same (Tables 30 and 31).

I hope at this stage to have a visual mathematical model that is robust enough to make meaningful comparisons with the Rasch IRT model. I would like to return to this model and do two things (or have someone volunteer to do it):

Combine all the features that have been teased out, in Chart 72 and Chart 73, into one model.
Animate the model in a meaningful way with change gages and history graphs.

Now to return to the Nursing data that represent the real classroom, filled with successful instruction, learning, and assessment.

- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request. (Files hosted at nine-patch.com are also being relocated now that Nine-Patch Multiple-Choice, Inc has been dissolved.)

Small Sample Math Model - Item Discrimination

2014-06-18T03:00:00.000-07:00

The ability of an item to place students into two distinct groups is not a part of the mathematical model developed in the past few posts. Discrimination ability, however, provides insight into how the model works. A practical standardized test must have student scores spread out enough to assign desired rankings. Discriminating items produce this spread of student scores.

Current CCSS multiple-choice standardized test scoring only ranks; it does not tell us what a student actually knows that is useful and meaningful to the student as the basis for further learning and effective instruction. This can be done with Knowledge and Judgment Scoring and the partial credit Rasch IRT model using the very same tests. This post is using traditional scoring as it simplifies the analysis (and the model) to just right and wrong; no judgment or higher levels of thinking are required of students.

I created a simple data set of 12 students and 11 items (Table 26) with an average score of 5. I then modified this set to produce average scores of 6, 7, and 8 (Table 27). [This can also be considered as the same test given to students in grades 5, 6, 7, and 8.]

The item error mean sum of squares (MSS), variance, for a test with an average score of 8 was 1.83. I then adjusted the MSS for the other three grades to match this value. A right and a wrong mark were exchanged in a student mark pattern (row) to make an adjustment (Table 27). I stopped with 1.85, 1.85, 1.83, and 1.83 for grades 5, 6, 7, and 8. (This forced the KR20 = 0.495 and SEM = 1.36 to remain the same for all four sets.)

The average item difficulty (Table 27) varied, as expected, with the average test score. The average item discrimination (Pearson r and PBR) (Table 28) was stable. In general, with a few outliers in this small data set, the most discriminating items had the same difficulty as the average test score. [This tendency for item discrimination to be maximized at the average test score is a basic component of the Rasch IRT model, which, by design, must use the 50% point.]

Scatter chart, Chart 71, has sufficient detail to show that items tend to be most discriminating when they have a difficulty near the average test score (not just near 50%).

The question is often asked, “Do tests have to be designed for an average score of 50%?” If the SD remains the same, I found no difference in the KR20 or SEM. [The observed SD is ignored by the Rasch IRT model used by many states for test analysis.]

The maximum item discrimination value of 0.64 was always associated with an item mark pattern in which all right marks and all wrong marks were in two groups with no mixing of right and wrong marks. I loaded a perfect Guttman mark pattern and found that 0.64 was the maximum corrected value for this size of data set. (The corrected values are better estimates than the uncorrected values in a small data set.)

Items of equal difficulty can have very different discrimination values. In Table 26, three items have a difficulty of 7 right marks. Their corrected discrimination values were 0.34 and 0.58.

Psychometricians have solved the problem this creates in estimating test reliability by deleting an item and recalculating the test reliability to find the effect of any item in a test. The VESEngine (free download below) includes this feature: Test Reliability (TR) toggle button. Test reliability (KR20) and item discrimination (PBR) are interdependent on student and item performance. A change in one usually results in a change in one or more of the other factors. [Student ability and item difficulty are considered independent using the Rasch model IRT analysis.] {I have yet to determine if comparing CTT to IRT is a case of comparing apples to apples, apples to oranges or apples to cider.}

Two additions to the model (Chart 72) are the two distributions of the error MSS (black curve) and the portion of right and wrong marks (red curve). Both have a maximum of 1/4 at the 50% point and a minimum of zero at each end. Both are insensitive to the position of right marks in an item mark pattern. The average score for right and for wrong marks is sensitive to the mark pattern as the difference between these two values determines part of the item discrimination value; PBR = (Proportion * Difference in Average Scores)/SD.
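A minimal sketch of the PBR calculation, using the standard textbook form in which the proportion term is SQRT(p * q) and the SD is the population SD of the scores. Whether the VESEngine computes its "Proportion" term in exactly this way is an assumption. The marks and scores are hypothetical, not Table 26.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(item_marks, scores):
    """PBR from the two group averages: (M_right - M_wrong) / SD * SQRT(p * q)."""
    right = [s for m, s in zip(item_marks, scores) if m == 1]
    wrong = [s for m, s in zip(item_marks, scores) if m == 0]
    p = len(right) / len(item_marks)
    return (mean(right) - mean(wrong)) / pstdev(scores) * sqrt(p * (1 - p))

def pearson_r(xs, ys):
    """Plain Pearson correlation, for comparison with the PBR."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: one item's marks and the total scores of 12 students,
# listed from highest to lowest score.
marks = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [9, 8, 8, 7, 6, 6, 5, 5, 4, 4, 3, 2]

pbr = point_biserial(marks, scores)
# Same 7 right marks (same p * q), but given to the lowest scoring students.
pbr_reversed = point_biserial(marks[::-1], scores)
```

The reversed pattern keeps the same item difficulty and the same p * q, yet its discrimination turns negative, which is the sensitivity to mark position described above.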

Traditional, classical test theory (CTT), test analysis can use a range of average test scores. In this example there was no difference in the analysis with average test scores of 5 right (45%) to 8 right (73%).

Rasch model item response theory (IRT) test analysis transforms normal counts into logits that have only one reference point of 50% (zero logit) when student ability and item difficulty are positioned on one common scale. This point is then extended in either direction by values that represent equal student ability and item difficulty (50% right) from zero to 100% (-50% to +50%) using the Rasch model IRT. This scale ignores the observed item discrimination.

- - - - - - - - - - - - - - - - - - - - -

Test Scoring Math Model - Precision

2014-05-07T03:00:00.000-07:00

The precision of the average test score can be obtained from the math model in two ways: directly from the mean sum of squares (MSS) or variance, and traditionally, by way of the test reliability (KR20).

I obtained the precision of each individual student test score from the math model by taking the square root of the sum of squared deviations (SS) within each score mark pattern (green, Table 25). The value is called the conditional standard error of measurement (CSEM) as it sums deviations for one student score (one condition), not for the total test.

I multiplied the mean sum of squares (MSS) by the number of items averaged (21) to yield the SS (0.154 x 21 = 3.24 for a 17 right mark score) (or I could have just added up the squared deviations). The SQRT(3.24) = 1.80 right marks for the CSEM. Some 2/3 of the time a re-tested score of 17 right marks can be expected to fall between 15.20 and 18.80 (15 and 19) right marks (Chart 70).
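The arithmetic for a single score can be sketched as follows, assuming the MSS within a mark pattern is p * q for the proportion of right marks. The function names are hypothetical.

```python
from math import sqrt

def csem_from_score(right, n_items):
    """CSEM in right marks: SQRT of the sum of squared deviations in one mark pattern."""
    p = right / n_items
    mss = p * (1 - p)        # variance (MSS) within the mark pattern
    ss = mss * n_items       # sum of squared deviations (SS)
    return sqrt(ss)

def retest_band(right, n_items):
    """One CSEM either side of the score (roughly 2/3 of re-tested scores)."""
    error = csem_from_score(right, n_items)
    return right - error, right + error

# A score of 17 right marks on the 21-item test.
csem_17 = csem_from_score(17, 21)
low, high = retest_band(17, 21)
```

Adding up the squared deviations mark by mark (17 marks of 1 and 4 marks of 0 around 17/21) gives the same SS as multiplying the MSS by 21.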

The test Standard Error of Measurement (SEM) is then the average of the 22 individual CSEM values (1.75 right marks or 8.31%).

The traditional derivation of the test SEM (the error in the average test score) combines the test reliability (KR20) and the SD (spread) of the average test score.

The SD (2.07) is from the SQRT(MSS, 4.08) between student scores. The test reliability (0.29) is the ratio of the true variance (MSS, 1.12) to the total variance (MSS, 4.08) between student scores (see previous post).

The expectation is that the greater the reliability of a test, the smaller the error in estimating the average test score. An equation is now needed to transform variance values on the top level of the math model to apply to the lower linear level.

SEM = SQRT(1 – KR20) * SD = SQRT(1 – 0.29) * 2.07 = SQRT(0.71) * 2.07 = 0.84 * 2.07 = 1.74 right marks, in close agreement with the 1.75 obtained by averaging the CSEM values.
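The same equation as a small function, as a sketch with a hypothetical name, using the KR20 and SD values quoted in this post.

```python
from math import sqrt

def sem_from_reliability(kr20, sd):
    """Test SEM as the portion of the SD left unexplained: SQRT(1 - KR20) * SD."""
    return sqrt(1.0 - kr20) * sd

# Values from this post: KR20 = 0.29 and SD = 2.07 right marks.
sem = sem_from_reliability(0.29, 2.07)   # about 1.7 right marks

# A more reliable test with the same SD has a smaller SEM.
sem_if_more_reliable = sem_from_reliability(0.64, 2.07)
```

At KR20 = 1 the SEM is zero, and at KR20 = 0 the SEM equals the full SD.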

The operation of “1 – KR20” aligns the value of 0.71 to extract the portion of the SD that represents the SEM. If the test reliability goes up, the error in estimating the average test score (SEM) goes down.

Chart 70 shows the variance (MSS), the SS, and the CSEM based on 21 items, for each student score. It also shows the distribution of the CSEM values that I averaged for the test SEM.

The individual CSEM is highest (largest error, poorer precision) when the student score is 50% (Charts 65 and 70). Higher student scores yield lower CSEM values (better precision). This makes sense.

The test SEM (the average of the CSEM values) is related to the distribution of student test scores (purple dash, Chart 70). Adding easy items (easy in the sense that the students were well prepared) decreases error, improves precision, reduces the SEM.

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started seven years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns is on a second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls. Quick Start

Test Scoring Math Model - Reliability

2014-04-23T03:00:00.001-07:00

An estimate of the reliability or reproducibility of a test can be extracted from the variation within the tabled right marks (Table 25). The variance from within the item columns is related to the variance from within the student score column.

The error within items variance (2.96) and total variance (MSS) between student scores (4.08) are both obtained from columns in Table 25b (blue, Chart 68). The true variance is then 4.08 – 2.96 = 1.12.

The ratio of true variance to the total variance between scores (1.12/4.08) becomes an indicator of test reliability (0.28). This makes sense.

A test with perfect reliability (4.08/4.08 = 1.0) would have no variation, error variance = 0, within the item columns in Table 25. A test with no reliability (0.0/4.08) would show equal values (4.08) for within item columns, and between test scores.

The KR20 formula then adjusts the above value (0.28 x 21/20) to 0.29 [from a large population (n) to a small sample value (n-1)]. The KR20 ratio has no unit labels (“var/var” = “”). All of the above takes place on the upper (variance) level of the math model.
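The steps above can be sketched from a table of right marks. The function follows the usual KR20 form (ratio of true to total variance, times items/(items – 1)) with population variances throughout; the small Guttman-style matrix is hypothetical and is not Table 25.

```python
from statistics import pvariance

def kr20(matrix):
    """KR20 for a students x items matrix of marks (1 = right, 0 = wrong)."""
    n_items = len(matrix[0])
    scores = [sum(row) for row in matrix]
    total_var = pvariance(scores)                        # between student scores
    columns = list(zip(*matrix))
    error_var = sum(pvariance(col) for col in columns)   # sum of p * q within items
    ratio = (total_var - error_var) / total_var          # true variance / total variance
    return ratio * n_items / (n_items - 1)

# Hypothetical mark patterns for 5 students on 4 items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_kr20 = kr20(example)
```

For this matrix the total variance is 2.0 and the within-item variance sums to 0.8, so the ratio is 0.6 before the 4/3 adjustment.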

Doubling the number of students taking the test (Chart 69) has no effect on reliability. Doubling the number of items doubles the error variance but increases the total variance by the square. The test reliability increases from 0.29 to 0.64.
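A sketch of the doubling arithmetic, under the assumption that doubling the items means adding an exact copy of every item column, so that each student score doubles. That assumption is mine; items that are merely similar would behave differently. The function name is hypothetical.

```python
def ratio_after_doubling_items(error_var, total_var):
    """True/total variance ratio after every item column is duplicated exactly.

    Each copy adds its own p * q, so the error variance doubles. Every score
    doubles, so the variance between scores is multiplied by four.
    """
    new_error = 2 * error_var
    new_total = 4 * total_var
    return (new_total - new_error) / new_total

# Values from this post: error variance 2.96 and total variance 4.08.
ratio_before = (4.08 - 2.96) / 4.08
ratio_after = ratio_after_doubling_items(2.96, 4.08)
```

Under that assumption the unadjusted ratio moves from about 0.27 to about 0.64. Duplicating student rows instead leaves both variances, and so the ratio, unchanged.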

The square root of the total variance between scores (4.08) yields the standard deviation (SD) for the score distribution [2.02 for (n) and 2.07 for (n-1)] on the lower floor of the math model.

- - - - - - - - - - - - - - - - - - - - -

Test Scoring Math Model - Variance

2014-03-05T03:00:00.000-08:00

The first thing I noticed when inspecting the top of the test scoring math model (Table 25) was that the variation within the central cell field has a different reference point (external to the data) than the variation between scores in the marginal cell column (internal to the data). Also the variation within the central cell field (the variance) is harvested in two ways: within rows (scores) and within columns (items).

The mean sum of squared deviations (MSS) or variance within a column or a row has a fixed range (Chart 64 and Chart 65). The maximum occurs when the marks are 1/2 right and 1/2 wrong (1/2 x 1/2 = 1/4 or 25%). [Variance also equals p * q or (Right * Wrong)/(Right + Wrong)^2.] The contribution each mark makes to the variance is distributed along this gentle curve. The variable data are fit to a rigid model.

I obtained the overall shape of these two variances by folding Chart 64 and Chart 65 into Photo 64-65. The result is a dome or a depression above or below the upper floor of the model.

The peak of the dome (maximum variance) is reached when a student functioning at 50% marks an item with 50% difficulty. Standardized test makers try to maximize this feature of the model. The larger the mismatch between item difficulty and student ability, the lower down the position of the variance on the dome. CAT attempts to adjust item difficulty to match student preparedness.

Chart 66 is a direct overhead view of the dome. Elevation lines have been added at 5% intervals from zero to 25%. I then fitted the data from Nursing124 to the roof of the model. The data only spread over one quadrant of the model. The data could completely cover the dome in an ideal situation in which every combination of score and difficulty occurred.

The total test variance within items is then the sum of the variance within all items (0.04 to 0.25 = 2.96). The total test variance within scores is the sum of the variance of all scores (0.05 to 0.24 = 3.33). See Table 8.

The math model adjusts to fit the data in the marginal cell student score column (variance between scores). The reference point is not a static feature of the model but the average test score (16.77 or 80%). The plot of the variance between scores can be attached to the right side of the math model (Chart 67).

The variance within columns and rows spreads across the static frame of the model. The model then adjusts to fit the variance between scores (rows) to match the spread of the active within rows.

I can see another interpretation of the model variance if the dome is inverted as a depression. As a flight instrument on a blimp: pitch, roll, and yaw (within item, 2.96; within score, 3.31; and between scores, 4.10) the blimp would have the nose up, rolled to the side, and with the rudder hard over.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Test Scoring Math Model - Input

2014-02-19T03:00:00.000-08:00

The mathematical model (Table 25) in the previous post relates all the parts of a traditional item analysis including the observed score distribution, test reproducibility, and the precision of a score. Factors that influence test scores can be detected and measured by the variation between and within selected columns and rows.

The model is only aware of variation within and between mark patterns (deviations from the mean). The variance (the sum of squared deviations from the mean divided by the number summed or the mean sum of squares or MSS) is the property of the data that relates the mark patterns to the normal distribution. This permits generating useful descriptive and predictive insights.

The deviation of each mark from the mean is obtained by subtracting the mean from the value of the mark (Table 25a). The squared deviation value is then elevated to the upper floor of the model (Step 1, Table 25b). [Un-squared deviations from the mean would add up to zero.]
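As a rough illustration of this step, the sketch below takes one hypothetical eight-mark pattern (not a row from Table 25), subtracts the mean from each mark, and squares the result. The helper name `squared_deviations` is an assumption made for this example.

```python
def squared_deviations(marks):
    """Return (mean, deviations, squared deviations, MSS) for a mark pattern."""
    mean = sum(marks) / len(marks)
    deviations = [m - mean for m in marks]
    squared = [d * d for d in deviations]
    mss = sum(squared) / len(marks)  # mean sum of squares = variance
    return mean, deviations, squared, mss


# Hypothetical pattern: 6 right marks and 2 wrong marks.
pattern = [1, 1, 0, 1, 0, 1, 1, 1]
mean, deviations, squared, mss = squared_deviations(pattern)
print(mean, sum(deviations), mss)
```

The un-squared deviations cancel to zero, while the squared deviations average to the variance for this pattern (0.75 x 0.25).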

[IF YOU ARE ONLY USING MULTIPLE-CHOICE TO RANK STUDENTS, YOU MAY WANT TO SKIP THE FOLLOWING DISCUSSION ON THE MEANING OF TEST SCORES WHEN USED TO GUIDE INSTRUCTION AND STUDENT DEVELOPMENT.]

The model’s operation gains meaning by relating the score and item mark distributions to a normal distribution. It compares observed data to what is expected from chance alone or as I like to call it, the know-nothing mean.

The expected know-nothing mean based on 0-wrong and 1-right with 4-option items (popular on standardized tests) is centered on 25%, 6 right out of 24 questions (Chart 62). This is from luck on test day alone (students only need to mark each item; they do not need to read the test) on a traditional multiple-choice test (TMC). The mean moves to 50% if student ability and item difficulty have equal value. It moves to 80% if students are functioning near the mastery level as seen in the Nursing124 data. The math model will adjust to fit these data.

The know-nothing mean, with Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM), is at 50% for a high quality student or 25% for a low quality student (same as TMC). Scoring is 0-wrong, 1-have yet to learn, and 2-right. A high quality student accurately, honestly, and fairly reports what is trusted to be useful in further instruction and learning. There are few, if any, wrong marks. A low quality student performs the same on both methods of scoring by marking an answer on all items. Students adjust the test to fit their preparation.
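The 0-1-2 scoring can be sketched as follows. The percent normalization (points earned divided by 2 points per item) and the function names `kjs_score` and `quality` are assumptions made for illustration; they are chosen so that an all-omit answer sheet lands on the 50% know-nothing mean described above.

```python
def kjs_score(right: int, wrong: int, omitted: int) -> float:
    """Score with 2 points per right mark, 1 per omit, 0 per wrong mark.

    Returned as a fraction of the maximum (2 points per item).
    """
    n_items = right + wrong + omitted
    if n_items == 0:
        raise ValueError("the test needs at least one item")
    return (2 * right + 1 * omitted) / (2 * n_items)


def quality(right: int, wrong: int):
    """Portion of marked items that are right; None if nothing was marked."""
    marked = right + wrong
    return right / marked if marked else None


# Hypothetical 24-item test.
print(kjs_score(0, 0, 24))   # marks nothing
print(kjs_score(6, 18, 0))   # marks everything, 6 right by luck
print(kjs_score(12, 0, 12))  # reports only what is trusted
```

Under these assumptions, a student who marks every item is scored exactly as on a traditional test, while a student who omits untrusted items starts from 50%.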

The know-nothing mean for Knowledge Factor (KF) is above 75% (near the mastery level in the Nursing124 data, violet). KF weights knowledge and judgment as 1:3, rather than 1:1 (KJS) or 1:0 (TMC). High-risk examinees do not guess. Test takers are given the same opportunity as teachers and test makers to produce accurate, honest, and fair test scores.

The distribution of scores about the know-nothing mean is the same for TMC (green, Chart 63) and KJS (red, Chart 63). An unprepared student can expect, on average, a score of 25% on a TMC test with 4-option items. Some 2/3 of the time the score will fall within +/- 1 standard deviation of 25%. As a rule of thumb, the standard deviation (SD) on a classroom test tends to be about 10%. The best an unprepared student can hope for is a score over 35% (25 + 10) about 1/6 of the time ((1 - 2/3)/2).
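For comparison with the rule of thumb, the sketch below treats pure guessing as a binomial process. That model, with independent items and equal chances for each option, is an assumption of this example and not part of the posted math model; the function names are hypothetical.

```python
from math import comb, sqrt


def know_nothing_mean(n_items: int, n_options: int) -> float:
    """Expected number right from marking every item by chance alone."""
    return n_items / n_options


def chance_sd(n_items: int, n_options: int) -> float:
    """Binomial standard deviation, in marks, of a pure-chance score."""
    p = 1 / n_options
    return sqrt(n_items * p * (1 - p))


def prob_at_least(k: int, n_items: int, n_options: int) -> float:
    """Probability of k or more right marks by chance alone."""
    p = 1 / n_options
    return sum(
        comb(n_items, r) * p**r * (1 - p) ** (n_items - r)
        for r in range(k, n_items + 1)
    )


# 24 four-option items: a score over 35% needs 9 or more right marks.
print(know_nothing_mean(24, 4), chance_sd(24, 4), prob_at_least(9, 24, 4))
```

The exact chance of beating 35% depends on the number of items, so the binomial figure will not match the 1/6 estimate exactly; the rule of thumb is built on a 10% classroom SD rather than on the chance SD computed here.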

The know-nothing mean (50%) for KJS and the PCRM is very different from TMC (25%) for low quality students. The observed operational mean at the mastery level (above 80%, violet) is nearly the same for high quality students electing either method of scoring. High quality students have the option of selecting items they can trust they can answer correctly. There are few to no wrong marks. [Totally unprepared high quality students could elect to not mark any item for a score of 50%.]

The mark patterns on the lower floor of the mathematical model have different meanings based on the scoring method. TMC delivers a score that only ranks the student’s performance on the test. KJS and the PCRM deliver an assessment of what a student knows or can do that can be trusted as the basis for further learning and instruction. Quantity (number right) and quality (portion marked that are right) are not linked. Any score below 50% indicates the student has not developed a sense of judgment needed to learn and report at higher levels of thinking.

The score and item mark patterns are fed into the upper floor of the mathematical model as the squared deviation from the mean (d^2). [A positive deviation of 3 and a negative deviation of 3 both yield a squared deviation of 9.] The next step is to make sense of (to visualize, to relate) the distributions of the variance (MSS) from columns and rows.

- - - - - - - - - - - - - - - - - - - - -

Test Scoring Mathematical Model

2014-02-05T03:00:00.000-08:00

The seven statistics reviewed in previous posts need to be related to the underlying mathematics. Traditional multiple-choice (TMC) data analysis has been expressed entirely with charts and the Excel spreadsheet VESEngine. I will need a TMC math model to compare TMC with the Rasch model IRT that is the dominant method of data analysis for standardized tests.

A mathematical model contains the relationships and variables listed in the charts and tables. This post applies the advice in learning discussed in the previous post. It starts with the observed variables. The mathematical model then summarizes the relationships in the seven statistics.

The model contains two levels (Table 25). The first floor level contains the observed mark patterns. The second floor level contains the squared deviations from the score and item means; the variation in the mark patterns. The squared values are then averaged to produce the variance. [Variance = Mean sum of squares = MSS]

1. Count

The right marks are counted for each student and each item (question). TMC: 0-wrong, 1-right captures quantity only. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) capture quantity and quality: 0-wrong, 1-have yet to learn this, 2-right.

Hall JR Count = SUM(right marks) = 20

Item 12 Count = SUM(right marks) = 21

2. Mean (Average)

The sum is divided by the number of counts. (N students, 22 and n items, 21)

The SUM of scores / N = 16.77; 16.77/n = 0.80 = 80%

The SUM of items / n = 17.57; 17.57/N = 0.80 = 80%
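The count and mean steps can be sketched on a small mark matrix. The 3-student by 4-item matrix below is hypothetical (the Nursing124 data have N = 22 and n = 21); rows are students and columns are items.

```python
# Hypothetical TMC mark matrix: 1 = right, 0 = wrong.
marks = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

N = len(marks)      # number of students (rows)
n = len(marks[0])   # number of items (columns)

# 1. Count: right marks per student (row) and per item (column).
student_scores = [sum(row) for row in marks]
item_counts = [sum(col) for col in zip(*marks)]

# 2. Mean: average score and average item count.
mean_score = sum(student_scores) / N
mean_item = sum(item_counts) / n

# Both means reduce to the same overall portion right.
print(mean_score / n, mean_item / N)
```

Dividing the mean score by the number of items, or the mean item count by the number of students, gives the same test average, just as both calculations above arrive at 80%.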

3. Variance

The variation within any column or row is harvested as the deviation between the marks in a student (row) or item (column) mark pattern, or between student scores, with respect to the mean value. The squared deviations are summed and averaged as the variance on the top level of the mathematical model (Table 25).

Variance = SUM(Deviations^2)/(N or n) = SUM of Squares/(N or n) = Mean SS = MSS

4. Standard Deviation

The variation within a score, item, or probability distribution is expressed on a normal scale: the mean +/- 1 standard deviation (1 SD) includes about 2/3 of a normal, bell-shaped distribution.

SD = Square Root of Variance or MSS = SQRT(MSS) = SQRT(4.08) = 2.02

For small classroom tests the (N-1) SD = SQRT(4.28) = 2.07 marks

The variation in student scores and the distribution of student scores are now expressed on the same normal scale.
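A minimal sketch of steps 3 and 4, using five hypothetical student scores rather than the Nursing124 data. The `sample` flag switches between dividing by N and dividing by N - 1, the small-class adjustment mentioned above; the function names are illustrative.

```python
from math import sqrt


def variance(values, sample: bool = False) -> float:
    """Mean sum of squared deviations; divide by N - 1 when sample is True."""
    count = len(values)
    if count < 2 and sample:
        raise ValueError("the N - 1 form needs at least two values")
    mean = sum(values) / count
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    return sum_of_squares / (count - 1 if sample else count)


def standard_deviation(values, sample: bool = False) -> float:
    """Square root of the variance (MSS)."""
    return sqrt(variance(values, sample))


scores = [20, 17, 15, 18, 14]  # hypothetical student scores
print(variance(scores), standard_deviation(scores))
print(variance(scores, sample=True), standard_deviation(scores, sample=True))
```

The N - 1 form always gives the slightly larger value, and the gap shrinks as the class gets bigger.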

5. Test Reliability

The ratio of the true variance to the score variance estimates the test reliability: the Kuder-Richardson 20 (KR20). The score (marginal column) variance – the error (summed from within Item columns) variance = the true variance.

KR 20 = ((score variance – error variance)/score variance) x (n/(n-1))

KR 20 = ((4.08 – 2.96)/4.08) x 21/20 = 0.29

This ratio is returned to the first floor of the model. An acceptable classroom test has a KR20 > 0.7. An acceptable standardized test has a KR20 >0.9.
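The KR20 arithmetic can be reproduced with a short function. This sketch uses the n/(n - 1) correction that matches the 21/20 in the worked numbers; the function name `kr20` is an illustrative choice.

```python
def kr20(score_variance: float, error_variance: float, n_items: int) -> float:
    """Kuder-Richardson 20 from the score variance and summed item variance."""
    if n_items < 2:
        raise ValueError("KR20 needs at least two items")
    if score_variance <= 0:
        raise ValueError("score variance must be positive")
    true_variance = score_variance - error_variance
    return (true_variance / score_variance) * (n_items / (n_items - 1))


# Values reported above: score variance 4.08, error variance 2.96, 21 items.
print(round(kr20(4.08, 2.96, 21), 2))
```

Holding the error variance fixed, a wider spread of student scores raises the ratio, which is why discriminating items drive reliability upward.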

6. Traditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall is the standard error of measurement (SEM). The traditional SEM is based on the average performance of your class: 16.77 +/- 1SD (+/- 2.07 marks).

SEM = SQRT(1-KR20) * SD = SQRT(1- 0.29) * 2.07 = +/-1.75 marks

On a test that is totally reliable (KR20 = 1), the SEM is zero. You can expect to get the same score on a retest.
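A minimal sketch of the SEM calculation follows; the function name `sem` is an assumption. With the rounded inputs 0.29 and 2.07 the result comes out at about 1.74 to 1.75 marks, depending on where rounding is applied.

```python
from math import sqrt


def sem(kr20: float, sd: float) -> float:
    """Standard error of measurement: SQRT(1 - KR20) * SD."""
    if kr20 > 1:
        raise ValueError("KR20 above 1 leaves no error variance to measure")
    return sqrt(1 - kr20) * sd


print(sem(0.29, 2.07))  # the classroom test above
print(sem(1.0, 2.07))   # a totally reliable test
```

The second call shows the limiting case: with KR20 = 1 the error term vanishes and the SEM is zero.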

7. Conditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall based on the rank of your test score alone (conditional on one score rank) is the conditional standard error of measurement (CSEM). The estimate is based (conditional) on your test score rather than on the average class test score.

CSEM = SQRT((Variance within your Score) * n number of questions) = SQRT(MSS * n) = SQRT(SS)

CSEM = SQRT(0.15 * 21) = SQRT(3.15) = 1.80 marks

The average of the CSEM values (1.75) for all of your class (light green) also yields the test SEM. This confirms the above calculation in 6. Traditional Standard Error of Measurement for the test.
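The CSEM formula can be sketched from a raw score alone, since the variance within a 0/1 mark pattern is p * q. The score of 17 right out of 21 below is a hypothetical example chosen for illustration, and the function name `csem` is an assumption.

```python
from math import sqrt


def csem(right: int, n_items: int) -> float:
    """Conditional SEM: SQRT(variance within the score * number of items)."""
    if not 0 <= right <= n_items or n_items == 0:
        raise ValueError("right must lie between 0 and n_items")
    p = right / n_items
    variance_within = p * (1 - p)
    return sqrt(variance_within * n_items)


print(csem(17, 21))  # a mid-range score
print(csem(21, 21))  # a perfect score
print(csem(0, 21))   # nothing right
```

The estimate is largest for scores near half right and shrinks to zero at either extreme, which is what makes it conditional on the score.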

This mathematical model (Table 25) separates the flat display in the VESEngine into two distinct levels. The lower floor is on a normal scale. The upper floor isolates the variation within the marking patterns on the lower floor. The resulting variance provides insight into the extent that the marking patterns could have occurred by luck on test day and into the performance of teachers, students, questions, and the test makers. Limited predictions can also be made.

Predictions are limited using traditional multiple-choice (TMC) as students have only two options: 0-wrong and 1-right. Quantity and quality are linked into a single ranking. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) separate quantity and quality: 0-wrong, 1-have yet to learn, and 2-right. Students are free to report what they know and can do accurately, honestly, and fairly.

- - - - - - - - - - - - - - - - - - - - -

Test Scoring Myths for Students

2014-01-29T03:00:00.000-08:00

The best test is a test that permits you to accurately, honestly, and fairly report what you know and can do. You know how to question, to get answers, and to verify. You know what you know and what you have yet to learn. This operates at two levels of thinking. It is a myth that a forced choice multiple-choice test measures what you trust you know and can do.

At the beginning of any learning operation, you learn to repeat and to recall. Next you learn to relate the bits you can repeat and recall. By the end of a learning operation you have assembled a web of skills and relationships. You start at lower levels of thinking and progress to higher levels of thinking. Practice takes you from slow conscious operations to fast automatic responses (multiplication or roller skating). It is a myth that learning primarily occurs only by responding to a teacher in a classroom.

Your attitude during learning and testing is important. Your maturity is indicated by your ability to get interested in new topics or activities your teacher recommends (during the course). As a rule of thumb, a positive attitude is worth about one letter grade on a test. It is a myth that you can easily learn when you have a negative attitude.

Your expectations are important. You tend to get what you expect. A nine year study with over 3000 students indicated that students tend to get the grade they expected at the time they enrolled in the class, based on their lack of information, misinformation, and attitude. It is a myth that you cannot do better than your preconceived grade.

Learning and testing are one coordinated event when you can see the result of your practicing directly (target practice or skateboarding). This situation also occurs when you are directly tutored by a person or by a person’s software. It is a myth that you must always take a test separately from learning.

Complex learning operations go through the same sequence of learning steps. The rule of three applies here. Read or practice from one source to get the basic terms or actions. Read or practice from a second set to add any additional terms or actions. Read or practice from a third set to test your understanding, your web of knowledge and skill relationships. It is a myth that you must always have another person test your learning (but another person can be very helpful).

That other person is usually a teacher who cannot teach and test each pupil or student individually. The teacher also selects what is to be learned rather than letting you make the choice. The teacher also selects the test you will take. It is a myth that your teachers have the qualities needed to introduce you to the range of skills and knowledge required for an honest, self-supporting citizen.

Teaching usually takes place during scheduled time periods. In extreme situations, only what is learned in those scheduled time periods will be scored. This is one basis for assessing teacher effectiveness. It is a myth that the primary goal of traditional schools is student learning and development.

Traditional multiple-choice is defective. It was crippled when the option of no response, “do not know”, was eliminated when adapted from its use with animal experiments to make classroom scoring easier. It is a myth that you should not have this option to permit accurate, honest, and fair assessment.

Traditional multiple-choice promotes selecting the best right answer: using the lowest levels of thinking. The minimum requirement is making a mark for each question. It is a myth that such a score measures what you know or can do. The score ranks you on the test.

The average test score describes the test, not you. (Table 15 or Download)

Your score may rank you above or below average. It is a myth that you will always be safe with an above average score (passing).

The normal distribution of multiple-choice test scores is based on your luck on test day. The normal distribution is desired for classes in schools designed for failure. It is a myth that a class should not have an average score of 90%.

Luck on test day will distribute 2/3 of your classmates’ multiple-choice scores within the bubble in the center of a normal distribution; that is one standard deviation (SD) from the average. (Table 15 or Download) [SD = SQRT(Variance) and the Variance = SUM(Deviation from the Average^2)/N = Mean Sum of Squares = MSS]

Your grade (cut score) is set by marking off the distribution of classmate scores in standard deviations: F (< -2); D (-2 to -1); C (-1 to +1); B (> +1); A (> +2). Your raw score grade is the sum of what you know and can do, your luck on test day, and your set of classmates.

Raw scores can be adjusted by shifting their distribution, higher or lower, and by stretching (or shrinking) the distribution to get a distribution that “looks right”. It is a myth that your teacher can only select the right mix of questions to get a raw score distribution that “looks right”.

Some questions perform poorly. They can be deleted and a new, more accurate, scored distribution created. It is a myth that every question must be retained.

Discriminating questions are marked right only by high scoring classmates and marked wrong by low scoring classmates. (Table 15 or Download) It is a myth that all questions should be discriminating.

Discriminating questions produce your class raw score distribution. About 5 to 10 are needed to create the amount of error that yields a range of five letter grades. It is a myth that discriminating questions assess mastery.

The reliability (reproducibility, precision) of your raw score can be predicted, but not your final (adjusted) score. Test reliability (KR20) is based on the ratio of variation (the variance) from between student scores (external column) and within question difficulty mark patterns (internal columns). (Table 15 or Download)

This makes sense: The smaller the amount of error variance within the question difficulty internal columns, with respect to the variance between student scores in the external column, the greater the test reliability. Discriminating, difficult, questions spread out student scores more (yield higher variance) than they increase the error variance within the questions. If there were no error variance, a test would be totally reliable (KR20 = 1). It is a myth that a good informative test must maximize reliability.

The test reliability can help predict the average test score your class would get if it were to take another test over the same set of skills and knowledge. The Standard Error of Measurement (SEM) of your test is the range of error (from all of the above effects) for the average test score. (Table 15 or Download) The SD of the test and the test reliability are combined to obtain the SEM. The test reliability extracts a portion of the SD. If the test reliability were 1 (totally reliable), the SEM would be 0 (no error), the class would be expected to get the same class test score on a retest.

And finally, what can you expect about the precision of your score and your retest score (providing you have not learned any more)? A retest is of critical importance to students needing to reach a high stakes cut score. If the SEM or CSEM ranges widely enough, you do not need to study. Just retake the test a couple of times and your luck on test day may get you a passing score. It is a myth that a 2/3 probability of getting a passing grade will ensure you get the passing grade if you need a second trial.

The Conditional [on your raw score] Standard Error of Measurement (CSEM) extracts the variance from only your mark pattern (Table 22). [CSEM = SQRT(Variance within your marks X the number of questions)] Your CSEM will be very small if you have a very high or low score. This limits the prospects of a passing score by retaking a test without studying.

Now to study, to change testing habits, or to trust to luck on test day, before a retest. Get a copy of the blueprint used in designing the test. A blueprint lists in detail what will be covered and the type of questions. Question each topic or skill. It is easier to answer questions other people have written if you have already created and answered your own questions. Use the advice in the first five paragraphs above and work up into higher levels of thinking, meaning making (a web of relationships that makes sense to you and visualize, sketch, draw, every term).

A change in testing habits may also be in order. Many students who do not “test well” are bright, fast memorizers, but lacking in meaningful relationships that make sense to themselves. They are still learning for someone else: the test and scanning each question for the “one right answer”. With meaningful relationships in mind you have the information in hand to answer a number of related questions. You are not limited to just matching what you recall to the question answers. [Mark out wrong answers and guess from the remaining answers.]

And now for the “Hail Mary” approach. First, as a rule of thumb, your score on a test written by someone other than your teacher (a standardized test for example) will be one to two letter grades below your classroom test scores. If your failing test score is within 1 SEM of the cut score, you can expect a retest score within this range 2/3 of the time. The same prediction is made with your CSEM value that can range above and below the SEM value. If your failing test score is below 1 SEM or 1 CSEM from the cut score, you have no option other than to study. It is a myth that students passing a few points above the cut score will also pass on a retest. [Near passes are safe. Near failures are not.]

Also please keep in mind that all of the math dealing with the variation between and within columns and rows (the variance) can be done on the student and question mark patterns with no knowledge of the test questions or the students. It is a myth that good statistical procedures can improve poor question or student performance. Teacher and psychometrician judgment on the other hand can do wonders!

The standardized test paradox: A good blueprint to guide calibrated question selection for the test is the basis for low scores and a statistically reliable test. Good student preparation is the basis for high scores (mastery) and a statistically unreliable test (it cannot spread student scores out enough for the distribution to “look right”).

The sciences, engineering, and manufacturing use statistics to reduce error to a minimum (low maintenance cars, aircraft, computers, and telephones). Only in traditional institutionalized education (schools designed for failure) is error intentionally introduced to create a score range that “looks right” for setting grades and ranking schools. This is all non-sense for schools designed for mastery (who advance students after they are prepared for the next steps). It is a myth (and an entrenched excuse for failure by the school) that student score distributions must fit a normal, bell-shaped, curve of error.

Mastery schools are now being promoted as the burden of record keeping is easily computerized. The Internet makes mastery schools available everywhere and at any time. This will have a marked change in traditional schooling in the next few years. This change can be seen in the “flipped” classroom (a modern version of assigned [deep] reading before class discussion). It is a myth that the “flipped” classroom is something new.

Current educational software removes the time lag, in the question-answer-and-verify learning cycle, introduced by grouping students in classes, and then extended with standardized tests. Learning and assessment are again joined to promote mastery of assigned skills and knowledge. Students advance when they are ready to succeed at the next levels. It is a myth that “formative assessments” are actually functional when test results are not available in an operational time frame (seconds to a few days).

Standardized tests will continue to rank students and schools, as the tests mature to certifying mastery for students who learn and excel anywhere and at anytime. It is a myth that current substantive standardized tests (that do not let students report what they trust they know or can do) can “pin point exactly what a student knows and needs to learn”.

- - - - - - - - - - - - - - - - - - - - -

The Value and Meaning of a Mark

2013-11-06T04:00:00.000-08:00

The bet in the title of Catherine Gewertz’s article caught my attention: “One District’s Common-Core Bet: Results Are In”. As I read, I realized that the betting that takes place in traditional multiple-choice (TMC) was being given arbitrary valuations to justify the difference between a test score and a classroom observation. If the two agreed, that was good. If they did not agree, the standardized test score was dismissed.

TMC gives us the choice of a right mark and several wrong marks. Each is traditionally given a value of 1 or 0. This simplification, carried forward from paper and pencil days, hides the true value and the meanings that can be assigned to each mark.

The value and meaning of each mark changes with the degree of completion of the test and the ability of the student. Consider a test with one right answer and three wrong answers. This is now a popular number for standardized tests.

Consider a TMC test of 100 questions. The starting score is 25, on average. Every student knows this. Just mark an answer to each question. Look at the test and change a few marks, that you can trust you know, to right. With good luck on test day, get a score high enough to pass the test.

If a student marked 60 correctly, the final score is 60. But the quality of this passing score is also 60%.

Part of that 60% represents what a student knows and can do, and part is luck on test day. A passing score can be obtained by a student who knows or can do less than half of what the test is assessing; a quality below 50%. This is traditionally acceptable in the classroom. [TMC ignores quality. A right mark on a test with a score of 100 has the same value, but not the same meaning as a right mark on a test with a score of 50.]

A wrong mark can also be assigned different meanings. As a rule of thumb (based on the analysis of variance, ANOVA; a time honored method of data reduction), if fewer than five students mark a wrong answer to a question, the marks on the question can be ignored. If fewer than five students make the same wrong mark, the marks on that option can be ignored. This is why Power Up Plus (PUP) does not report statistics on wrong marks, but only on right marks. There is no need to clutter up the reports with potentially interesting, but useless and meaningless information.

PUP does include a fitness statistic not found in any other item analysis report that I have examined. This statistic shows how well the test fits student preparation. Students prepare for tests; but test makers also prepare for the abilities of test takers.

The fitness statistic estimates the score a student is expected to get if, on average, as many wrong options are eliminated as are non-functional on the test, before guessing; with NO KNOWLEDGE of the right answer. This is the best guess score. It is always higher than the design score of 25. The estimate ranged from 36% to 53%, with a mean of 44%, on the Nursing124 data. Half of these students were self-correcting scholars. The test was then a checklist of how they were expected to perform.
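One way to read this description is as a guess among the options that remain after the non-functional wrong options are set aside. The sketch below follows that reading only; it is an assumption about the arithmetic, not PUP's actual calculation, and the function name `best_guess_score` is hypothetical.

```python
def best_guess_score(n_options: int, eliminated_wrong: float) -> float:
    """Expected portion right when guessing among the remaining options.

    eliminated_wrong is the average number of wrong options per item that
    can be discarded before guessing (it may be fractional).
    """
    remaining = n_options - eliminated_wrong
    if remaining < 1:
        raise ValueError("at least the right answer must remain")
    return 1 / remaining


print(best_guess_score(4, 0))  # design score, no options eliminated
print(best_guess_score(4, 1))  # one wrong option eliminated per item
print(best_guess_score(4, 2))  # two wrong options eliminated per item
```

Under this reading, any eliminated option lifts the estimate above the 25% design score, which agrees with the statement that the best guess score is always higher.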

With the above in mind, we can understand how a single wrong mark can be devastating to a test score. But a single wrong mark, not shared by the rest of the class can be taken seriously or ignored (just as a right mark, on a difficult question, by a low scoring student).

To make sense of TMC test results requires both a matrix of student marks and a distribution of marks for each question (Break Out Overview). Evaluating only an individual student report gives you no idea whether a student missed a survey question that every student was expected to answer correctly or a question that the class failed to understand.

Are we dealing with a misconception? Or a lack of performance related to different levels of thinking in class and on the test; or related to the limits of rote memory to match an answer option to a question? [“It’s the test-taking.”] When does a right mark also mean a right answer or just luck on test day? [“This guy scored advanced only because he had a lucky day.”]

Mikel Robinson, as an individual, failed the test by 1 point. Mikel Robinson, as one student in a group of students, may not have failed. [We don’t really know.] His score just fell on the low side of a statistical range (the conditional standard error of measurement; see a previous post on CSEM). Within this range, it is not possible to differentiate one student’s performance from another’s using current statistical methods and a TMC test design (students are not asked if they can use the question to report what they can trust they actually know or can do).

We can say, that if he retook the test, the probability of passing may be as high as 50%, or more, depending upon the reliability and other characteristics of the test. [And the probability of those who passed by 1 point, of then failing by one point on a repeat of the test, would be the same.]

These problems are minimized with accurate, honest, and fair Knowledge and Judgment Scoring (KJS). You can know when a right mark is a right answer using KJS or the partial credit Rasch model IRT scoring. You can know the extent of a student’s development: the quality score. And, perhaps more important, is that your students can trust what they know and can do too; during the test, as well as after the test. This is the foundation on which to build further long lasting learning. This is student empowerment.

Welcome to the KJS Group: Please register at KJSgroup@nine-patch.com. Include something about yourself and your interest in student empowerment (your name, school, classroom environment, LinkedIn, Facebook, email, phone, etc.).

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Other free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

FOR SALE: raschmodelaudit.blogspot.com/2013/10/knowledge-and-judgment-scoring-kjs-for.html

Growth Mindset

2013-10-30T04:00:00.000-07:00

The article by Sarah D. Sparks, http://www.edweek.org/ew/articles/2013/09/11/03mindset_ep.h33.html?r=545317799, starts with a powerful concept: “It’s one thing to say all students can learn, but making them believe it – and do it – can require a 180-degree shift in student’s and teacher’s sense of themselves and of one another.”

The General Studies Remedial Biology course I taught faced this challenge. The course was scheduled at night for three consecutive hours in a 120-seat lecture room. I refused to teach the course until the following arrangements were made:

The entire text was presented by cable online reading assignments in each dormitory room and by off-campus phone service.
One hour was scheduled for my lecture, after any student presentations related to the scheduled topic.
One hour was scheduled for written assessment every other week.
One hour was scheduled for 10-minute student oral reports based on library research, actual research, or projects.

Students requested the assessment period be placed in the first hour instead of the second hour, after the first few semesters. This turned the course into a seminar for which students needed to prepare on their own before class.

Only Knowledge and Judgment Scoring (KJS) was used the first few semesters, with ready acceptance by the class. The policy of bussing in students from out of the Northwest Missouri region brought in protestors, “Why do we have to know what we know, when everywhere else on campus, we just mark, and the teacher tells us how many right marks we made?”

Offering both methods of scoring, traditional multiple-choice (TMC) and KJS, on the same test solved that problem. Students could select the method they felt most comfortable with; that matched their preparation the best.

The student presentations and reports were excellent models for the rest of the class. They showed the interest in the subject and the quality of work these students were doing to the entire class.

KJS provided the information needed to guide passive pupils along the path to becoming self-correcting scholars. As a generality, that path took the shape of a backward J. First they made fewer wrong marks, next they studied more, and finally they switched from memorizing nonsense to making sense of each assignment.

Over time they learned they were now spending less time studying (reviewing everything) and getting better grades by making sense as they learned; they could actually build new learning on what they could trust they had learned. They could monitor their progress by checking their quality score and their quantity score. Get quality up, interest and motivation increase, and quantity follows.

The tradition of students comparing their score with that of the rest of the class to see if they were safe, or needed to study more, or had a higher grade than expected when enrolling in the course (and could take a vacation), was strong in the fall semester with the distraction of social groups, football and homecoming. The results of fall and spring semesters were always different.

There was one dismal failure. With the excellent monitoring of their progress in the course, the idea was advanced to recognize class scholars. These students had, in one combination or another of test scores and presentations, earned a class score that could not be changed by any further assessment. They had demonstrated their ability to make sense of biological literature (the main goal of the course, which, hopefully, would serve them well the rest of their lives, as would the habit of making sense of assignments in their other courses). The next semester all went as planned. Most continued in the class and some conducted study sessions for other students.

The following semester witnessed an outbreak of cheating. Today, Power Up Plus (PUP) gets its name from the original cheat checker added to Power Up. Cheating became manageable by the simple rule that any answer sheet that failed to pass the cheat checker would receive a score of zero. I offered to help any student who wished to protest the rule to the student disciplinary committee. No student ever protested.

[Cheating was handled in-class as any use of the university rules was not honored by the administration. You must catch individual students in the act. Computer cheat checkers had the same status as red light cameras do now. If more than one student is caught, the problem is with the instructor, not with the student. We cancelled the class scholar idea.]

We need effective tools to manage student “growth mindset”. The tools must be easy to use by students and faculty. Students need to see how other students succeed, to be comfortable in taking part, and be able to easily follow their progress when starting at the low end of academic preparation of knowledge, skills, and judgment (quality, the use of all levels of thinking).

A common thread runs through successful student empowerment programs: Effective instruction is based on what students actually know, can do, and want to do or to take part in. This requires frequent, appropriate assessment at each academic level, as in these recent examples:

Elementary School http://smartblogs.com/education/2013/09/25/closing-the-achievement-gap-in-a-high-poverty-school/
Middle School http://www.edweek.org/ew/articles/2013/09/11/03common_ep.h33.html
High School http://www.edweek.org/ew/articles/2013/09/11/03mindset_ep.h33.html?r=545317799
College and wherever multiple-choice is used for accurate, honest, and fair assessments http://www.nine-patch.com


Alternative Multiple-Choice Origins

2013-10-23T04:00:00.000-07:00

Two alternative forms of multiple-choice (AMC) to the traditional multiple-choice (TMC) developed from independent sources. Geoff Masters from Melbourne, Australia, is credited as the developer, in 1982, of the partial credit Rasch model (PMC), a form of Item Response Theory (IRT) analysis (Bond and Fox). It allows students to report what they know (2 points) and what they do not know (1 point); a wrong answer earns 0 points. It never became popular on classroom or standardized tests.

The second form of AMC was developed at NWMSU. It started as net yield scoring (NYS) on both essay and multiple-choice. I needed a way to reduce the amount of reading required in scoring “blue book” essays. A 20-point essay started with 10 points. A point was added for acceptable, related, information bits. A point was subtracted for unacceptable, incorrect, unrelated information bits. An information bit was basically a short sentence with correct grammar and spelling. It could also be a relationship expressed as a diagram, sketch, or drawing.
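The essay rule described above can be sketched in a few lines. This is a minimal illustration, not the original grading procedure: the function name `net_yield_score` is hypothetical, and holding the result between 0 and 20 is an assumption, since the post does not say what happened at the extremes.

```python
def net_yield_score(acceptable: int, unacceptable: int,
                    start: int = 10, maximum: int = 20) -> int:
    """Net yield score for one essay: begin at `start`, add a point for each
    acceptable information bit, subtract a point for each unacceptable one."""
    if acceptable < 0 or unacceptable < 0:
        raise ValueError("bit counts must be non-negative")
    raw = start + acceptable - unacceptable
    # Assumption: the score is held between 0 and the essay's point value.
    return max(0, min(maximum, raw))


print(net_yield_score(acceptable=7, unacceptable=2))  # 15
print(net_yield_score(acceptable=3, unacceptable=6))  # 7
```

Under this rule, padding an essay with filler cannot raise the score, and wrong or unrelated statements lower it, which is the judgment element the scoring was meant to reward.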

This reduced the amount of reading by more than one third and improved student performance. Snow, filler, and fluff had no value and only distracted a student from doing good work. Students needed to exercise good judgment in selecting what they wrote. It was no longer a case of students writing, and the teacher searching, for something that could earn them sufficient credit to pass the course; a lower level of thinking operation that is very common in high schools and colleges. NYS required students to use good judgment as well as to be knowledgeable and skilled.

This same idea was applied to computer scored multiple-choice tests with interesting results. When both TMC and NYS were offered on the same test, most students selected TMC on their first test. This is what they were familiar with. Over 90% of students elected NYS on their third test. Students also agreed that knowledge and judgment should have equal value.

By 1981 NYS was renamed knowledge and judgment scoring (KJS) to reflect what was being assessed: good judgment and a right answer (2 points), good judgment to report what has yet to be learned with no mark (1 point), and poor judgment, a wrong mark (0 points).
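The 2/1/0 point values can be turned into a small scoring sketch. It is an illustration only, not the code inside Power Up Plus: the function name `kjs_scores` and the dictionary keys are hypothetical, and the quality score is assumed here to be right marks as a share of the items a student chose to mark.

```python
from typing import Optional, Sequence

RIGHT, OMIT, WRONG = 2, 1, 0  # point values stated in the post


def kjs_scores(marks: Sequence[Optional[str]], key: Sequence[str]) -> dict:
    """Score one answer sheet; None means the item was left unmarked."""
    if not key or len(marks) != len(key):
        raise ValueError("marks and key must be non-empty and the same length")
    right = sum(1 for m, k in zip(marks, key) if m is not None and m == k)
    omitted = sum(1 for m in marks if m is None)
    wrong = len(key) - right - omitted
    marked = right + wrong
    return {
        "right": right,
        "omitted": omitted,
        "wrong": wrong,
        # KJS test score: points earned over points possible.
        "kjs_pct": 100 * (RIGHT * right + OMIT * omitted) / (RIGHT * len(key)),
        # TMC score: right marks over all items.
        "tmc_pct": 100 * right / len(key),
        # Assumed definition: right marks as a share of items actually marked.
        "quality_pct": 100 * right / marked if marked else None,
    }


key = list("ABCDABCDAB")
careful = ["A", "B", "C", "D", "A", None, None, None, None, None]
print(kjs_scores(careful, key)["kjs_pct"])      # 75.0
print(kjs_scores(careful, key)["quality_pct"])  # 100.0
```

In this example the student marks only the five items they trust and omits the rest, earning 75% under KJS with a quality score of 100%, while the same sheet counts as 50% under TMC.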

KJS requires and rewards students for using higher levels of thinking. The quality score is independent from the right count score. A struggling student with a test score of 60% may have also earned a quality score of 90%.

With TMC there is no way of knowing what a student with a score of 60% actually knows (when a right mark is a right answer or just luck on test day). With KJS we can know what this student knows with the same degree of accuracy as a student earning a 90% score on a TMC test.

More importantly, this reinforces the student’s sense of self-judgment and encourages effort to do better. It is the equivalent to the note a teacher marks on a special paragraph in an essay, “Good work!”

KJS provides the information needed to tell student and teacher what has been learned and what has yet to be learned in an easy to use report. Often a trail of bi-weekly test scores would follow a backward J. Reducing guessing by itself did not increase the test score but moved the score to a higher quality. Low quality students needed to change study habits. Low scoring high quality students needed to study more.

Learning by questioning and establishing relationships gave students the basis for correctly answering questions they had never seen before. They then stumbled onto what I meant by, “Make things meaningful (full of relationships) if your learning is to be really useful, empowering and easy to remember”. They did not have to review everything for each cumulative test.

The most interesting finding was that when students mastered meaning-making, they found themselves doing better in all of their courses. This is what inspired me to continue to promote Knowledge and Judgment Scoring. Students learn best when they are in charge. The quality score was the “feel good” score for struggling students until their improving development produced the high scores earned by successful self-correcting students.


Knowledge and Judgment Scoring - Operational to Instructional

2013-10-16T04:00:00.000-07:00

This post (and the next three) introduces why we need a KJS Group. The software, Power Up Plus (PUP), that contains both Knowledge and Judgment Scoring (KJS) and traditional multiple-choice (TMC) is now free to registered KJS Group members. Version 5.22 is free to teachers and administrators. Please see instructions below.

This reflects a change in use of the software as an operational program for scoring individual classroom tests, to use as an instructional program to promote student and teacher development in preparation for the CCSS movement assessments. Students and teachers can readily see the difference between lower and higher levels of thinking when students are offered the opportunity to report, in a non-threatening environment, what they actually trust they know and can do, that serves as the basis for further learning and instruction. Practice riding the tricycle is poor preparation for a riding test on a bicycle.

Last week I finished a series of 22 posts on this Multiple-Choice Reborn blog. The series makes clear that no amount of “statistical work” can extract from TMC marked answer sheets some of the claims now being marketed about them. These tests can, at best, only do a good job of ranking students.

They so imperfectly and incompletely tell us what students know and can do that North Carolina is now spending six months figuring out how and where to place the cut scores on their new CCSS traditionally scored end-of-grade, multiple-choice math test results.

[They must guess where to put the cut score on the results from uncommitted, low scoring, improperly prepared students, who were guessing at the right answers to questions that the test maker guessed would produce a satisfactory score distribution with high statistical reliability and precision. The more nonsensical the student mark data are, the more subjective the process.]

Accurate, honest, and fair testing can be done with Knowledge and Judgment Scoring and the partial credit Rasch model analysis. These methods allow students to report what they actually know and can do that is meaningful, useful, and empowering. Student development (the judgment to appropriately use all levels of thinking) is as important as knowledge and skills for successful students and employees (Knowledge Factor).

The NCLB decade has laid the foundation for real change by making schools designed for failure (that promote students beyond their abilities, rather than developing the necessary abilities for their success) so bad and so visible, that something had to be done. The CCSS movement has rekindled the old alternative (to TMC) testing and authentic testing methods; with the addition of CAT and elaborate assessment methods.

My concern now is that, after expending a large amount of time and money on promoting the CCSS movement ideals, a major part of the assessments will once again be reduced to traditional guess testing at the lowest levels of thinking.

Both KJS and TMC scoring can use the same test questions. In fact both methods are used on the same test to accommodate students working at all levels of thinking and with all degrees of preparation (PUP).

IMHO, KJS is a practical method of achieving the CCSS movement goals. It prepares students for standardized tests presented at all levels of thinking. [I still cannot predict when KJS or the partial credit Rasch model will be used on standardized tests as current standardized tests are not designed to assess what students know or can do. They are designed, using the fewest questions, to produce an acceptable spread of student scores.]

Rather than a rank of 60 on a test, a student may get a quality score of 90% on questions used to report what the student actually knows and can do, as well as a rank of right marks on the test using KJS. We now know what a “just passing” student knows with the same accuracy as a student earning a 90% score on a traditional test. This can be valuable formative assessment information.

Letting students tell us what they know or can do makes more sense than the guessing game now in use during preparation and assessment. And over 90% of my students preferred Knowledge and Judgment Scoring after just two experiences with it. Even students like an honest and fair test over gambling for a grade.

Past performance in my classroom is no guarantee of performance in your classroom unless you are a like-minded teacher, administrator, or test maker.

[The Educational Software Cooperative, Inc. (non-profit) closed this year (2013) after 20 years of operation during which I was the volunteer treasurer. It was founded to maximize the benefits of an individual computer: infinite patience, non-judgmental, and best of all, instant formative feedback. That level of instruction and record keeping has now been surpassed by the necessity for district wide record keeping systems operating online assessments keyed to CCSS learning objectives.]

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS: PUP522xlsm.zip, 606 KB or PUP522xls.zip, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Multiple-Choice Test Analysis - Summary

2013-10-09T04:00:00.000-07:00

The past 21 posts have explored how classroom and standardized tests are traditionally analyzed. The six most commonly used statistics are made fully transparent in Post 10, Table 15, the Visual Education Statistics Engine (VESE) [Free VESEngine.xlsm or VESEngine.xls]. One more statistic was added for current standardized tests. Numbers must be meaningful and understood to have valid, practical value.

Count: The count is so obvious that it should not be a problem. But it is a problem in education. Counting right marks is not the same as counting what a student knows or can do. Also a cut score is often set by selecting a point in a range from 0% to 100%. A cut score of 50 means 50%. But the test, when administered as traditional multiple-choice, starts each student at 25% with 4-option questions. [There is no way to know what low scoring students know, only their rank.]
Average: Add up all of the individual student scores and divide by the number of students for the class or test average score. [There is no average student.] Classes or tests can be compared by their averages just as students can be compared by their counts or scores.
Standard Deviation (SD): Theoretically, 2/3 of the counts on a distribution of scores are expected to fall within one SD of the average. A very well prepared (or very under prepared) class will yield a small SD. A mixed class will yield a large SD with students with both very high and very low scores (many A-B and D-F, with few C grades).
Item Discrimination: A discriminating question groups those who know (high scoring students) into one group and those who do not know (low scoring students) into another group. Every classroom test needs about ten of these to produce a grade distribution where one SD is ten percentage points (a ten point range for each grade).
Test Reliability: A test has high reliability when the results are highly reproducible. Standardized tests, therefore, use only discriminating questions. They rarely ask a question that almost all students can answer correctly. Traditional multiple-choice, therefore, does not assess what students actually know and value. Traditional standardized tests can only rank students.
Standard Error of Measurement (SEM): Theoretically, 2/3 of the time a student retakes the same test, the scores are expected to fall within one SEM of the average. The SEM value fits inside the range of the SD. “Jimmy, you failed the test, but based on your test score and your luck on test day, each time you retake the test, you have a 20% expectation of passing without doing any more studying.” The SEM precision is based on the reliability of the entire test.
Conditional Standard Error of Measurement (CSEM): The CSEM is based (conditioned) on each test score. This refinement in precision is a recent addition to traditional multiple-choice analysis. It has been a part of the Rasch model IRT analysis for decades.
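Several of the statistics in the list above can be computed from a table of right (1) and wrong (0) marks. The sketch below uses common textbook forms (population SD, KR-20 for reliability, and SEM = SD × √(1 − reliability)); these are assumptions for illustration and may differ in detail from the calculations inside VESE. The function name `summarize_marks` and the tiny data set are made up.

```python
import math
from statistics import mean, pvariance


def summarize_marks(matrix, options=4):
    """matrix[s][i] is 1 if student s marked item i right, else 0."""
    n_items = len(matrix[0])
    scores = [sum(row) for row in matrix]      # Count: right marks per student
    average = mean(scores)                     # Average
    variance = pvariance(scores)               # population variance of scores
    if variance == 0:
        raise ValueError("all scores are equal; reliability is undefined")
    sd = math.sqrt(variance)                   # Standard deviation
    # Item difficulty: proportion of students marking each item right.
    p = [mean(row[i] for row in matrix) for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Test reliability, KR-20 form (an assumed choice of estimate).
    kr20 = (n_items / (n_items - 1)) * (1 - sum_pq / variance)
    # Standard error of measurement for the whole test.
    sem = sd * math.sqrt(1 - kr20)
    return {
        "scores": scores,
        "average": average,
        "sd": sd,
        "kr20": kr20,
        "sem": sem,
        # Expected score from marking every item at random.
        "chance_pct": 100 / options,
    }


marks = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(summarize_marks(marks))
```

The `chance_pct` entry is the 25% starting point mentioned under Count for 4-option questions.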

Even the CSEM cannot clean up the damage done by forcing students to mark every question even when they cannot read or do not understand the question. Knowledge and Judgment Scoring and the partial credit Rasch model do not have this flaw. Both accommodate students functioning at all levels of thinking and all levels of preparation. These two scoring methods are in tune with the objectives of the CCSS movement.


Visual Education Statistics - Conditional Standard Error of Measurement

2013-10-02T04:00:00.000-07:00

21

[[Second Pass, 8 July 2014. Equation 6.3 (cited below) in Statistical Test Theory for the Behavioral Sciences by Dato N.M. de Gruijter and Leo J. Th. van der Kamp, 2008, is the same as the calculation used in Table 29, in my 9 July 2014 post. On the following page they mention that the error variance is higher in the center and lower at the extremes. That distribution is the green curve on Chart 73. I did not see this relationship in the equation when this post was first posted, but do now in the visualized mathematical model (Chart 73).

Also the discussion of Table 24 has been updated to match the terms and values in Table 24.]]

Working on the conditional standard error of measurement (CSEM) is new territory for me. I always associated the CSEM with the Rasch model IRT analysis commonly used by state departments of education when scoring NCLB tests. I first had to Google for basic information.

If you are interested in the details, please check out these sources for sample (n-1) equations: (Equation 6.14, which corrects the relative variance, was not included in the 2005 edition that preceded the current 2008 edition. This represents significant progress in applying test precision.)

Absolute Error Variance Equation 5.39 p. 73
Relative Error Variance Equation 6.3 p. 83
Corrected Relative Variance Equation 6.14 p. 91 or GED Equation 3 p. 9

My first surprise was to find I had already calculated the CSEM for the Nursing124 data when I put up Post 5 of this series (in Table 8. Interactions with Columns [Items] Variance, MEAN SS = 3.33) as I discovered five ways to harvest the variance [mean sum of squares (MSS)]. Equation 6.3 n, Table 22, produces the same result (test SEM = 1.75) when it divides by n [unknown population] rather than n-1 [observed sample].

[n = the item count. Test SEM = AVERAGE(CSEM).]

I then used what I learned in the last post to tabulate the data and obtain the conditional error variance for student scores (Table 23a). The 21 items in Table 22 became the number of right marks on each of 11 item difficulties on Table 23a. The values in this tabulation were then converted into frequencies conditional on the student scores; the sum of which added to one, for each score (Table 23b).

The absolute error variance for each score was computed by Excel (=Var.P). Multiplying the absolute error variance (0.14382) by the square of the item count (21^2) yields the relative error variance (63.42). [Equation 5.39 (0.14382) * n^2 = Equation 6.3 (63.42)] The square root of the relative error variance of each score yields the CSEM for that score. [An alternate calculation of the absolute error variance is shaded in Table 23b. Here the variance was calculated first and that value divided by the squared score to obtain the absolute error variance. This helps explain multiplying the absolute error variance by the squared item count to obtain the relative error variance for each score.]

The conditional frequency estimated test SEM was 1.68 (Table 23b). The conditional frequency CSEM values for each score were different for students with the same score. The CSEM values had to be averaged to get results comparable with the other analyses. These values generated an irregular curve, unlike the smooth curve for the other analyses (Chart 61). The conditional frequency CSEM analysis is sensitive to the number of items with the same difficulty (yellow bars alternate for each change in value, Table 23b). The other analyses are not sensitive to item difficulty (yellow bars, in Table 22, include all students with the same score).

Complete curves were generated from Equation 6.3 for n-1 and for GED n-1 (Table 24). The GED n-1 analysis includes a correction factor (cf) for the range of item difficulties on the test [cf = (1- KR20)/(1-KR21)]. This factor is equal to one if all items are of equal difficulty. For the Nursing124 data it was 1.59; the difficulties ranged from 45% to 95%, from the middle of the total possible distribution to one extreme.

The CSEM values from the six analyses are listed in Table 24. Five are fairly close to one another. The GED n-1, with a correction for the range of item difficulties, is far different from the other five (Chart 61). Values could not be created for the full curve for conditional frequencies as you must actually have student marks to calculate conditional frequency CSEM values. The gray area shows the values calculated from an equation for which there were no actual data. Equations produce nice looking, “look right”, reports.

The CSEM improves the reportable precision on this test over using the test SEM. Good judgment (best practice) is to correct the CSEM values as done on the GED n-1 analysis.

[I did not transform the raw test score mean of 16.8 or 79.8% to a scale score of 50% as was done by Setzer, 2009, GED, p. 6 and Tables 2 and 3. The GED n-1 raw score cut point was 60% which is comparable to most classroom tests. If 25% of the score is from luck on test day that leaves 35% for what a student marked right as something known or could be done, as a worst case. If half of the lucky marks were also something the student knew or could do, the split would be about 10% for luck on test day and 50% for student ability.]

In Table 24, the GED n-1 analysis test SEM of 2.98 for the Nursing124 data is, as a range, 2.98/21 or 14.19%. For the uncorrected Equation 6.3 n-1 analysis, 1.79, the range is 1.79/21 or 8.52%. The n SEM was 1.75 or 7.95%. The n SEM range, 1.75, fits within the uncorrected n - 1 test SEM value, 1.79. The corrected GED n-1 test SEM value, 2.98, exceeds it.

Student score CSEM values are even more sensitive than the test SEM values. The maximum range for the GED n-1 analysis is 3.73 or 3.73/21 or 17.76% and for the Equation 6.3 n-1 analysis 2.35 or 11.19%. Both are beyond the maximum n CSEM value of 2.29 or 10.41%. This low quality set of data fails to qualify as a means of setting classroom grades or a standardized test cut score.
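The uncorrected maxima quoted above can be reproduced with a short sketch. It assumes that Equation 6.3 takes the binomial error model form, error variance = x(n − x)/(n − 1) for a raw score x on n items; that form is an assumption made here rather than a quotation from the book, although it does return 2.35 (n-1) and 2.29 (n) for a 21-item test. The function name `binomial_csem` is hypothetical.

```python
import math


def binomial_csem(score: int, n_items: int, sample: bool = True) -> float:
    """CSEM for a raw score under the assumed binomial error model.

    sample=True divides by n - 1 (observed sample);
    sample=False divides by n (unknown population).
    """
    if not 0 <= score <= n_items:
        raise ValueError("score must lie between 0 and the item count")
    divisor = n_items - 1 if sample else n_items
    return math.sqrt(score * (n_items - score) / divisor)


n = 21
curve_n1 = [binomial_csem(x, n) for x in range(n + 1)]
curve_n = [binomial_csem(x, n, sample=False) for x in range(n + 1)]
print(round(max(curve_n1), 2))  # 2.35
print(round(max(curve_n), 2))   # 2.29
```

The curve is highest for scores near the middle of the range and falls to zero at scores of 0 and 21, which matches the shape described in the Second Pass note at the top of this post.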

[However, the classroom rule of 75% for passing the course and the rule for grades set at 10 percentage points overrule these statistics. Here is a good example that test statistics have meaning only in relation to how they are used. If the process of data reduction and reporting is not transparent, the resulting statistics are suspect and can produce extended debates over a passing score in the classroom.]

The CSEM for each student score does improve test precision. It can be calculated in several ways with close agreement. But it cannot improve the quality of the student marks on the answer sheets made under traditional, forced-choice, multiple-choice rules. These tests only rank students by the number of right marks. They do not ask students, or allow students to report, what they really know or can do; their judgment in using what they know or can do.

The CCSS movement is now promoting learning at higher levels of thinking (problem solving) with, from what I have learned, some de-emphasis on the lower levels of thinking that are the foundation for higher levels of thinking. A successful student cycles through all levels of thinking, as needed. Yet half of the CCSS testing will be at the lowest levels of thinking, traditional multiple-choice scoring. The other half will be as much of an overkill as traditional multiple-choice is an underkill in assessing student knowledge, skills, and student development to learn and apply their abilities. Others share this concern that centralized politics (and dollars) will continue to overshadow the reality of the classroom.

There is a middle ground that makes every question function at higher levels of thinking, allows students to report what is meaningful, of value, and empowering, and has the speed, low cost, and precision of traditional multiple-choice. Knowledge and Judgment Scoring and partial credit Rasch model IRT are two examples. They both accommodate students functioning at all levels of thinking. Lower ability students do not have to guess their way through a test. With routine use, both can turn passive pupils into self-correcting highly successful achievers in the classroom. If you are really into mastery learning, you can also try something like Knowledge Factor.
