Wednesday, May 13, 2015

Information and Reliability

How does IRT information replace CTT reliability? Can this be found on the audit tool (Table 45)?

This post relates my audit tool, Table 45, Comparison of Conditional Error of Measurement between Normal [CTT] Classroom Calculation and the IRT Model to a quote from Wikipedia (Information). I am confident that the math is correct. I need to clarify the concepts for which the math is making estimates.

Table 45
“One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally, reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score variance.”

See test reliability (a ratio), KR20, True/Total Variance, 0.29 (Table 45a).
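
For readers who want to see what sits behind a KR20 value such as the 0.29 above, here is a minimal sketch in Python. The small mark matrix is made up for illustration (it is not the Table 45 classroom data), and the function name kr20 is hypothetical. It uses population variances, the same choice as Excel's =VAR.P.

```python
from statistics import pvariance

def kr20(marks):
    """KR20 for a list of student rows of 0/1 marks (population variances)."""
    n_students = len(marks)
    k = len(marks[0])                      # number of items
    scores = [sum(row) for row in marks]   # student raw scores
    total_var = pvariance(scores)          # observed (total) score variance
    # Sum of item variances p*q, where p is the proportion marking the item right
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in marks) / n_students
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical data: 4 students by 3 items
marks = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(marks), 2))
```

The made-up data are deliberately tidy, so the result is far higher than a real classroom test would usually give.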

“This index is helpful in characterizing a test’s average reliability, for example in order to compare two tests.”

The test reliabilities for CTT and IRT are also comparable in Tables 45a and 45c: 0.29 and 0.27.

“But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the test’s range, for example, generally have more error associated with them than scores closer to the middle of the range.”

Table 46
Chart 82
See Table 45c (classroom data) and Table 46, col 9-10 (dummy data). For CTT the values are inverted (Chart 82, classroom data and Chart 89, dummy data).

Chart 89
“Item response theory advances the concept of item and test information. Information is also a function of the model parameters. For example, according to Fisher information theory, the item information supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or,  . . .” [I = pq]. 

See Table 45c, p*q CELL INFORMATION (classroom data). Also on Chart 89, the cell variance (CTT) and cell information (IRT) have identical values (0.15) from Excel =VAR.P and from pq (Table 46, col 7, dummy data).

“The standard error of estimation (SE) is the reciprocal of the test information of at a given trait level, is the . . .” [1/SQRT(pq)].

Is the “test information … at a given trait level” the Score Information (3.24, red, Chart 89, dummy data) for 17 right out of 21 items? Then the reciprocal of 3.24 is 0.31, the error variance (green, Chart 89 and Table 46, col 9) in measures on a logit scale. And the IRT conditional error of estimation (SE) would be the square root: SQRT(0.31) = 0.56 in measures. And this inverted would yield the CTT CSEM: 1/0.56 = 1.80 in counts.

[[Or the SQRT(SUM(p*q)) = SQRT((0.15) * 21) = SQRT(3.24) = 1.80 (in counts) and the reciprocal is 1/1.80 = 0.56 in measures.]]
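
The chain of arithmetic above can be checked with a short Python sketch. It only uses the numbers already given (17 right out of 21 items). The variable names are mine, and the values in the comments are the rounded figures quoted in the text.

```python
from math import sqrt

right, items = 17, 21            # dummy-data example: 17 right out of 21
p = right / items                # probability of a right mark (0.81)
q = 1 - p                        # probability of a wrong mark
cell_info = p * q                # cell information = cell variance (0.15)
score_info = cell_info * items   # score information (3.24)
error_var = 1 / score_info       # IRT error variance in measures (0.31)
se = sqrt(error_var)             # IRT SE in measures (0.56)
csem = 1 / se                    # CTT CSEM in counts (1.80)

print(f"{cell_info:.2f} {score_info:.2f} {error_var:.2f} {se:.2f} {csem:.2f}")
```

The last step is the same as taking the square root of the score information directly, which is why the two routes give the same 1.80.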

The IRT (CSEM) in Chart 89 is really the IRT standard error of estimation (SE or SEE). On Table 45c, the CSEM (SQRT) is also the SE (conditional error of estimation) obtained from the square root of the error variance for that ability level (17 right, 1.73 measures, or 0.81 or 81%).

“Thus more information implies less error of measurement.”

See Table 45c, CSEM, green, and Table 46, col 9-10.

“In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow information functions; they contribute greatly but over a narrow range. Less discriminating items provide less information but over a wider range.”

Chart 92
Table 47
The same generality applies to the item information functions (IIFs) in Chart 92, but it is not very evident. The item with a difficulty of 10 (IIF = 1.80, Table 47) is also highly discriminating. The two easiest items had negative discrimination; they show an increase in information as student ability decreases toward zero measure. The generality applies best near the average test raw score of 50% or zero measure, which is not on the chart (no student got a score of 50% on this test).

This test had an average test score of 80%. This has spread the item information function curves out (Chart 92). They are not centered on the raw score of 50% or the zero location in measures. However, each peaks near the point where item difficulty in measures is close to student ability in measures. This observation is critical in establishing the value of IRT item analysis and how it is used. This makes sense in measures (a scale based on the natural log of the ratio of right to wrong marks) but not in raw scores (normal linear scale), as I first posted in Chart 75 with only count and percent scales.
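
A minimal Python sketch of why an item information curve peaks where ability and difficulty meet. It assumes the plain Rasch model, in which every item has the same discrimination, so it will not reproduce the different curve heights in Chart 92. The item difficulty of 1.5 measures and the function names are hypothetical.

```python
from math import exp

def rasch_p(ability, difficulty):
    """Rasch probability of a right mark; both arguments are in logits."""
    return 1 / (1 + exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Item information p*q at a given ability."""
    p = rasch_p(ability, difficulty)
    return p * (1 - p)

difficulty = 1.5                             # hypothetical item difficulty in measures
abilities = [x / 2 for x in range(-6, 9)]    # -3.0 to 4.0 in half-logit steps
peak = max(abilities, key=lambda a: item_information(a, difficulty))
print(peak, item_information(peak, difficulty))
```

The peak lands on the ability that equals the item difficulty, where p = q = 0.5 and the information is at its maximum of 0.25.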

“Plots of item information can be used to see how much information an item contributes and to what portion of the scale score range.”

This is very evident in Table 47 and Chart 92.

“Because of local independence, item information functions are additive.”

See Test SEM (in Measures), Winsteps Table 17.1 MODEL S.E. MEAN (identical) = 0.64, Table 45c.

“Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large item bank, test information functions can be shaped to control measurement error very precisely.”
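
The additivity in the quote can be sketched in a few lines of Python. This assumes the plain Rasch model and a hypothetical set of five item difficulties (not the items in Table 47). The function names are my own.

```python
from math import exp, sqrt

def rasch_p(ability, difficulty):
    """Rasch probability of a right mark; both arguments are in logits."""
    return 1 / (1 + exp(-(ability - difficulty)))

def total_information(ability, difficulties):
    """Test information: the sum of the item informations p*q."""
    total = 0.0
    for d in difficulties:
        p = rasch_p(ability, d)
        total += p * (1 - p)
    return total

def standard_error(ability, difficulties):
    """SE in measures: reciprocal of the square root of test information."""
    return 1 / sqrt(total_information(ability, difficulties))

difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical item difficulties
for ability in (-2.0, 0.0, 2.0):
    info = total_information(ability, difficulties)
    print(ability, round(info, 2), round(standard_error(ability, difficulties), 2))
```

With the items clustered around zero, the information is highest and the standard error lowest for a student at zero measures. Choosing items from a bank amounts to choosing where this low-error region sits.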

“Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification.”

At this point my audit tool, Table 45, falls silent. These two mathematical models are a means for only estimating theoretical values; they are not the theoretical values nor are they the reasoning behind them. CTT starts from observed values and projects into the general environment. IRT can start with the perfect Rasch model and select observations that fit the model. The two models are looking in opposite directions. CTT uses a linear scale with the origin at zero counts. IRT sets its log ratio point-of-origin (zero) at the 50% CTT point. I must accept the concept that CTT is a simplification of IRT on the basis of authority at this point.

“In the place of reliability, IRT offers the test information function which shows the degree of precision at different values of theta, [student ability].”

I would word this, “In ADDITION to reliability,” (Table 45a, CTT = 0.29 and 45c, IRT = 0.27). Also the “IRT offers the ITEM information function which shows the degree of precision at different values . . .”

“These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of ability by including carefully chose items. For example, in a certification situation in which a test can only be passed or failed, where there is only a single “cutscore,” and where the actually passing score is unimportant, a very efficient test can be developed by selecting only items that have high information near the cutscore. These items generally correspond to items whose difficulty is about the same as that of the cutscore.”

The eleven items in Table 47 and Chart 92 each peak near the point where item difficulty in measures is close to student ability in measures. The discovery or invention of this relationship is the key advantage of IRT over CTT.

These data show that a test item need not have (a commonly recommended) average score near 50% for useable results. Any cutscore from 50% to 80% would produce useable results on this test with an average score of 80% and cutscore (passing) of 70%.

"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."

My understanding is that with CTT an item may be 50% difficult for the class without revealing how difficult it is for each student (no location). With IRT every item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).

I do not know what part of IRT is invention and what part is discovery on the part of some ingenious people. Two basic parts had to be fit together: information and measures by way of an inversion. Then a story had to be created to market the finished product; the Rasch model and Winsteps (full and partial credit) are the limit of my experience. The unfortunate name choice of “partial credit” rather than knowledge or skill and judgment may have been a factor in the Rasch partial credit model not becoming popular. The name, partial credit, falls into the realm of psychometrician tools. The name, Knowledge and Judgment, falls into the realm of classroom tools needed to guide the development of scholars as well as obtain maximum information from paper standardized tests, where students individually customize their tests (accurately, honestly, and fairly), rather than CAT, where the test is tailored to fit the student using best-guess, dated, and questionable second hand information.

IRT makes CAT possible. Please see "Adaptive Testing Evolves to Assess Common-Core Skills" for current marketing, use, and a list of comments, including two of mine. The exaggerated claims of test makers to assess and promote developing students by the continued use of forced-choice lower level of thinking tests continue to be ignored in the marketing of these tests to assess Common Core skills. Increased precision of nonsense still takes precedence over an assessment that is compatible with and supports the classroom and scholarship.

Serious mastery: Knowledge Factor.
Student development: Knowledge and Judgment Scoring (Free Power Up Plus) and IRT Rasch partial credit (Free Ministep).
Ranking: Forced-choice on paper or CAT.

Wednesday, April 8, 2015

CTT and Rasch IRT Item Analysis Paradox

[The solution is in Chart 89, Item Analysis flow sheet.]

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision” in “Reliability and separation of measures.” A more complete discussion is given under the title, “Standard Errors and Reliabilities: Rasch and Raw Score”.

Chart 82
The apparent paradox is graphed in Chart 82. Precision on one scale is the inverse or reciprocal of the other: 1/0.44 = 2.27 and 1/2.27 = 0.44.

Table 45
I edited Table 32 to disclose a full development of a comparison between CTT and IRT using real classroom data (Table 45). This first view is too complicated.
Chart 83
Chart 83 (CTT) and Chart 84 (IRT) summarize the statistics behind Table 45.

Chart 84
Table 45 includes the process of combining student scores and item difficulties onto one logit scale.

Table 46
I then isolated the item analysis from the complete development above by skipping the formation of a single scale from real classroom data. Instead, I fed the IRT item analysis a percent (dummy) data set (Table 46) with the same number of items as in the classroom test (21 items). I then graphed the data strings in Table 46 as a second, simpler, view of IRT item analysis.

Chart 85
Turning right counts (Chart 85, blue) into a right/wrong ratio string (red) yields a very different shape than a straight line right mark count. We now have the rate at which each mark completes a perfect score of 21 or 100%. It starts slow (1/20), with the last mark racing 20 times (20/1) the average rate (10/11 or 11/10, near 1, in Table 46, col 2).

Taking the natural log of the ratio (a logit, Table 46, col 3) creates the Rasch model IRT characteristic curve (Chart 85, purple) with the zero logit point of origin positioned at the 50% normal value. [Ratios and log ratios have no dimensions.]
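
The two steps just described (right count to right/wrong ratio, then ratio to logit) can be sketched in Python for a 21-item test. The function name logit_table is hypothetical. Scores of 0 and 21 are left out because their ratios have no finite log.

```python
from math import log

def logit_table(items):
    """Rows of (right count, right/wrong ratio, logit) for scores 1..items-1."""
    rows = []
    for right in range(1, items):
        wrong = items - right
        ratio = right / wrong        # right/wrong ratio (Table 46, col 2)
        rows.append((right, ratio, log(ratio)))   # natural log = logit (col 3)
    return rows

table = logit_table(21)
for right, ratio, logit in table:
    print(f"{right:2d}  {ratio:6.2f}  {logit:6.2f}")
```

With 21 items no score lands exactly on 50%, so the ratio passes through 10/11 and 11/10, and the logits for 10 and 11 right sit just below and just above zero.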

Chart 86
Winsteps, at this point, has reduced student raw scores and item difficulties (in counts) into one logit scale of student ability and item difficulty with the dimension of a measure. These are then combined into the probability of a right answer to start the item analysis. The percent (dummy) input (Table 46, col 6) replaces this operation (Chart 86). This simplifies the current discussion to just item analysis and precision.

Chart 87
Percent input and Information for one central cell are plotted in Chart 87. Cell information is limited to a maximum of 0.25 at a student raw score of 50% (Table 46, col 7), when combining p*q (0.50 * 0.50 = 0.25 ). The next step is to adjust the cell information for 21 items on the test (Column 8).
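
A short Python sketch of the cell information and its adjustment for 21 items. The percent inputs are an evenly spaced dummy series chosen for illustration, and the function name is hypothetical.

```python
ITEMS = 21   # number of items on the test

def cell_information(p):
    """Information for one cell: probability right times probability wrong."""
    return p * (1 - p)

for pct in range(10, 100, 10):
    p = pct / 100
    info = cell_information(p)
    # percent input, cell information, information adjusted for 21 items
    print(f"{pct:3d}%  {info:.4f}  {info * ITEMS:.2f}")
```

The cell information rises to its ceiling of 0.25 at 50% and falls away symmetrically on either side.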

Chart 88
Chart 88 completes the comparison of CTT and IRT calculations on Table 46. The inversion of Information (col 9) yields the error variance that aligns with student score measures such that the greatest precision (smallest error variance) is at the point of origin of the logit scale. The square root of the error variance (col 10) yields the CSEM equivalent for IRT measures. And then, by a second inversion these measure values are transformed into the identical normal CSEM values (col 11 - 12) for a CTT item analysis. The total view in Table 45 was too complicated. Charts 85 – 88 are also.
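
The two inversions can be followed score by score with a minimal Python sketch. It rebuilds the calculation from the steps described here rather than from the spreadsheet itself, so the function name and column order are my own assumptions.

```python
from math import sqrt

def inversion_table(items):
    """Rows of (right, score information, error variance, SE, CSEM)."""
    rows = []
    for right in range(1, items):           # 0 and full marks carry no information
        p = right / items
        score_info = p * (1 - p) * items    # information adjusted for all items
        error_var = 1 / score_info          # first inversion: error variance
        se_measures = sqrt(error_var)       # IRT SE in measures
        csem_counts = 1 / se_measures       # second inversion: CTT CSEM in counts
        rows.append((right, score_info, error_var, se_measures, csem_counts))
    return rows

rows = inversion_table(21)
for right, info, err, se, csem in rows:
    print(f"{right:2d} {info:5.2f} {err:5.2f} {se:5.2f} {csem:5.2f}")
```

Reading down the output, the SE in measures is smallest for the middle scores, while the CSEM in counts is largest there. That is the inversion in one table.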

Chart 89
My third, simple, and last view is a flowchart (Chart 89) constructed from the above charts and tables.

The percent (dummy) data produce identical (1.80) standard error of measurement (CSEM) results with CTT and IRT item analysis (Table 46, col 11 - 12 and Chart 89) even though CTT starts with a raw score count (17), and skips the score mean (0.81), and the IRT item analysis starts with a score mean (0.81).

CTT captures the variation (in marks) within a student score in the variance (0.15); IRT captures the variation (in probabilities) as information (0.15). In all cases the score variance and score information are treated with the square root (SQRT, pink) to yield standard errors (estimates of precision: CTT CSEM, on a normal scale in counts, and IRT (CSEM) on a logit scale in measures).

In summary, as CTT score variance and IRT score information (red) increase, CSEM increases on a normal scale (Chart 89). Precision decreases.  At the same time IRT error variance (green) and IRT (CSEM) decrease on a logit scale. Precision increases with respect to the Rasch model point of origin zero (50% on a normal scale). This inversion aligns the IRT (CSEM) to student scores in measures on a logit scale.

It appears that the meaning of this depends upon what is being measured and how well it is being measured. CTT measures in counts and sets error (based on the score variance, Chart 89, red) about the student score count on a normal scale (CSEM). IRT converts counts to “measures”. IRT then measures in “measures” and sets error (based on the error variance, Chart 89, green) about the point of origin (zero) on a logit scale that corresponds to 50% on a normal scale.

Chart 90
The two methods of feeding an item analysis are using two different reference points. This was easier to see when I took the core out of Chart 88 and plotted it in a more common form in Chart 90. Precision on both scales is shown in solid black. This line intersects the Rasch model IRT characteristic curve where normal is 50% and IRT is zero. At a count of 17 right, the normal scale shows higher precision; the logit scale shows lower precision with respect to the perfect Rasch model.

The characteristic curve is a collection of points where student ability and item difficulties match resulting in students with this ability getting 50% right answers with items with matching difficulties. This situation exists for CTT only at the average test score (mean).
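
A minimal Python sketch of the matching idea, assuming the standard Rasch formula for the probability of a right mark. The locations tried are arbitrary examples, and the function name is hypothetical.

```python
from math import exp

def rasch_p(ability, difficulty):
    """Rasch probability of a right mark; both arguments are in logits."""
    return 1 / (1 + exp(-(ability - difficulty)))

# Wherever ability and difficulty sit at the same location, the probability is 0.5
for location in (-2.0, 0.0, 1.73):
    print(location, rasch_p(location, location))

# A student one logit above the item difficulty has better than even odds
print(round(rasch_p(1.0, 0.0), 2))
```

The 50% point travels with the student along the logit scale. With CTT there is only the one class average.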

[The slope of the test characteristic curve is given as the inverse of the raw score error variance (3.24, red, Chart 88 - 89, and Table 46).]

Chart 91
Chart 91 applies the above thinking to real classroom data (Table 45c). This time the average score was not at 50% but at 81%. The lowest student score on Table 45c was 12 (57%).

In a lost reference, I have read that at the 50% point students do not know anything; it is all chance. I can see that for true-false. That could put CTT and IRT in conflict. A student must know something to earn a score of 50% when there are four options to each item. There is a free 25%. The student must supply the remaining 25%. Also few CTT tests are filled with items that have maximum discrimination and precision. A high quality CTT test can look very much like a high quality IRT test. The difference is that the IRT test item analysis takes more into the calculations than the CTT test when offered as forced-choice (a cheap way to rank students) or as with knowledge and judgment scoring (where students report what they actually know and find meaningful and useful; the basis for effective teaching).

Historically, test reliability was the chief marketing point of standardized tests. In the past decade the precision of individual student scores has replaced test reliability. IRT (CSEM) provides a more marketable product along with promoting the sale of equipment and related CAT services. Again psychometricians on the backside are continuing to support and lend credibility to the claims from the sales office on the front end.

Wednesday, March 11, 2015

Modernizing Standardized Test Scores

A single standardized right-count score (RCS) has little meaning beyond a ranking. A knowledge and judgment score (KJS) from the same set of questions not only tells us how much the student may know or can do but also the judgment to make use of that knowledge and skill. A student with a RCS must be told what he/she knows or can do. A student with a KJS tells the teacher or test maker what he/she knows. A RCS becomes a token in a federally sponsored political game. A KJS is a base onto which students build further learning and teachers build further instruction.

Table 40. RCS
Table 41. KJS
The previous two posts dealt with student ability during the test. This one looks at the score after the test. I developed four runs of the Visual Education Statistics Engine: Table 40. RCS, Table 41. KJS (simulated), and after maximizing item discrimination, Table 42. RCSmax, and Table 43. KJSmax. 

Table 42. RCSmax
Table 43. KJSmax
Test reliability and the standard error of measurement (SEM) with some related statistics are gathered into Table 44. The reliability and SEM values are plotted on Chart 81 below.

Table 44
Students, on average, can reduce their wrong marks by about one half when they at first switch to knowledge and judgment scoring. The most obvious effect of changing 24 of 48 zeros to a value of 0.5 to simulate Knowledge and Judgment Scoring (KJS) was to reduce test reliability (0.36, red). Scoring both quantity and quality also increased the average test score from 64% to 73%.
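
The simulation step (changing half of the zeros to 0.5) can be sketched in Python. The mark matrix below is made up and much smaller than the real data. Which zeros to change is not stated above, so picking them at random with a fixed seed is my assumption, as are the function names.

```python
import random

def simulate_kjs(marks, fraction=0.5, seed=0):
    """Return a copy of marks with a fraction of the zeros changed to 0.5."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    zeros = [(i, j) for i, row in enumerate(marks)
             for j, mark in enumerate(row) if mark == 0]
    chosen = rng.sample(zeros, int(len(zeros) * fraction))
    new_marks = [list(row) for row in marks]
    for i, j in chosen:
        new_marks[i][j] = 0.5            # wrong mark becomes an omit
    return new_marks

def average_score(marks):
    """Average test score as a proportion of all cells."""
    cells = sum(len(row) for row in marks)
    return sum(sum(row) for row in marks) / cells

# Hypothetical right-count data: 4 students by 5 items, 8 wrong marks
rcs = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
]
kjs = simulate_kjs(rcs)
print(average_score(rcs), average_score(kjs))
```

In this small example four of the eight zeros become 0.5, which lifts the average score by ten percentage points whichever zeros are picked.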

Psychometricians do not like the reduction in test reliability. Standardized paper tests were marketed as “the higher the reliability the better the test”. Marketing has now moved to “the lower the standard error of measurement (SEM), the better the test”, using computers, CAT and online testing (green). The simulated KJS shows a better SEM (10%) in relation to 12% for RCS. By switching current emphasis from test reliability to precision (SEM) KJS now shows a slight advantage to test makers over RCS.
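
Reliability and SEM can move in different directions because the classical SEM also depends on the spread of the scores. Here is a minimal Python sketch, assuming the usual CTT relation SEM = SD * SQRT(1 - reliability). The standard deviations and the first reliability value are hypothetical and are not taken from Table 44.

```python
from math import sqrt

def sem(score_sd, reliability):
    """Classical standard error of measurement, in the same units as score_sd."""
    return score_sd * sqrt(1 - reliability)

# Hypothetical inputs, in percent units: (score SD, test reliability)
wider_spread = sem(17.0, 0.51)     # higher reliability, wider score spread
narrower_spread = sem(12.5, 0.36)  # lower reliability, narrower score spread
print(round(wider_spread, 1), round(narrower_spread, 1))
```

A test with lower reliability can still show the smaller SEM when its scores are packed more closely together.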

Chart 80
Chart 80 shows the general relationships between a right-count score and a KJS. This is Chart 4/4 from the previous post tipped on its side with the 60% passing performance replaced with the average scores of 64% RCS and 73% KJS. Again, KJS is not a giveaway. There is an increase in the score, if the student elects to use his/her judgment. There is also an increase in the ability to know what a student actually knows because the student is given the opportunity to report what is known, not just to mark an answer to every question (even before looking at the test).

Chart 81
Chart 81 expands Chart 80 using the statistics in Table 44. In general there is little difference between a right-count score and a KJS, statistically. What is different is what is known about the student; the full meaning of the score. Right-count scoring delivers a score on a test carefully crafted to deliver a desired on-average test score distribution and cut score. THE TEST IS DESIGNED TO PRODUCE THE DESIRED SCORE DISTRIBUTION. The KJS adds to this the ability to assess what students actually know and can do that is of value to them. The knowledge and judgment score assesses the complete student (quantity and quality).

Knowledge and Judgment Scoring requires appropriate implementation for the maximum effect on student development. In my experience, the switch from RCS must be voluntary to promote student development. It must result in a change in the level of thinking and related study habits where the student assumes responsibility for learning and reporting. At that time students feel comfortable changing scoring methods. They like the quality score. It reassures them that they really can learn and understand.

KJS no longer has a totally negative effect on current psychometrician attempts to sharpen their data reduction tools. But there are still the effects of tradition and project size. The NCLB movement demonstrated this: it failed in part because low performing schools mimicked the standardized tests rather than tending to teaching and learning. Their attempt to succeed was counterproductive. Doing more of the same does not produce different results. These schools could also be expected to mimic standardized tests offering KJS.

The current CCSS movement is based on the need for one test for all in an attempt to get valid comparisons between students, teachers, schools and states. The effect has been gigantic contracts that only a few companies have the capacity to bid on and little competition to modernize their test scoring.

KJS is then a supplement to RCS. It can be offered on standardized tests. As such, it updates the multiple-choice test to its maximum potential, IMHO. KJS can be implemented in the classroom, by testing companies and entrepreneurs who see the mismatch between instruction and assessment.

Knowledge Factor has already done this with their patented learning/assessment system, Amplifire. It can prepare students online for current standardized tests. Power Up Plus is free for paper classroom tests. (Please see the two preceding posts for more details related to student ability during the test).

Wednesday, February 11, 2015

Learning Assessment Responsibilities

Students, teachers, and test makers each have responsibilities that contribute to the meaning of a multiple-choice test score. This post extracts the responsibilities from the four charts in the prior post, Meaningful Multiple-Choice Test Scores, which compare short answer, right-count traditional multiple-choice, and knowledge and judgment scoring (KJS) of both.

Testing looks simple: learn, test, and evaluate. Short answer, multiple-choice, or both with student judgment. Lower levels of thinking, higher levels of thinking, or both as needed. Student ability below, on level, or above grade level. There are many more variables for standardized test makers to worry about in a nearly impossible situation. By the time these have been sanitized from their standardized tests all that remains is a ranking on the test that is of little if any instructional value (unless student judgment is added to the scoring).

Chart 1/4 compares a short answer and a right-count traditional multiple-choice test. The teacher has the most responsibility for the test score when working with pupils at lower levels of thinking (60%). A high quality student functioning at higher levels of thinking could take the responsibility to report what is known or can be done in one pass and then just mark the remainder for the same score (60%). The teacher’s score is based on the subjective interpretation of the student’s work. The student’s score is based on a matching of the subjective interpretation of the test questions with test preparation. [The judgment needed to do this is not recorded in traditional multiple-choice scores.]

Chart 2/4 compares what students are told about multiple-choice tests and what actually takes place. Students are told the starting score is zero. One point is added for each right mark. Wrong or blank answers add nothing. There is no penalty. Mark an answer to every question. As a classroom test, this makes sense if the results are returned in a functional formative assessment environment. Teachers have the responsibility to sum several scores when ranking students for grades.

As a standardized test, the single score is very unfair. Test makers place great emphasis on the right-mark after-test score and the precision of their data reduction tools (for individual questions and for groups of students). They have a responsibility of pointing out that the student on either side of you has an unknowable, different, starting score from chance, let alone your luck on test day. The forced-choice test actually functions as a lottery. Lower scoring students are well aware of this and adjust their sense of responsibility accordingly (in the absence of a judgment or quality score to guide them).

Chart 3/4 compares student performance by quality. Only a student with a well-developed sense of responsibility, or a comparable innate ability, can be expected to function as a high quality, high scoring, student (100% but reported as 60%). A less self-motivated student, or one with less ability, can perform two passes at 100% and 80% to also yield 60%. The typical student, facing a multiple-choice test, will make one pass, marking every question as it comes to earn a quantity, quality, and test score of 60%; a rank of 60%. No one knows which right mark is a right answer.

Teachers and test makers have a responsibility to assess and report individual student quality on multiple-choice tests just as is done on short-answer, essay, project, research, and performance tests. These notes of encouragement and direction provide the same “feel good” effect found in a knowledge and judgment scored quality score when accompanied with a list of what was known or could be done (the right-marked questions).

Chart 4/4 shows knowledge and judgment scoring (KJS) with a five-option question made from a regular four-option question plus omit. Omit replaces “just marking”. A short answer question scored with KJS earns one point for judgment and +/-1 point for right or wrong. An essay question expecting four bits of information (short sentence, relationship, sketch, or chart) earns 4 points for judgment and +/-4 points for an acceptable or not acceptable report. (All fluff, filler, and snow are ignored. Students quickly learn to not waste time on these unless the test is scored at the lowest level of thinking by a “positive” scorer.)

Each student starts with the same multiple-choice score: 50%. Each student stops when each student has customized the test to that student’s preparation. This produces an accurate, honest and fair test score. The quality score provides judgment guidance for students at all levels. It is the best that I know of when operating with paper and pencil. Power Up Plus is a free example. Amplifire refines judgment into confidence using a computer, and now on the Internet. It is just easier to teach a high quality student who knows what he/she knows.

Most teachers I have met question the score of 60% from KJS. How can a student get a score of 60% and only mark 10% of the questions right? Easy. Sum 50% for perfect judgment, 10% for right answers, and NO wrong. Or sum 10% right, 10% right and 10% wrong, and omit 20%. If the student in the example chose to mark 10% right (a few well mastered facts) and then just marked the rest (had no idea how to answer) the resulting score falls below 40% (about 25% wrong). With no judgment, the two methods of scoring (smart and dumb) produce identical test scores. KJS is not a give-away. It is a simple, easy way to update currently used multiple-choice questions to produce an accurate, honest, and fair test score. KJS records what right-count traditional multiple-choice misses (judgment) and what the CCSS movement tries to promote.

Wednesday, January 14, 2015

Meaningful Multiple-Choice Test Scores

The meaning of a multiple-choice test score is determined by several factors in the testing cycle including test creation, test instructions, and the shift from teacher to student being responsible for learning and reporting. Luck-on-test-day, in this discussion, is considered to have similar effects on the following scoring methods.

[Luck-on-test-day includes but is not limited to: test blueprint, question author, item calibration, test creator, teacher, curriculum, standards; classroom, home, and in between, environment; and a little bit of random chance (act of God that psychometricians need to smooth their data).]                             

Three ways of obtaining test scores are compared: open ended short answer, closed ended right-count four-part multiple-choice, and knowledge and judgment scoring (KJS) for both short answer and multiple-choice. These range from familiar manual scoring to what is now easily done with KJS computer software. Each method of scoring has a different starting score with a different meaning. The average customary classroom score of 75% is assumed (60% passing).

Chart 1/4

Open ended short answer scores start with zero and increase with each acceptable answer. There may be several acceptable answers for a single short answer question. The level of thinking required depends upon the stem of the question. There may be an acceptable answer for a question both at lower and at higher levels of thinking. These properties carry over into KJS below.

The teacher or test maker is responsible for scoring the test (Mastery = 60%; + Wrong = 0%; = 60% passing for quantity in Chart 1/4). The quality of the answers can be judged by the scorer and may influence which ones are considered right answers.

The open ended short answer question is flexible (multiple right answers) and with some subjectivity; different scorers are expected to produce similar scores. The average test score is controlled by selecting a set of items that is expected to yield an average test score of 75%. The student test score is a rank based on items included in the test to survey what students were expected to master, to group students who know from those who do not know each item, and items that fail to show mastery or discrimination (unfinished items for a host of reasons including luck-on-test-day above). 

The open ended short answer question can also be scored as a multiple-choice item. First tabulate the answers. Sort the answers from high to low count.  The most frequent answer, on a normal question, will be the right answer option. The next three ranking answers will be real student supplied wrong answer options (rather than test writer created wrong answer options). This pseudo-multiple-choice item can now be printed as a real question on your next multiple-choice test (with answers scrambled).
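
The tabulate-and-sort step can be sketched in Python with a Counter. The answers below are invented for illustration. Lowercasing the answers before counting is my own assumption, and in practice a teacher would still need to merge spelling variants by hand.

```python
from collections import Counter

def build_options(answers, n_options=4):
    """Return (most frequent answer, next most frequent answers) from raw responses."""
    counts = Counter(a.strip().lower() for a in answers)   # tabulate the answers
    ranked = [answer for answer, _ in counts.most_common(n_options)]  # high to low
    return ranked[0], ranked[1:]

# Hypothetical student responses to one short answer question
answers = [
    "Mitochondria", "mitochondria", "nucleus", "ribosome",
    "mitochondria", "nucleus", "chloroplast", "mitochondria",
    "nucleus", "ribosome", "golgi",
]
right_option, wrong_options = build_options(answers)
print(right_option, wrong_options)
```

On a normal question the first item returned is the right answer, and the remaining three are the student-supplied wrong answer options. A question where the most frequent answer is wrong needs a closer look before reuse.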

A high quality student could also mark only right answers on the first pass using the above test (Chart 1/4) and then finish by just marking on the second pass to earn a score of 60%. A lower quality student could just mark each item in order, as is usually done on multiple-choice tests, mixing right and wrong marks, to earn the same score of 60%. Using only a score after the test we cannot see what is taking place during the test. Turning a short answer test into traditional multiple-choice hides student quality, the very thing that the CCSS movement is now promoting.
Chart 2/4

Closed ended right-count four-option multiple-choice scores start with zero and increase with each right mark. Not really!! This is only how this method of scoring has been marketed for a century by only considering a score based on right-counts after the test is completed. In the first place traditional multiple-choice is not multiple-choice, but forced-choice (it lacks one option discussed below). This injects a 25% bonus (on average) at the start of the test (Chart 2/4). This evil flaw in test design was countered, over 50 years ago, by a now defunct “formula scoring”. After forcing students to guess, psychometricians wanted to remove the effect of just marking! It took the SAT until March of last year, 2014, to drop this “score correction”.

[Since there was no way to tell which right answer must be changed for the correction, it made no sense to anyone other than psychometricians wanting to optimize their data reduction tools, with disregard for the effect of the correction on the students taking such a test. Now that 4-option questions have become popular on standardized tests, a student who can eliminate one option can guess from the remaining three for better odds on getting a right mark (which is not necessarily a right answer that reflects recall, understanding, or skill).]
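The post names formula scoring without giving the formula. A minimal sketch, assuming the classic correction-for-guessing form (right count minus wrong count divided by the number of options less one), shows how it cancels, on average, the bonus that blind marking injects. The function names and the 20-item test are hypothetical.

```python
def expected_forced_choice_score(known, n_items, n_options=4):
    """Expected right count when every unknown item is marked blindly."""
    return known + (n_items - known) / n_options

def formula_score(right, wrong, n_options=4):
    """Classic correction for guessing: R - W / (k - 1). Omits cost nothing."""
    return right - wrong / (n_options - 1)

n_items = 20
# A student who knows nothing and marks every four-option item
# expects 5 right marks (the 25% on-average bonus).
bonus = expected_forced_choice_score(known=0, n_items=n_items)
print(bonus / n_items)  # 0.25

# The correction removes that bonus on average: 5 right, 15 wrong -> 0.
print(formula_score(right=5, wrong=15))
```

The correction works only on average; it cannot tell which individual right marks were lucky, which is the objection raised above.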

The closed ended right-count four-option multiple-choice question is inflexible (one right answer) and with no scoring subjectivity; all scorers yield the same count of right marks. Again, the average test score is controlled by selecting a set of items expected to yield 75% on-average (60% passing). However, this 75% is not the same as that for the open ended short answer test. As a forced-choice test, the multiple-choice test will be easier; it starts with a 25% on-average advantage. (That means one student may start with 15% and a classmate with 35%.) To further confound things, the level of thinking used by students can also vary. A forced-choice test can be marked entirely at lower levels of thinking.

[Standardized tests control part of the above problems by eliminating almost all mastery and unfinished items. The game is to use the fewest items that will produce a desired score distribution with an acceptable reliability. A traditional multiple-choice scored standardized test score of 60% is a much more difficult accomplishment than the same score on a classroom test.]

A forced-choice test score is a rank of how well a student did on a test. It is not a report of what a student actually knows or can do that will serve as the basis for further instruction and learning. The reasoning is rather simple: the forced-choice score is counted up AFTER the test is finished; this is the final game score. How the game started (25% on-average) and was played is not observed (but this is what sports fans pay for). This is what students and teachers need to know so students can take responsibility for self-corrective learning.

Chart 3/4
[Three student performances that all end up with a traditional multiple-choice score of 60% are shown in Chart 3/4. The highest quality student used two passes, “I know or can do this or I can eliminate all the wrong options” and “I don’t have a clue”. The next lower quality student used three passes, “I know or can do this”; “I can eliminate one or more answer options before marking” and “I am just marking.” The lowest level of thinking student just marks answers in one pass, right and wrong, as most low quality, lower level of thinking students do. But what takes place during the test is not seen in the score made after the test. The lowest quality student must review all past work (if tests are cumulative) or continue on with an additional burden as a low quality student. A high quality student needs only to check on what has not been learned.]

Chart 4/4

Knowledge and Judgment scores start at 50% for every student plus one point for acceptable and minus one point for not acceptable (right/wrong on traditional multiple-choice). (Lower level of thinking students prefer: Wrong = 0, Omit = 1, and Right  = 2) Omitting an answer is good judgment to report what has yet to be learned or to be done (understood). Omitting keeps the one point for good judgment. An unacceptable or wrong mark is poor judgment. You lose one point for bad judgment.

Now what is hidden with forced-choice scoring is visible with Knowledge and Judgment Scoring (KJS). Each student can show how the game is played. There is a separate student score for quantity and for quality. A starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge Factor sets the starting score near 75%. Judgment is far more important than knowledge in high risk occupations.]

KJS includes a fifth answer option: omit (good judgment to report what has yet to be learned or understood). When this option is not used, the test reverts to forced-choice scoring (marking one of the four answer options for every question).
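The scoring rule can be sketched as follows, using the Wrong = 0, Omit = 1, Right = 2 form. This is an illustration, not the Power Up Plus code: the function name, the 'R'/'W'/'O' mark codes, and the 20-item answer sheets are assumptions, and quality is taken here to be the percent of marked items that are right.

```python
def kjs_scores(marks):
    """marks: list of 'R' (right), 'W' (wrong), 'O' (omit), one per item.

    Returns (KJS score %, quantity %, quality %).
    """
    points = {"W": 0, "O": 1, "R": 2}
    n_items = len(marks)
    right = marks.count("R")
    marked = n_items - marks.count("O")
    # An all-omit sheet earns 1 of 2 points per item: the 50% starting score.
    kjs = 100 * sum(points[m] for m in marks) / (2 * n_items)
    quantity = 100 * right / n_items                       # right-count score
    quality = 100 * right / marked if marked else None     # right marks per mark made
    return kjs, quantity, quality

# Hypothetical 20-item test: 4 right marks, 16 omits.
print(kjs_scores(["R"] * 4 + ["O"] * 16))   # (60.0, 20.0, 100.0)

# No omits: KJS collapses to the forced-choice right-count score.
print(kjs_scores(["R"] * 12 + ["W"] * 8))   # (60.0, 60.0, 60.0)
```

Both sheets earn 60%, but only the first reports a quality of 100%; the second shows that without the omit option the KJS score is the same as the right-count score.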

A high quality student marked 10 right out of 10 marked and then omitted the remainder (in two passes through the test) or managed to do a few of one right and one wrong (three passes) for a passing score of 60% in Chart 4/4. A student of less quality did not omit but just marked for a score of less than 50%. A lower level of thinking, low quality student marked 10 right and just marked the rest (two passes) for a score of less than 40%. KJS yields a score based on student judgment (60%) or on the lack of that judgment (less than 50%).

In summary, the current assessment fad is still oriented on right marks rather than on student judgment (and development). Students with a practiced good judgment develop the sense of responsibility needed to learn at all levels of thinking. They do not have to wait for the teacher to tell them they are right. Learning is stimulated and exhilarating. It is fun to learn when you can question, get answers, and verify a right answer or a new level of understanding; when you can build on your own trusted foundation.

Low quality students learn by repeating the teacher. High quality students learn by making sense of an assignment. Traditional multiple-choice (TMC) assesses and rewards lower-levels-of-thinking. KJS assesses and rewards all-levels-of-thinking. TMC requires little sense of responsibility. KJS rewards (encourages) the sense of responsibility needed to learn at all levels of thinking.

1.     A short answer, hand scored, test score is an indicator of student ability and class ranking based on the scorer’s judgment. The scorer can make a subjective estimate of student quality.

2.     A TMC score is only a rank on a completed test with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.

3.     A KJS test score is based on a student, self-reporting, estimate of what the student knows and can do on a completed test (quantity) and an estimate of the student’s ability to make use of that knowledge (judgment) during the test (quality). The score has student judgment and quality, not scorer judgment and quality.

In short, students who know that they can learn (get rapid feedback on quantity and quality), who want to learn, enjoy learning (see Amplifire below). All testing methods fail to promote these student development characteristics unless the test results are meaningful, easy to use by students and teachers, and timely. Student development requires student performance, not just talking about it or labeling something formative assessment.

Power Up Plus (PUP or PowerUP) scores both TMC and KJS. Students have the option of selecting the method of scoring they are comfortable with. Such standardized tests have the ability to estimate the level of thinking used in the classroom and by each student.  Lack of information, misinformation, misconceptions and cheating can be detected by school, teacher, classroom, and student.

Power Up Plus is hosted at TeachersPayTeachers to share what was learned in a nine year period with 3000 students at NWMSU. The free download below supports individual teachers who want to upgrade their multiple-choice tests for formative, cumulative, and exit ticket assessment. Good teachers, working within the bounds of accepted standards, do not need to rely on expensive assessments. They (and their students) do need fast, easy to use, test results to develop successful high quality students.

I hope your students respond with the same positive enthusiasm that over 90% of mine did. We need to assess students to promote their abilities. We do not need to primarily assess students to promote the development of psychometric tools that yield far less than what is marketed.

A Brief History:

Geoff Masters (1950-    )   A modification of traditional multiple-choice test performance.

Created partial credit scoring for the Rasch model (1982) as a scoring refinement for traditional right-count multiple-choice. It gives partial credit for near right marks. It does not change the meaning of the right-count score (as quantity and quality have the same value by default [both wrong marks and blanks are counted as zeros], only quantity is scored). The routine is free in Ministep software.

Richard A. Hart (1930-    )   Promotes student development by student self-assessment of what each student actually knows and can do, AFTER learning, with “next class period” feedback.

Knowledge and Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to reduce the time needed for students to write, and for me to score, short answer and essay questions. I created software (1981) to score multiple-choice, both right-count, and knowledge and judgment, to encourage students to take responsibility for what they were learning at all levels of thinking in any subject area. Students voted to give knowledge and judgment equal value. The right-count score retains the same meaning (quantity of right marks) as above. The knowledge and judgment score is a composite of the judgment score (quality, the “feel good” score AFTER learning) and the right-count score (quantity). Power Up Plus (2006) is classroom friendly (for students and teachers) and a free download: Smarter Test Scoring and Item Analysis.

Knowledge Factor (1995-    )   Promotes student learning and retention by assessing student knowledge and confidence, DURING learning, with “instant” feedback to develop “feeling good” during learning.

Knowledge Factor was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment program, Amplifire, places much more weight on confidence than on knowledge (a wrong mark may reduce the score by three times as much as a right mark adds). The software leads students through the steps needed to learn easily, quickly and in a depth that is easily retained for more than a year. Students do not have to master the study skills and the sense of responsibility needed to learn at all levels of thinking needed for mastery with KJS. Amplifire is student friendly, online, and so very commercially successful in developed topics that it is not free.

[Judgment and confidence are not the same thing. Judgment is measured by performance (percent of right marks), AFTER learning, at any level of student score. Confidence is a good feeling that Amplifire skillfully uses to promote rapid learning, DURING learning and self-assessment, into a mastery level. Students can take confidence in their practiced and applied self-judgment. The KJS and Amplifire test scores reflect the complete student. IMHO standardized tests should do this also, considering their cost in time and money.]

Wednesday, December 10, 2014

Information Functions - Adding Unbalanced Items

Adding 22 balanced items to Table 33 of 21 items, in the prior post, resulted in a similar average test score (Table 36) and the same item information functions (the added items were duplicates of those in the first Nurse124 data set of 21 items). What happens if an unbalanced set of 6 items is added? I just deleted the 16 high scoring additions from Table 36. Both balanced additions (Table 36) and unbalanced additions (Table 39) had the same extended range of item difficulties (5 to 21 right marks, or 23% to 95% difficulty).

Table 33
Table 36
Table 39

Adding a balanced set of items to the Nurse124 data set kept the average score the same: 80% and 79% (Table 36). Adding a set of more difficult items to the Nurse124 data decreased the average score to 70% (Table 39) and decreased student scores. Traditionally, a student’s overall score is then the average of the three test scores: 80%, 79% and 70% or 76% for an average student (Tables 33, 36, and 39). An estimate of a student’s “ability” is thus directly dependent upon his test scores which are dependent upon the difficulty of the items on each test. This score is accepted as a best estimate of the student’s true score. This value is a best guess of future test scores. This makes common sense, that past is a predictor of future performance.

 [Again a distinction must be made between what is being measured by right mark scoring (0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a test the student may not be able to read or understand. The other also indicates the quality of each student’s knowledge; the ability to make meaningful use of knowledge and skills. Both methods of analysis can use the exact same tests. I continue to wonder why people are still paying full price but harvesting only a portion of the results.]

The Rasch model IRT takes a very different route to “ability”. The very same student mark data sets can be used. Expected IRT student scores are based on the probability that half of all students with a given ability location will correctly mark a question with a comparable difficulty location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.)  [The location starts from the natural log of a ratio of right/wrong score and wrong/right difficulty. A convergence of score and difficulty yields the final location. The 50% test score becomes the zero logit location, the only point right mark scoring and IRT scores are in full agreement.]
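A minimal sketch of the two calculations described above: the natural log of the right/wrong ratio as a starting location, and the Rasch probability of a right mark once ability and difficulty share one logit scale. Only the starting estimate is shown; the convergence step that Winsteps performs is not reproduced, and the function names are made up for the example.

```python
import math

def logit(proportion_right):
    """Natural log of the right/wrong ratio (starting ability location)."""
    return math.log(proportion_right / (1 - proportion_right))

def rasch_probability(ability, difficulty):
    """Probability of a right mark when both locations are in logits."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

# A 50% test score sits at the zero logit location.
print(logit(0.5))                      # 0.0

# When ability and difficulty match, half of such students mark the item right.
print(rasch_probability(0.0, 0.0))     # 0.5

# An 80% score starts near +1.39 logits; an item at 0 logits is then easy.
ability = logit(0.80)
print(round(ability, 2), round(rasch_probability(ability, 0.0), 2))
```

For an item, the starting difficulty location uses the inverse ratio (wrong/right), so an item marked right by 80% of students starts near -1.39 logits.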

The Rasch model IRT converts student scores and item difficulties [in the marginal cells of student data] into the probabilities of a right answer (Table 33b). [The probabilities replace the marks in the central cell field of student data.] It also yields raw student scores and their conditional standard errors of measurement (CSEMs) (Table 33c, 34c, and 39c) based on the probabilities of a right answer rather than the count of right marks. (For more see my Rasch Model Audit blog.)

Student ability becomes fixed and separated from the student test score; a student with a given ability can obtain a range of scores on future tests without affecting his ability location. A calibrated item can yield a range of difficulties on future tests without affecting its difficulty calibrated location. This makes sense only in relation to the trust you can have in the person interpreting IRT results; that person’s skill, knowledge, and (most important) experience at all levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well (do not “look right”) can be eliminated from the data set. Also the same data set (Table 33, Table 36, and Table 39) can be treated differently if it is classified as field test, operational test, benchmark test, or current test.

At this point states recalibrated and creatively equilibrated test results to optimize federal dollars during the NCLB era by showing gradual continuing improvement.  It is time to end the ranking of students by right mark scoring (0,1 scoring) and include KJS, or PCM (0,1,2 scoring) [that about every state education department has: Winsteps], so that standardized testing yields the results needed to guide student development: the main goal of the CCSS movement.

The need to equilibrate a test is an admission of failure. The practice has become “normal” because failure is so common. It opened the door to cheating at state and national levels. [To my knowledge no one has been charged and convicted of a crime for this cheating.] Current computer adaptive testing (CAT) hovers about the 50% level of difficulty. This optimizes psychometric tools. Having a disinterested party outside of the educational community doing the assessment analysis and online CAT reduce the opportunity to cheat. They do not IMHO optimize the usefulness of the test results. End-of-course tests are now molding standardized testing into an instrument to evaluate teacher effectiveness rather than assess student knowledge and judgment (student development).

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, November 12, 2014

Information Functions - Adding Balanced Items

I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information functions (IIF). This post makes use of that observation to improve the Nurse124 data set that generated the set of IIFs in Chart 75.

I observed that Tables 33 and 34, in the prior post, contained no items with difficulties below 45%. The item information functions (IIF) were also skewed (Chart 75). This is not the symmetrical display associated with the Rasch IRT model. I reasoned that adding a balanced set of items would increase the number of IIFs without changing the average item difficulty.

Table 36a shows the addition of a balanced set of 22 items to the Nurse124 data set of 21 items. As each lower ranking item was added, one or more high ranking items were added to keep the average test score near 80%. This table added six lower ranking items and 16 higher scoring items resulting in an average score of 79% and 43 items total.

Table 36
The average item difficulty for the Nurse124 data set was 17.57 and for the expanded set was 17.28. The target average test score of 80% came in as 79%. [I did not take the time to tweak the additions for a better fit.] Both item difficulty and student score (ability) remained about the same.

The conditional standard error of measurement (CSEM) did change with the addition of more items (Chart 79 below). The number of cells containing information expanded from 99 to 204 cells. The average right count student score increased from 17 to 34.

Table 36c shows the resulting item information functions (IIF). The original set of 11 IIFs now contains 17 IIFs (orange). The original set of 9 different student scores now contains 12 different scores, however the range of student scores is comparable between the two sets. This makes sense as the average test scores are similar and the student scores are also about the same.
Table 37
Chart 77

Chart 77 (Table 37) shows the 17 IIFs as they spread across the student ability range of 12 rankings (student score right count/% right). The trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near the average test score of 79%. This was expected as the maximum information value within an IIF occurs when the item difficulty and student ability score match. [The three bottom traces on Chart 77 (blue, red, and green) have been colored in Table 37 as an aid in relating the table and chart (rotate Table 37 counter-clockwise 90 degrees).]
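The statement that an IIF reaches its maximum (0.25) where item difficulty and student ability match can be checked with a short sketch. It assumes the Rasch (1PL) form, where item information is p times q; the ability grid and the item at 0 logits are hypothetical and are not taken from Table 37.

```python
import math

def item_information(ability, difficulty):
    """Rasch item information: p * q at a given ability location (logits)."""
    p = 1 / (1 + math.exp(-(ability - difficulty)))
    return p * (1 - p)

# Trace one IIF across a hypothetical ability range for an item at 0 logits.
difficulty = 0.0
for ability in range(-3, 4):
    info = item_information(ability, difficulty)
    print(f"{ability:+d} logits  information {info:.3f}")

# The peak is 0.25, reached only where ability equals difficulty.
print(item_information(0.0, 0.0))   # 0.25
```

On the logit scale the trace is symmetrical about the item's difficulty; the skew seen in the charts appears when the same values are laid out against right-count scores.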

Even more important is the way the traces are increasingly skewed the further the IIFs are away from this maximum, 11/50%, trace (blue square, Chart 77). Also the IIF with a difficulty of 18/82%, near the average test score, produced the identical total information (1.41) from both the Nurse124 and the supplemented data sets. But these values also drifted apart for the two data sets for IIFs of higher and lower difficulty.

Two IIFs near the 50% difficulty point delivered the maximum information (2.17). Here again is evidence that prompts psychometricians to work closely to the 50% or zero logit point to optimize their tools when working on low quality data (limiting scoring only to right counts rather than also offering students the option to assess their judgment to report what is actually meaningful and useful; to assess their development toward being a successful, independent, high quality achiever). [Students that only need some guidance rather than endless “re-teaching”; that, for the most part, consider right count standardized tests a joke and a waste of time.]
Chart 78

Table 38
The test information function for the supplemented data set is the sum of the information in all 17 item information functions (Table 38 and Chart 78). It took 16 easy items to balance 6 difficult items. The result was a marked increase in precision at the student score levels between 30/70% and 32/74%. [More at Rasch Model Audit blog.]
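Summing item information into test information, and turning that sum into a CSEM in logits (one over the square root of the information), can be sketched as below. The item difficulties are hypothetical, not the Nurse124 values, and the function names are made up for the example.

```python
import math

def rasch_p(ability, difficulty):
    """Rasch probability of a right mark (locations in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def total_information(ability, difficulties):
    """Test information: the sum of p * q over all items at one ability."""
    return sum(rasch_p(ability, b) * (1 - rasch_p(ability, b))
               for b in difficulties)

def csem_logits(ability, difficulties):
    """Conditional standard error of measurement, in logits."""
    return 1.0 / math.sqrt(total_information(ability, difficulties))

# Hypothetical item difficulties in logits.
short_test = [-1.0, 0.0, 1.0]
long_test = short_test * 2   # duplicate every item

for name, items in (("short", short_test), ("long", long_test)):
    print(name, round(total_information(0.0, items), 3),
          round(csem_logits(0.0, items), 3))
```

Duplicating every item doubles the information at each ability location, so the logit CSEM shrinks by a factor of the square root of two.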

Chart 79

Chart 79 summarizes the relationships between the Nurse124 data, the supplemented data (adding a balanced set of items that keeps student ability and item difficulty unchanged), and the CTT and IRT data reduction methods. The IRT logit values (green) were plotted directly and inverted (1/CSEM) for comparison. In general, both CTT (blue) and IRT inverted (red) produced  comparable CSEM values.

Adding 22 items increased the CTT Test SEM from 1.75 to 2.54. The standard deviation (SD) between student test scores increased from 2.07 to 4.46. The relative effect being, 1.75/2.07 and 2.54/4.46, or 84% and 57% with a difference of 27, or an improvement in precision of 27/84 or 32%.
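The arithmetic above can be re-run from the rounded values quoted in the paragraph. The sketch also applies the standard CTT relation SEM = SD × √(1 − reliability) to show what each SEM/SD ratio implies for test reliability; that last step is an addition for illustration and is not part of the original calculation. Small differences from the quoted percentages come from rounding.

```python
def sem_ratio_and_reliability(sem, sd):
    """Return (SEM/SD ratio, reliability implied by SEM = SD * sqrt(1 - r))."""
    ratio = sem / sd
    return ratio, 1 - ratio ** 2

# Rounded values quoted in the text.
nurse124 = sem_ratio_and_reliability(sem=1.75, sd=2.07)
supplemented = sem_ratio_and_reliability(sem=2.54, sd=4.46)

for name, (ratio, reliability) in (("Nurse124", nurse124),
                                   ("Supplemented", supplemented)):
    print(f"{name}: SEM/SD = {100 * ratio:.1f}%, "
          f"implied reliability = {reliability:.2f}")

# Relative change in the ratio (the "improvement in precision").
improvement = (nurse124[0] - supplemented[0]) / nurse124[0]
print(f"{100 * improvement:.0f}%")
```

The SEM itself grew when items were added; it is the SEM relative to the spread of student scores that improved.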

Chart 79 also makes it very obvious that the higher the student test score the lower the CTT CSEM, the more precise the student score measurement, the less error. That makes sense.
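One common CTT estimate of the CSEM is Lord's binomial formula, the square root of x(n − x)/(n − 1) for x right marks on n items. The sketch below assumes that formula; the spreadsheet behind Chart 79 may compute the CSEM differently. The 21-item length matches the Nurse124 test, but the scores listed are only illustrative.

```python
import math

def binomial_csem(right_count, n_items):
    """Lord's binomial CSEM in right-count units: sqrt(x * (n - x) / (n - 1))."""
    return math.sqrt(right_count * (n_items - right_count) / (n_items - 1))

n_items = 21
for right_count in (11, 14, 17, 20):
    percent = 100 * right_count / n_items
    print(f"{right_count}/{n_items} ({percent:.0f}%)  "
          f"CSEM {binomial_csem(right_count, n_items):.2f}")
```

Above the 50% score the binomial CSEM falls steadily as the score rises, which is the pattern described for Chart 79; below 50% the same formula falls again toward a score of zero.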

The above statement about a CTT CSEM must be related to a second statement that the more item information, the greater the precision of measurement by the item at this student score rank. The first statement harvests variance from the central cell field from within rows of student (right) marks (Table 36a) and from rows of probabilities (of right marks) in Table 36c.

The binomial variance CTT CSEM view is then comparable to the reciprocal or inverted (1/CSEM) view of the test information function CSEM view (Chart 79). CTT (blue, CTT Nurse124, Chart 79) and IRT inverted (red, IRT N124 Inverted) produced similar results even with an average test score of 79% that is 29 percentage points away from the 50%, zero logit, IRT optimum performance point.

The second statement harvests variance, item information functions, in Table 36c from columns of probabilities (of right marks). Layering one IIF on top of another across the student score distribution yields the test information function (Chart 78).

The Rasch IRT model harvests the variance from rows and from columns of probabilities of getting a right answer that were generated from the marginal student scores and item difficulties. CTT harvests from the variance of the marks students actually made. Yet, at the count only right mark level, they deliver very similar results, with the exception of the IIF from IRT analysis that the CTT analysis does not do.
