Wednesday, March 11, 2015

Modernizing Standardized Test Scores

                                                             #13
A single standardized right-count score (RCS) has little meaning beyond a ranking. A knowledge and judgment score (KJS) from the same set of questions not only tells us how much the student may know or can do but also how much judgment the student has to make use of that knowledge and skill. A student with an RCS must be told what he/she knows or can do. A student with a KJS tells the teacher or test maker what he/she knows. An RCS becomes a token in a federally sponsored political game. A KJS is a base onto which students build further learning and teachers build further instruction.

Table 40. RCS
Table 41. KJS
The previous two posts dealt with student ability during the test. This one looks at the score after the test. I developed four runs of the Visual Education Statistics Engine: Table 40. RCS, Table 41. KJS (simulated), and after maximizing item discrimination, Table 42. RCSmax, and Table 43. KJSmax. 

Table 42. RCSmax
Table 43. KJSmax
Test reliability and the standard error of measurement (SEM) with some related statistics are gathered into Table 44. The reliability and SEM values are plotted on Chart 81 below.

Table 44
Students, on average, can reduce their wrong marks by about one half when they first switch to knowledge and judgment scoring. The most obvious effect of changing 24 of 48 zeros to a value of 0.5, to simulate Knowledge and Judgment Scoring (KJS), was to reduce test reliability (0.36, red). Scoring both quantity and quality also increased the average test score from 64% to 73%.

Psychometricians do not like the reduction in test reliability. Standardized paper tests were marketed as “the higher the reliability the better the test”. Marketing has now moved to “the lower the standard error of measurement (SEM), the better the test”, using computers, CAT, and online testing (green). The simulated KJS shows a better SEM (10%) than the 12% for RCS. With the current emphasis switching from test reliability to precision (SEM), KJS now offers test makers a slight advantage over RCS.
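A minimal sketch of the simulation described above, using an invented right/wrong mark matrix (not the Nurse124 data or the VESEngine): change about half of the zeros to 0.5 to stand in for well-judged omits, then recompute reliability with coefficient alpha (the general form of KR-20 that accepts the 0.5 values) and the SEM. The numbers it prints will not reproduce the 0.36 reliability or 10% SEM above; only the procedure follows the post.

```python
import random
import statistics

def coefficient_alpha(matrix):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    item_vars = [statistics.pvariance([row[i] for row in matrix]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))

def test_sem(matrix):
    """Standard error of measurement: SD of total scores * sqrt(1 - reliability)."""
    totals = [sum(row) for row in matrix]
    return statistics.pstdev(totals) * (1 - coefficient_alpha(matrix)) ** 0.5

random.seed(2)
# Hypothetical 24-student x 20-item right(1)/wrong(0) matrix with a spread of student abilities.
abilities = [random.uniform(0.4, 0.9) for _ in range(24)]
rcs = [[1 if random.random() < a else 0 for _ in range(20)] for a in abilities]

# Simulate KJS as described above: convert about half of the wrong marks (0) to well-judged omits (0.5).
kjs = [[0.5 if cell == 0 and random.random() < 0.5 else cell for cell in row] for row in rcs]

print('RCS  reliability %.2f  SEM %.2f' % (coefficient_alpha(rcs), test_sem(rcs)))
print('KJS  reliability %.2f  SEM %.2f' % (coefficient_alpha(kjs), test_sem(kjs)))
```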

Chart 80
Chart 80 shows the general relationships between a right-count score and a KJS. It is Chart 4/4 from the previous post, tipped on its side, with the 60% passing performance replaced with the average scores of 64% RCS and 73% KJS. Again, KJS is not a giveaway. There is an increase in the score if the student elects to use his/her judgment. There is also an increase in the ability to know what a student actually knows because the student is given the opportunity to report what is known, not just to mark an answer to every question (even before looking at the test).

Chart 81
Chart 81 expands Chart 80 using the statistics in Table 44. Statistically, there is little difference between a right-count score and a KJS. What is different is what is known about the student: the full meaning of the score. Right-count scoring delivers a score on a test carefully crafted to deliver a desired on-average test score distribution and cut score. THE TEST IS DESIGNED TO PRODUCE THE DESIRED SCORE DISTRIBUTION. KJS adds to this the ability to assess what students actually know and can do that is of value to them. The knowledge and judgment score assesses the complete student (quantity and quality).

Knowledge and Judgment Scoring requires appropriate implementation for the maximum effect on student development. In my experience, the switch from RCS must be voluntary to promote student development. It must result in a change in the level of thinking and related study habits where the student assumes responsibility for learning and reporting. At that time students feel comfortable changing scoring methods. They like the quality score. It reassures them that they really can learn and understand.

KJS no longer has a totally negative effect on current psychometrician attempts to sharpen their data reduction tools. But there are still the effects of tradition and project size. The NCLB movement failed, in part, because low performing schools mimicked the standardized tests rather than tended to teaching and learning. Their attempt to succeed was counterproductive. Doing more of the same does not produce different results. These schools could also be expected to mimic standardized tests offering KJS.

The current CCSS movement is based on the need for one test for all in an attempt to get valid comparisons between students, teachers, schools and states. The effect has been gigantic contracts that only a few companies have the capacity to bid on and little competition to modernize their test scoring.

KJS is then a supplement to RCS. It can be offered on standardized tests. As such, it updates the multiple-choice test to its maximum potential, IMHO. KJS can be implemented in the classroom, by testing companies and entrepreneurs who see the mismatch between instruction and assessment.


Knowledge Factor has already done this with their patented learning/assessment system, Amplifire. It can prepare students online for current standardized tests. Power Up Plus is free for paper classroom tests. (Please see the two preceding posts for more details related to student ability during the test).

Wednesday, February 11, 2015

Learning Assessment Responsibilities

Students, teachers, and test makers each have responsibilities that contribute to the meaning of a multiple-choice test score. This post extracts the responsibilities from the four charts in the prior post, Meaningful Multiple-Choice Test Scores, which compare short answer, right-count traditional multiple-choice, and knowledge and judgment scoring (KJS) of both.

Testing looks simple: learn, test, and evaluate. Short answer, multiple-choice, or both with student judgment. Lower levels of thinking, higher levels of thinking, or both as needed. Student ability below, on level, or above grade level. There are many more variables for standardized test makers to worry about in a nearly impossible situation. By the time these have been sanitized from their standardized tests all that remains is a ranking on the test that is of little if any instructional value (unless student judgment is added to the scoring).

Chart 1/4 compares a short answer and a right-count traditional multiple-choice test. The teacher has the most responsibility for the test score when working with pupils at lower levels of thinking (60%). A high quality student functioning at higher levels of thinking could take the responsibility to report what is known or can be done in one pass and then just mark the remainder for the same score (60%). The teacher’s score is based on the subjective interpretation of the student’s work. The student’s score is based on a matching of the subjective interpretation of the test questions with test preparation. [The judgment needed to do this is not recorded in traditional multiple-choice scores.]

Chart 2/4 compares what students are told about multiple-choice tests and what actually takes place. Students are told the starting score is zero. One point is added for each right mark. Wrong or blank answers add nothing. There is no penalty. Mark an answer to every question. As a classroom test, this makes sense if the results are returned in a functional formative assessment environment. Teachers have the responsibility to sum several scores when ranking students for grades.

As a standardized test, the single score is very unfair. Test makers place great emphasis on the right-mark after-test score and the precision of their data reduction tools (for individual questions and for groups of students). They have a responsibility to point out that the student on either side of you has an unknowable, different starting score from chance, let alone your luck on test day. The forced-choice test actually functions as a lottery. Lower scoring students are well aware of this and adjust their sense of responsibility accordingly (in the absence of a judgment or quality score to guide them).

Chart 3/4 compares student performance by quality. Only a student with a well-developed sense of responsibility, or a comparable innate ability, can be expected to function as a high quality, high scoring student (100% but reported as 60%). A less self-motivated student, or one with less ability, can perform two passes at 100% and 80% to also yield 60%. The typical student, facing a multiple-choice test, will make one pass, marking every question as it comes, to earn a quantity, quality, and test score of 60%: a rank of 60%. No one knows which right mark is a right answer.

Teachers and test makers have a responsibility to assess and report individual student quality on multiple-choice tests just as is done on short-answer, essay, project, research, and performance tests. These notes of encouragement and direction provide the same “feel good” effect found in a knowledge and judgment scored quality score when accompanied with a list of what was known or could be done (the right-marked questions).

Chart 4/4 shows knowledge and judgment scoring (KJS) with a five-option question made from a regular four-option question plus omit. Omit replaces “just marking”. A short answer question scored with KJS earns one point for judgment and +/-1 point for right or wrong. An essay question expecting four bits of information (short sentence, relationship, sketch, or chart) earns 4 points for judgment and +/-4 points for an acceptable or not acceptable report. (All fluff, filler, and snow are ignored. Students quickly learn to not waste time on these unless the test is scored at the lowest level of thinking by a “positive” scorer.)

Each student starts with the same multiple-choice score: 50%. Each student stops when each student has customized the test to that student’s preparation. This produces an accurate, honest and fair test score. The quality score provides judgment guidance for students at all levels. It is the best that I know of when operating with paper and pencil. Power Up Plus is a free example. Amplifire refines judgment into confidence using a computer, and now on the Internet. It is just easier to teach a high quality student who knows what he/she knows.


Most teachers I have met question the score of 60% from KJS. How can a student get a score of 60% and only mark 10% of the questions right? Easy. Sum 50% for perfect judgment, 10% for right answers, and NO wrong. Or sum 50% for judgment, mark 10% right on a first pass and 10% right plus 10% wrong on a second pass, and omit the rest. If the student in the example chose to mark 10% right (a few well mastered facts) and then just marked the rest (had no idea how to answer), the resulting score falls below 40% (about 25% wrong). With no judgment, the two methods of scoring (smart and dumb) produce identical test scores. KJS is not a give-away. It is a simple, easy way to update currently used multiple-choice questions to produce an accurate, honest, and fair test score. KJS records what right-count traditional multiple-choice misses (judgment) and what the CCSS movement tries to promote.
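A minimal sketch of the weighting described in this post (wrong = 0, omit = 1 for good judgment, right = 2, so an all-omit answer sheet earns the 50% starting score). The 20-question answer patterns below are invented, so the percentages differ from the worked example above; the point is that omitting instead of wrong-marking raises the score, while a student who marks everything gets the same score under either method.

```python
def kjs_score(marks):
    """Knowledge and Judgment Scoring: wrong = 0, omit = 1 (good judgment), right = 2.
    An all-omit answer sheet earns the 50% starting score."""
    points = {'right': 2, 'omit': 1, 'wrong': 0}
    return 100.0 * sum(points[m] for m in marks) / (2 * len(marks))

def rcs_score(marks):
    """Traditional right-count scoring: only right marks earn credit."""
    return 100.0 * sum(m == 'right' for m in marks) / len(marks)

# Hypothetical 20-question test.
reports = ['right'] * 4 + ['omit'] * 16      # marks only what is known, omits the rest
just_marks = ['right'] * 8 + ['wrong'] * 12  # marks every question, right and wrong

for label, marks in [('reports and omits', reports), ('marks everything', just_marks)]:
    print('%-18s KJS %3.0f%%   RCS %3.0f%%' % (label, kjs_score(marks), rcs_score(marks)))
```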

Wednesday, January 14, 2015

Meaningful Multiple-Choice Test Scores

The meaning of a multiple-choice test score is determined by several factors in the testing cycle including test creation, test instructions, and the shift from teacher to student being responsible for learning and reporting. Luck-on-test-day, in this discussion, is considered to have similar effects on the following scoring methods.

[Luck-on-test-day includes but is not limited to: test blueprint, question author, item calibration, test creator, teacher, curriculum, standards; classroom, home, and in between, environment; and a little bit of random chance (act of God that psychometricians need to smooth their data).]                             

There are three ways of obtaining test scores: open ended short answer, closed ended right-count four-option multiple-choice, and knowledge and judgment scoring (KJS) for both short answer and multiple-choice. These range from familiar manual scoring to what is now easily done with KJS computer software. Each method of scoring has a different starting score with a different meaning. A customary average classroom score of 75% is assumed (60% passing).

Chart 1/4

Open ended short answer scores start with zero and increase with each acceptable answer. There may be several acceptable answers for a single short answer question. The level of thinking required depends upon the stem of the question. There may be an acceptable answer for a question both at lower and at higher levels of thinking. These properties carry over into KJS below.

The teacher or test maker is responsible for scoring the test (Mastery = 60%; + Wrong = 0%; = 60% passing for quantity in Chart 1/4). The quality of the answers can be judged by the scorer and may influence which ones are considered right answers.

The open ended short answer question is flexible (multiple right answers) and somewhat subjective; different scorers are expected to produce similar scores. The average test score is controlled by selecting a set of items that is expected to yield an average test score of 75%. The student test score is a rank based on items included in the test to survey what students were expected to master, items that group students who know from those who do not know, and items that fail to show mastery or discrimination (unfinished items, for a host of reasons, including luck-on-test-day above).

The open ended short answer question can also be scored as a multiple-choice item. First tabulate the answers. Sort the answers from high to low count.  The most frequent answer, on a normal question, will be the right answer option. The next three ranking answers will be real student supplied wrong answer options (rather than test writer created wrong answer options). This pseudo-multiple-choice item can now be printed as a real question on your next multiple-choice test (with answers scrambled).
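A minimal sketch of that tabulation in Python, assuming the free-response answers have already been normalized to comparable strings; the answers and counts below are invented.

```python
from collections import Counter
import random

# Hypothetical normalized short answers collected from one class.
answers = (['mitochondria'] * 14 + ['chloroplast'] * 5 +
           ['ribosome'] * 4 + ['nucleus'] * 3 + ['cell wall'] * 2)

ranked = Counter(answers).most_common(4)   # the most frequent answer plus the top three wrong answers
key = ranked[0][0]
options = [answer for answer, count in ranked]
random.shuffle(options)                    # scramble before printing the pseudo-multiple-choice item

print('Key:', key)
print('Options:', options)
```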

A high quality student could also mark only right answers on the first pass using the above test (Chart 1/4) and then finish by just marking on the second pass to earn a score of 60%. A lower quality student could just mark each item in order, as is usually done on multiple-choice tests, mixing right and wrong marks, to earn the same score of 60%. Using only a score after the test we cannot see what is taking place during the test. Turning a short answer test into traditional multiple-choice hides student quality, the very thing that the CCSS movement is now promoting.
Chart 2/4

Closed ended right-count four-option multiple-choice scores start with zero and increase with each right mark. Not really!! This is only how this method of scoring has been marketed for a century, by only considering a score based on right counts after the test is completed. In the first place, traditional multiple-choice is not multiple-choice but forced-choice (it lacks one option, discussed below). This injects a 25% bonus (on average) at the start of the test (Chart 2/4). This evil flaw in test design was countered, over 50 years ago, by a now defunct “formula scoring”. After forcing students to guess, psychometricians wanted to remove the effect of just marking! It took the SAT until March 2014 to drop this “score correction”.

[Since there was no way to tell which right answer must be changed for the correction, it made no sense to anyone other than psychometricians wanting to optimize their data reduction tools, with disregard for the effect of the correction on the students taking such a test. Now that 4-option questions have become popular on standardized tests, a student who can eliminate one option can guess from the remaining three for better odds on getting a right mark (which is not necessarily a right answer that reflects recall, understanding, or skill).]

The closed ended right-count four-option multiple-choice question is inflexible (one right answer) and has no scoring subjectivity; all scorers yield the same count of right marks. Again, the average test score is controlled by selecting a set of items expected to yield 75% on average (60% passing). However, this 75% is not the same as that for the open ended short answer test. As a forced-choice test, the multiple-choice test will be easier; it starts with a 25% on-average advantage. (That means one student may start with 15% and a classmate with 35%.) To further confound things, the level of thinking used by students can also vary. A forced-choice test can be marked entirely at lower levels of thinking.
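A small simulation of that 25% on-average advantage, assuming pure blind guessing on four-option questions (the test length and student count are arbitrary). It shows the head start averaging near 25% while individual students scatter well above and below it.

```python
import random
random.seed(7)

QUESTIONS = 40   # hypothetical forced-choice test length
OPTIONS = 4
STUDENTS = 1000

def blind_guess_score():
    """Percent right from blind guessing on every question of a four-option test."""
    right = sum(random.randrange(OPTIONS) == 0 for _ in range(QUESTIONS))
    return 100.0 * right / QUESTIONS

scores = [blind_guess_score() for _ in range(STUDENTS)]
print('average guessing score: %.1f%%' % (sum(scores) / len(scores)))
print('lowest and highest    : %.0f%% to %.0f%%' % (min(scores), max(scores)))
```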

[Standardized tests control part of the above problems by eliminating almost all mastery and unfinished items. The game is to use the fewest items that will produce a desired score distribution with an acceptable reliability. A traditional multiple-choice scored standardized test score of 60% is a much more difficult accomplishment than the same score on a classroom test.]

A forced-choice test score is a rank of how well a student did on a test. It is not a report of what a student actually knows or can do that will serve as the basis for further instruction and learning. The reasoning is rather simple: the forced-choice score is counted up AFTER the test is finished; this is the final game score. How the game started (25% on-average) and was played is not observed (but this is what sports fans pay for). This is what students and teachers need to know so students can take responsibility for self-corrective learning.

Chart 3/4
[Three student performances that all end up with a traditional multiple-choice score of 60% are shown in Chart 3/4. The highest quality student used two passes: “I know or can do this or I can eliminate all the wrong options” and “I don’t have a clue”. The next lower quality student used three passes: “I know or can do this”; “I can eliminate one or more answer options before marking”; and “I am just marking.” The lowest level of thinking student just marks answers in one pass, right and wrong, as most low quality, lower level of thinking students do. But what takes place during the test is not seen in the score made after the test. The lowest quality student must review all past work (if tests are cumulative) or continue on with an additional burden as a low quality student. A high quality student needs only to check on what has not been learned.]

Chart 4/4

Knowledge and Judgment scores start at 50% for every student, plus one point for an acceptable answer and minus one point for a not acceptable answer (right/wrong on traditional multiple-choice). (Lower level of thinking students prefer: Wrong = 0, Omit = 1, and Right = 2.) Omitting an answer is good judgment: it reports what has yet to be learned or understood. Omitting keeps the one point for good judgment. An unacceptable or wrong mark is poor judgment; you lose one point for bad judgment.

Now what is hidden with forced-choice scoring is visible with Knowledge and Judgment Scoring (KJS). Each student can show how the game is played. There is a separate student score for quantity and for quality. A starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge Factor sets the starting score near 75%. Judgment is far more important than knowledge in high risk occupations.]

KJS includes a fifth answer option: omit (good judgment to report what has yet to be learned or understood). When this option is not used, the test reverts to forced-choice scoring (marking one of the four answer options for every question).

A high quality student marked 10 right out of 10 marked and then omitted the remainder (two passes through the test), or added a few more marks that paired one right with one wrong (three passes), for a passing score of 60% in Chart 4/4. A student of less quality did not omit but just marked, for a score of less than 50%. A lower level of thinking, low quality student marked 10 right and just marked the rest (two passes) for a score of less than 40%. KJS yields a score based on student judgment (60%) or on the lack of that judgment (less than 50%).

In summary, the current assessment fad is still oriented on right marks rather than on student judgment (and development). Students with a practiced good judgment develop the sense of responsibility needed to learn at all levels of thinking. They do not have to wait for the teacher to tell them they are right. Learning is stimulated and exhilarating. It is fun to learn when you can question, get answers, and verify a right answer or a new level of understanding; when you can build on your own trusted foundation.

Low quality students learn by repeating the teacher. High quality students learn by making sense of an assignment. Traditional multiple-choice (TMC) assesses and rewards lower-levels-of-thinking. KJS assesses and rewards all-levels-of-thinking. TMC requires little sense of responsibility. KJS rewards (encourages) the sense of responsibility needed to learn at all levels of thinking.

1.     A short answer, hand scored, test score is an indicator of student ability and class ranking based on the scorer’s judgment. The scorer can make a subjective estimate of student quality.

2.     A TMC score is only a rank on a completed test with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.

3.     A KJS test score is based on a student, self-reporting, estimate of what the student knows and can do on a completed test (quantity) and an estimate of the student’s ability to make use of that knowledge (judgment) during the test (quality). The score has student judgment and quality, not scorer judgment and quality.

In short, students who know that they can learn (who get rapid feedback on quantity and quality), and who want to learn, enjoy learning (see Amplifire below). All testing methods fail to promote these student development characteristics unless the test results are meaningful, easy for students and teachers to use, and timely. Student development requires student performance, not just talking about it or labeling something formative assessment.

Power Up Plus (PUP or PowerUP) scores both TMC and KJS. Students have the option of selecting the method of scoring they are comfortable with. Such standardized tests have the ability to estimate the level of thinking used in the classroom and by each student.  Lack of information, misinformation, misconceptions and cheating can be detected by school, teacher, classroom, and student.

Power Up Plus is hosted at TeachersPayTeachers to share what was learned in a nine year period with 3000 students at NWMSU. The free download below supports individual teachers who want to upgrade their multiple-choice tests for formative, cumulative, and exit ticket assessment. Good teachers, working within the bounds of accepted standards, do not need to rely on expensive assessments. They (and their students) do need fast, easy to use, test results to develop successful high quality students.

I hope your students respond with the same positive enthusiasm that over 90% of mine did. We need to assess students to promote their abilities. We do not need to primarily assess students to promote the development of psychometric tools that yield far less than what is marketed.

A Brief History:

Geoff Masters (1950-    )   A modification of traditional multiple-choice test performance.

Created partial credit scoring for the Rasch model (1982) as a scoring refinement for traditional right-count multiple-choice. It gives partial credit for near right marks. It does not change the meaning of the right-count score (as quantity and quality have the same value by default [both wrong marks and blanks are counted as zeros], only quantity is scored). The routine is free in Ministep software.

Richard A. Hart (1930-    )   Promotes student development by student self-assessment of what each student actually knows and can do, AFTER learning, with “next class period” feedback.

Knowledge and Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to reduce the time needed for students to write, and for me to score, short answer and essay questions. I created software (1981) to score multiple-choice, both right-count, and knowledge and judgment, to encourage students to take responsibility for what they were learning at all levels of thinking in any subject area. Students voted to give knowledge and judgment equal value. The right-count score retains the same meaning (quantity of right marks) as above. The knowledge and judgment score is a composite of the judgment score (quality, the “feel good” score AFTER learning) and the right-count score (quantity). Power Up Plus (2006) is classroom friendly (for students and teachers) and a free download: Smarter Test Scoring and Item Analysis.

Knowledge Factor (1995-    )   Promotes student learning and retention by assessing student knowledge and confidence, DURING learning, with “instant” feedback to develop “feeling good” during learning.

Knowledge Factor was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment program, Amplifire, places much more weight on confidence than on knowledge (a wrong mark may reduce the score by three times as much as a right mark adds). The software leads students through the steps needed to learn easily, quickly, and in a depth that is easily retained for more than a year. Students do not have to master the study skills and the sense of responsibility needed to learn at all levels of thinking, as is needed for mastery with KJS. Amplifire is student friendly, online, and so commercially successful in developed topics that it is not free.


[Judgment and confidence are not the same thing. Judgment is measured by performance (percent of right marks), AFTER learning, at any level of student score. Confidence is a good feeling that Amplifire skillfully uses to promote rapid learning, DURING learning and self-assessment, to a mastery level. Students can take confidence in their practiced and applied self-judgment. The KJS and Amplifire test scores reflect the complete student. IMHO standardized tests should do this also, considering their cost in time and money.]

Wednesday, December 10, 2014

Information Functions - Adding Unbalanced Items

                                                                13
Adding 22 balanced items to the 21 items of Table 33, in the prior post, resulted in a similar average test score (Table 36) and the same item information functions (the added items were duplicates of those in the first Nurse124 data set of 21 items). What happens if an unbalanced set of 6 items is added? I just deleted the 16 high scoring additions from Table 36. Both balanced additions (Table 36) and unbalanced additions (Table 39) had the same extended range of item difficulties (5 to 21 right marks, or 23% to 95% difficulty).

Table 33
Table 36
Table 39

Adding a balanced set of items to the Nurse124 data set kept the average score about the same: 80% and 79% (Table 36). Adding a set of more difficult items to the Nurse124 data decreased the average score to 70% (Table 39) and decreased student scores. Traditionally, a student’s overall score is then the average of the three test scores: 80%, 79%, and 70%, or 76% for an average student (Tables 33, 36, and 39). An estimate of a student’s “ability” is thus directly dependent upon his test scores, which are dependent upon the difficulty of the items on each test. This score is accepted as a best estimate of the student’s true score. This value is a best guess of future test scores. This makes common sense: past performance is a predictor of future performance.

 [Again a distinction must be made between what is being measured by right mark scoring (0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a test the student may not be able to read or understand. The other also indicates the quality of each student’s knowledge; the ability to make meaningful use of knowledge and skills. Both methods of analysis can use the exact same tests. I continue to wonder why people are still paying full price but harvesting only a portion of the results.]

The Rasch model IRT takes a very different route to “ability”. The very same student mark data sets can be used. Expected IRT student scores are based on the probability that half of all students with a given ability location will correctly mark a question with a comparable difficulty location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.)  [The location starts from the natural log of the right/wrong ratio for a score and the wrong/right ratio for a difficulty. A convergence of score and difficulty yields the final location. The 50% test score becomes the zero logit location, the only point where right mark scoring and IRT scores are in full agreement.]
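A minimal sketch of the logit arithmetic in the bracketed note, stopping at the starting estimates rather than iterating to convergence as Winsteps does. The counts are invented; only the ln(right/wrong), ln(wrong/right), and Rasch probability steps follow the post.

```python
import math

def student_location(right, items):
    """Starting ability location in logits: natural log of right/wrong."""
    return math.log(right / (items - right))

def item_location(right_count, students):
    """Starting difficulty location in logits: natural log of wrong/right."""
    return math.log((students - right_count) / right_count)

def p_right(ability, difficulty):
    """Rasch model probability of a right mark from the ability-difficulty difference."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical: a student with 16 of 21 right meets an item that 17 of 22 students marked right.
b = student_location(16, 21)
d = item_location(17, 22)
print('ability %+.2f logits, difficulty %+.2f logits, P(right) = %.2f' % (b, d, p_right(b, d)))
```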

The Rasch model IRT converts student scores and item difficulties [in the marginal cells of student data] into the probabilities of a right answer (Table 33b). [The probabilities replace the marks in the central cell field of student data.] It also yields raw student scores and their conditional standard errors of measurement (CSEMs) (Tables 33c, 36c, and 39c) based on the probabilities of a right answer rather than the count of right marks. (For more see my Rasch Model Audit blog.)
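A short sketch of the row-wise bookkeeping just described, assuming one student’s row of right-answer probabilities has already been produced by the model (the values below are invented): the expected raw score is the sum of the probabilities, and the CSEM uses the 1/SQRT(SUM(p*q)) formula quoted later in these posts.

```python
# Hypothetical Rasch probabilities of a right mark for one student across 10 items.
p = [0.95, 0.92, 0.88, 0.85, 0.80, 0.72, 0.65, 0.55, 0.45, 0.30]

expected_score = sum(p)                            # expected count of right marks for this row
information = sum(pi * (1 - pi) for pi in p)       # sum of p*q across the row
csem_logits = 1 / information ** 0.5               # CSEM = 1/SQRT(SUM(p*q))

print('expected score %.1f of %d, CSEM %.2f logits' % (expected_score, len(p), csem_logits))
```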

Student ability becomes fixed and separated from the student test score; a student with a given ability can obtain a range of scores on future tests without affecting his ability location. A calibrated item can yield a range of difficulties on future tests without affecting its difficulty calibrated location. This makes sense only in relation to the trust you can have in the person interpreting IRT results; that person’s skill, knowledge, and (most important) experience at all levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well, “look right”, can be eliminated from the data set. Also the same data set (Table 33, Table 36, and Table 39) can be treated differently if it is classified as field test, operational test, benchmark test, or current test.

At this point states recalibrated and creatively equilibrated test results to optimize federal dollars during the NCLB era by showing gradual continuing improvement.  It is time to end the ranking of students by right mark scoring (0,1 scoring) and include KJS, or PCM (0,1,2 scoring) [that about every state education department has: Winsteps], so that standardized testing yields the results needed to guide student development: the main goal of the CCSS movement.


The need to equilibrate a test is an admission of failure. The practice has become “normal” because failure is so common. It opened the door to cheating at state and national levels. [To my knowledge no one has been charged and convicted of a crime for this cheating.] Current computer adaptive testing (CAT) hovers about the 50% level of difficulty. This optimizes psychometric tools. Having a disinterested party outside of the educational community doing the assessment analysis, and online CAT, reduces the opportunity to cheat. They do not, IMHO, optimize the usefulness of the test results. End-of-course tests are now molding standardized testing into an instrument to evaluate teacher effectiveness rather than assess student knowledge and judgment (student development).

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.

Wednesday, November 12, 2014

Information Functions - Adding Balanced Items

                                                               12
I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information functions (IIFs). This post makes use of that observation to improve the Nurse124 data set that generated the set of IIFs in Chart 75.

I observed that Tables 33 and 34, in the prior post, contained no items with difficulties below 45%. The item information functions (IIFs) were also skewed (Chart 75). This is not the symmetrical display associated with the Rasch IRT model. I reasoned that adding a balanced set of items would increase the number of IIFs without changing the average item difficulty.

Table 36a shows the addition of a balanced set of 22 items to the Nurse124 data set of 21 items. As each lower ranking item was added, one or more high ranking items were added to keep the average test score near 80%. This table added six lower ranking items and 16 higher scoring items resulting in an average score of 79% and 43 items total.

Table 36
The average item difficulty for the Nurse124 data set was 17.57 and the expanded set was 17.28. The average test score of 80% came in as 79%. Student scores (ability) also remained about the same. [I did not take the time to tweak the additions for a better fit.] Both item difficulty and student score (ability) remained about the same.

The conditional standard error of measurement (CSEM) did change with the addition of more items (Chart 79 below). The number of cells containing information expanded from 99 to 204 cells. The average right count student score increased from 17 to 34.

Table 36c shows the resulting item information functions (IIFs). The original set of 11 IIFs now contains 17 IIFs (orange). The original set of 9 different student scores now contains 12 different scores; however, the range of student scores is comparable between the two sets. This makes sense as the average test scores are similar and the student scores are also about the same.
Table 37
Chart 77

Chart 77 (Table 37) shows the 17 IIFs as they spread across the student ability range of 12 rankings (student score right count/% right). The trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near the average test score of 79%. This was expected as the maximum information value within an IIF occurs when the item difficulty and student ability score match. [The three bottom traces on Chart 77 (blue, red, and green) have been colored in Table 37 as an aid in relating the table and chart (rotate Table 37 counter-clockwise 90 degrees).]

Even more important is the way the traces are increasingly skewed the further the IIFs are away from this maximum, 11/50%, trace (blue square, Chart 77). Also the IIF with a difficulty of 18/82%, near the average test score, produced the identical total information (1.41) from both the Nurse124 and the supplemented data sets. But these values also drifted apart for the two data sets for IIFs of higher and lower difficulty.

Two IIFs near the 50% difficulty point delivered the maximum information (2.17). Here again is evidence that prompts psychometricians to work closely to the 50% or zero logit point to optimize their tools when working on low quality data (limiting scoring only to right counts rather than also offering students the option to assess their judgment to report what is actually meaningful and useful; to assess their development toward being a successful, independent, high quality achiever). [Students that only need some guidance rather than endless “re-teaching”; that, for the most part, consider right count standardized tests a joke and a waste of time.]
Chart 78

Table 38
The test information function for the supplemented data set is the sum of the information in all 17 item information functions (Table 38 and Chart 78). It took 16 easy items to balance 6 difficult items. The result was a marked increase in precision at the student score levels between 30/70% and 32/74%. [More at Rasch Model Audit blog.]
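A minimal sketch of stacking item information functions into a test information function across a grid of ability locations. The item difficulties (in logits) are invented, arranged as 16 easy items balancing 6 difficult ones; only the p*q bookkeeping follows the post.

```python
import math

def p_right(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Rasch item information p*q; at most 0.25, reached when ability matches difficulty."""
    p = p_right(ability, difficulty)
    return p * (1 - p)

# Hypothetical item difficulties in logits: 16 easy items balancing 6 difficult ones.
difficulties = [-2.0] * 8 + [-1.5] * 8 + [1.0] * 3 + [1.5] * 3

for ability in (-2, -1, 0, 1, 2):
    tif = sum(item_information(ability, d) for d in difficulties)   # test information function
    csem = 1 / math.sqrt(tif)                                       # CSEM in logits at this ability
    print('ability %+d logits: information %5.2f, CSEM %.2f logits' % (ability, tif, csem))
```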

Chart 79

Chart 79 summarizes the relationships between the Nurse124 data, the supplemented data (adding a balanced set of items that keeps student ability and item difficulty unchanged), and the CTT and IRT data reduction methods. The IRT logit values (green) were plotted directly and inverted (1/CSEM) for comparison. In general, both CTT (blue) and IRT inverted (red) produced  comparable CSEM values.

Adding 22 items increased the CTT test SEM from 1.75 to 2.54. The standard deviation (SD) between student test scores increased from 2.07 to 4.46. The relative effect is 1.75/2.07 and 2.54/4.46, or 84% and 57%, a difference of 27 percentage points, or an improvement in precision of 27/84 or 32%.
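The same relative-precision arithmetic, spelled out with the values quoted above:

```python
sem_before, sd_before = 1.75, 2.07   # Nurse124 data set
sem_after,  sd_after  = 2.54, 4.46   # after adding the 22 items

ratio_before = 100 * sem_before / sd_before   # about 84%
ratio_after  = 100 * sem_after / sd_after     # about 57%
improvement  = 100 * (ratio_before - ratio_after) / ratio_before

print('SEM/SD: %.0f%% -> %.0f%%, an improvement in precision of about %.0f%%'
      % (ratio_before, ratio_after, improvement))
```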

Chart 79 also makes it very obvious that the higher the student test score the lower the CTT CSEM, the more precise the student score measurement, the less error. That makes sense.

The above statement about a CTT CSEM must be related to a second statement that the more item information, the greater the precision of measurement by the item at this student score rank. The first statement harvests variance from the central cell field from within rows of student (right) marks (Table 36a) and from rows of probabilities (of right marks) in Table 36c.

The binomial variance CTT CSEM view is then comparable to the reciprocal or inverted (1/CSEM) view of the test information function CSEM view (Chart 79). CTT (blue, CTT Nurse124, Chart 79) and IRT inverted (red, IRT N124 Inverted) produced similar results even with an average test score of 79% that is 29 percentage points away from the 50%, zero logit, IRT optimum performance point.

The second statement harvests variance, item information functions, in Table 36c from columns of probabilities (of right marks). Layering one IIF on top of another across the student score distribution yields the test information function (Chart 78).


The Rasch IRT model harvests the variance from rows and from columns of probabilities of getting a right answer that were generated from the marginal student scores and item difficulties. CTT harvests the variance of the marks students actually made. Yet, at the count-only-right-mark level, they deliver very similar results, with the exception of the IIF from IRT analysis, which the CTT analysis does not provide.


Wednesday, October 8, 2014

Customizing Test Precision - Information Functions

                                                               11

(Continued from the prior two posts.)

The past two posts have established that there is little difference between classical test theory (CTT) and item response theory (IRT) in respect to test reliability and conditional standard error of measurement (CSEM) estimates (other than the change in scales). IRT is now the analysis of choice for standardized tests. The Rasch model IRT is the easiest to use and also works well with small data sets, including classroom tests. How two normal scales for student scores and item difficulties are combined onto one IRT logit scale is no longer a concern to me, other than that the same method must be used throughout the duration of an assessment program.

Table 33
What is new and different from CTT is an additional insight from the IRT data in Table 32c (information p*q values). I copied Table 32 into Table 33 with some editing. I colored the cells holding the maximum amount of information (0.25) yellow in Table 33c. This color was then carried back to Table 33a, Right and Wrong Marks. [Item Information is related to the marginal cells in Table 33a (as probabilities), and not to the central cell field (as mark counts).] The eleven item information functions (in columns) were re-tabled into Table 34 and graphed in Chart 75. [Adding the information in rows yields the student score CSEM in Table 33c.]

Table 34
Chart 75
The Nurse124 data yielded an average test score of 16.8 marks or 80%. This skewed the item information functions away from the 50% or zero logit difficulty point (Chart 75). The more difficult the item, the more information developed: from 0.49 for a 95% right count to a maximum of 1.87 at 54% and 45% right counts. [No item on the test had a difficulty of 50%.]

Table 35
Chart 76
The sum of information (59.96) by item difficulty level and student score level is tabled in Table 35 and plotted as the test information function in Chart 76. This test does not do a precise job of assessing student ability. The test was most precise (19.32) at the 16 right count/76% right location. [Location can be designated by measure (logit), input raw score (red) or output expected score (Table 33b).]

The item with an 18 right count/92% right difficulty (Table 35) did not contribute the most information individually but did as a group of three items (9.17).  The three highest scoring, easiest, items (counts of 19, 20, and 21) are just too easy for a standardized test but may be important survey items needed to verify knowledge and skills for this class of high performing students. None of these three items reached an information level maximum of 1/4. [It now becomes apparent how items can be selected to produce a desired test information function.]

The more information available is interpreted as greater precision or less error (smaller CSEM in Table 33c). [CSEM = 1/SQRT(SUM(p*q)) on Table 33c. p*q is at a maximum when p = q; when right = wrong: (RT x WG)/(RT + WG)^2 or (3 x 3)/36 = 1/4].
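A one-line check of the bracketed arithmetic: item information, written with right and wrong counts, peaks at 1/4 when the two counts match.

```python
def item_information(right, wrong):
    """p*q written with counts: (RT x WG) / (RT + WG)^2."""
    return (right * wrong) / float((right + wrong) ** 2)

for right, wrong in [(3, 3), (4, 2), (5, 1), (6, 0)]:
    print('%d right, %d wrong -> information %.3f' % (right, wrong, item_information(right, wrong)))
```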

Each item information function spans the range of student scores on the test (Chart 76). Each item information function measures student ability most precisely near the point that item difficulty and student ability match (50% right) along the IRT S-curve. [The more difficult an item, the more ability students must have to mark correctly 50% of the time. Student ability is the number correct on the S-curve. Item difficulty is the number wrong on the S-curve (see more at Rasch Model Audit).]   

Extracting item information functions from a data table provides a powerful tool (a test information function) for psychometricians to customize a test (page 127, Maryland 2010). A test can be adjusted for maximum precision (minimum CSEM) at a desired cut point.

The bright side of this is that the concept of “information” (not applicable to CTT), and the ability to put student ability and item difficulty on one scale, gives psychometricians powerful tools. The dark side is that the form in which the test data is obtained remains at the lowest levels of thinking in the classroom. Over the past decade of the NCLB era, as psychometrics has made marked improvements, the student mark data it is being supplied has remained in the casino arena: Mark an answer to each question (even if you cannot read or understand the question), do not guess, and hope for good luck on test day.

The concepts of information, item discrimination and CAT all demand values hovering about the 50% point for peak psychometric performance. Standardized testing has migrated away from letting students report what they know and can do to a lottery that compares their performance (luck on test day) on a minimum set of items randomly drawn from a set calibrated on the performance of a reference population on another test day. 

The testing is optimized for psychometric performance, not for student performance. The range over which a student score may fall is critical to each student. The more precise the cut score, the narrower this range, and the fewer the students who fall just below that point on the score distribution but who might have passed on another test day. In general, no teacher or student will ever know. [Please keep in mind that the psychometrician does not have to see the test questions. This blog has used the Nurse124 data without even showing the actual test questions or the test blueprint.]

It does not have to be that way. Knowledge and Judgment Scoring (classroom friendly) and the partial credit Rasch model (that is included in the software states use) can both update traditional multiple-choice to the levels of thinking required by the common core state standards (CCSS) movement. We need an accurate, honest and fair assessment of what is of value to students, as well as precise ranking on an efficient CAT. 



Wednesday, September 10, 2014

Conditional Standard Error of Measurement - Precision

                                                              10    
(Continued from prior post.)

Table 32a contains two estimates (red) of the test standard error of measurement (SEM) that are in full agreement.  One estimate, 1.75, is from the average of the conditional standard error of measurements (CSEM, green) for each student raw score. The traditional estimate, 1.74, uses the traditional test reliability, KR20. No problem here.
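A minimal sketch of the two routes to a test SEM, using an invented right/wrong mark matrix: one route goes through KR-20 reliability, the other averages per-score conditional SEMs (here Lord’s SQRT(X(k−X)/(k−1)) for a raw score X on k items, one common choice). With made-up data the two routes give similar, not identical, numbers.

```python
import random
import statistics

random.seed(3)
STUDENTS, ITEMS = 24, 21

# Hypothetical right(1)/wrong(0) marks with a spread of student abilities and item difficulties.
abilities = [random.uniform(0.35, 0.95) for _ in range(STUDENTS)]
shifts = [random.uniform(-0.15, 0.15) for _ in range(ITEMS)]
marks = [[1 if random.random() < min(max(a - d, 0.05), 0.95) else 0 for d in shifts]
         for a in abilities]
scores = [sum(row) for row in marks]

# Route 1: test SEM from KR-20 reliability.
difficulties = [sum(row[i] for row in marks) / STUDENTS for i in range(ITEMS)]
sum_pq = sum(p * (1 - p) for p in difficulties)
kr20 = (ITEMS / (ITEMS - 1)) * (1 - sum_pq / statistics.pvariance(scores))
sem_kr20 = statistics.pstdev(scores) * (1 - kr20) ** 0.5

# Route 2: average the conditional SEMs for each raw score X.
csems = [(x * (ITEMS - x) / (ITEMS - 1)) ** 0.5 for x in scores]
sem_mean_csem = sum(csems) / len(csems)

print('KR-20 %.2f   SEM via KR-20 %.2f   SEM via mean CSEM %.2f'
      % (kr20, sem_kr20, sem_mean_csem))
```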

The third estimate of the test SEM, in Table 32c, is different. It is based on CSEM values expressed in logits (natural log base, 2.718) rather than on the normal scale. The values are also inverted in relation to the traditional values in Table 32 (Chart 74). There is a small but important difference: the IRT CSEM values are much more linear than the CTT CSEM values. Also, the center of this plot is the mean of the number of items (Chart 30, prior post), not the mean of the item difficulties or student scores. [Also most of this chart was calculated, as most of these relationships do not require actual data to be charted. Only nine score levels came from the Nurse124 data.]

Chart 74 shows the binomial CSEM values for CTT (normal) and IRT (logit) values obtained by inverting the CTT values: “SEM(Rasch Measure in logits) = 1/(SEM(Raw Score))”, 2007. I then adjusted each of these so the corresponding curves, on the same scale, crossed near the average CSEM or test SEM: 1.75 for CTT and 0.64 for IRT. The extreme values for no right and all right were not included. CSEM values for extreme scores go to zero or to infinity with the following result:

“An apparent paradox is that extreme scores have perfect precision, but extreme measures have perfect imprecision.” http://www.winsteps.com/winman/reliability.htm

Precision is then not a constant across the range of student scores under either method of analysis. The test SEM of 0.64 logits is comparable to 1.74 counts on the normal scale.
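A short sketch of that inversion under a simple binomial assumption (treating every item as equally difficult for the student): the raw-score CSEM is SQRT(SUM(p*q)) while the logit CSEM is 1/SQRT(SUM(p*q)), so one is largest where the other is smallest.

```python
import math

ITEMS = 21
for raw_score in range(3, 19, 3):
    p = raw_score / ITEMS                # treat every item as equally difficult for this student
    sum_pq = ITEMS * p * (1 - p)
    csem_raw = math.sqrt(sum_pq)         # counts: largest near a 50% score
    csem_logit = 1 / math.sqrt(sum_pq)   # logits: smallest near a 50% score
    print('score %2d/21: raw CSEM %.2f counts, logit CSEM %.2f logits'
          % (raw_score, csem_raw, csem_logit))
```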

The estimate of precision, CSEM, serves three different purposes. For CTT and IRT it narrows down the range in which a student’s test score is expected to fall (1). The average of the (green) individual score CSEM values estimates the test SEM as 1.75 counts out of a range of 21 items. This is less than the 2.07 counts for the test standard deviation (SD) (2). Cut scores with greater precision are more believable and useful.

For IRT analysis, the CSEM indicates the degree that the data fit the perfect Rasch model (3). A better fit also results in more believable and useful results.

“A standard error quantifies the precision of a measure or an estimate. It is the standard deviation of an imagined error distribution representing the possible distribution of observed values around their “true” theoretical value. This precision is based on information within the data. The quality-control fit statistics report on accuracy, i.e., how closely the measures or estimates correspond to a reference standard outside the data, in this case, the Rasch model.” http://www.winsteps.com/winman/standarderrors.htm

Precision also has some very practical limitations when delivering tests by computer adaptive testing (CAT). Linacre, 2006, has prepared two very neat tables showing the number of items that must be on a test to obtain a desired degree of precision expressed in logits and in confidence limits. The closer the test “targets” an average score of 50%, the fewer items needed for a desired precision.

The two top students, with the same score of 20, missed items with different difficulties. They both yield the same CSEM. The CSEM ignores the pattern of marks and the difficulty of items. A CSEM value obtained in this manner is related only to the raw score. Absolute values for the CSEM are sensitive to item difficulty (Table 23a and 23b).

The precision of a cut score has received increasing attention during the NCLB era. In part, court actions have made the work of psychometricians more transparent. The technical report for a standardized test can now exceed 100 pages. There has been a shift of emphasis from test SEM, to individual score CSEM, to IRT information as an explanation of test precision.

 “(Note that the test information function and the raw score error variance at a given level of proficiency [student  score], are analogous for the Rasch model.)” Texas Technical Digest 2005-2006, page 145. And finally, “The conditional standard error of measurement is the inverse of the information function.” Maryland Technical Report—2010 Maryland Mod-MSA: Reading, page 99.

I cannot end this without repeating that this discussion of precision is based on traditional multiple-choice (TMC) that only ranks students, a casino operation. Students are not given the opportunity to include their judgment of what they know or can do that is of value to themselves, and their teachers, in future learning and instruction, as is done with essays, problem solving, and projects. This is easily done with knowledge and judgment scoring (KJS) of multiple-choice tests.


(Continued)


- - - - - - - - - - - - - - - - - - - - -

Table26.xlsm is now available free by request.
