Wednesday, April 23, 2014

Test Scoring Math Model - Reliability

An estimate of the reliability or reproducibility of a test can be extracted from the variation within the tabled right marks (Table 25). The variance from within the item columns is related to the variance from within the student score column.

The error within items variance (2.96) and total variance (MSS) between student scores (4.08) are both obtained from columns in Table 25b (blue, Chart 68). The true variance is then 4.08 – 2.96 = 1.12.

The ratio of true variance to the total variance between scores (1.12/4.08) becomes an indicator of test reliability (0.28). This makes sense.

A test with perfect reliability (4.08/4.08 = 1.0) would have no variation, error variance = 0, within the item columns in Table 25. A test with no reliability (0.0/4.08) would show equal values (4.08) for within item columns, and between test scores.

The KR20 formula then adjusts the above value (0.28 x 21/20) to 0.29 [from a large population (n) to a small sample value (n-1)]. The KR20 ratio has no unit labels (“var/var” = “”). All of the above takes place on the upper (variance) level of the math model.
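The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `kr20_from_variances` is made up here, and the inputs are the rounded values quoted from Table 25b.

```python
def kr20_from_variances(score_variance, error_variance, n_items):
    """True variance over score variance, adjusted by n/(n-1)."""
    true_variance = score_variance - error_variance
    ratio = true_variance / score_variance
    return ratio * n_items / (n_items - 1)

# Values quoted in the post: 4.08 between scores, 2.96 within items, 21 items.
reliability = kr20_from_variances(4.08, 2.96, 21)
print(round(reliability, 2))  # 0.29
```

When the error variance equals the score variance, the function returns 0, which matches the "no reliability" case described above.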

Doubling the number of students taking the test (Chart 69) has no effect on reliability. Doubling the number of items doubles the error variance but increases the total variance by the square. The test reliability increases from 0.29 to 0.64.

The square root of the total variance between scores (4.08) yields the standard deviation (SD) for the score distribution [2.02 for (n) and 2.07 for (n-1)] on the lower floor of the math model.

- - - - - - - - - - - - - - - - - - - - - 

The Best of the Blog - FREE
  • The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.
  • This blog started seven years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns is on a second level.
  • Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls. Quick Start

Wednesday, March 5, 2014

Test Scoring Math Model - Variance

The first thing I noticed when inspecting the top of the test scoring math model (Table 25) was that the variation within the central cell field has a different reference point (external to the data) than the variation between scores in the marginal cell column (internal to the data). Also the variation within the central cell field (the variance) is harvested in two ways: within rows (scores) and within columns (items).

The mean sum of squared deviations (MSS) or variance within a column or a row has a fixed range (Chart 64 and Chart 65). The maximum occurs when the marks are 1/2 right and 1/2 wrong (1/2 x 1/2 = 1/4 or 25%). [Variance also equals p * q or (Right * Wrong)/(Right + Wrong)^2.] The contribution each mark makes to the variance is distributed along this gentle curve. The variable data are fit to a rigid model.
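A short Python sketch makes the fixed range visible. It assumes a 0/1 mark pattern, for which the variance is the proportion right times the proportion wrong; the function name `mark_variance` and the 20-mark pattern are hypothetical.

```python
def mark_variance(right, wrong):
    """Variance (MSS) of a 0/1 mark pattern with the given right and wrong counts."""
    total = right + wrong
    p = right / total   # proportion right
    q = wrong / total   # proportion wrong
    return p * q

# Variance for 0 to 20 right marks out of 20: zero at the ends, 0.25 in the middle.
curve = [mark_variance(r, 20 - r) for r in range(21)]
print(max(curve))            # 0.25
print(curve[0], curve[20])   # 0.0 0.0
```

The values rise from zero to 25% and fall back to zero, which is the gentle curve the marks are distributed along.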

I obtained the overall shape of these two variances by folding Chart 64 and Chart 65 into Photo 64-65.  The result is a dome or a depression above or below the upper floor of the model.

The peak of the dome (maximum variance) is reached when a student functioning at 50% marks an item with 50% difficulty. Standardized test makers try to maximize this feature of the model. The larger the mismatch between item difficulty and student ability, the lower down the position of the variance on the dome. Computer adaptive testing (CAT) attempts to adjust item difficulty to match student preparedness.

Chart 66 is a direct overhead view of the dome. Elevation lines have been added at 5% intervals from zero to 25%. I then fitted the data from Nursing124 to the roof of the model. The data only spread over one quadrant of the model. The data could completely cover the dome in an ideal situation in which every combination of score and difficulty occurred.

The total test variance within items is then the sum of the variance within all items (0.04 to 0.25 = 2.96). The total test variance within scores is the sum of the variance of all scores (0.05 to 0.24 = 3.33). See Table 8.
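The two totals can be harvested from any 0/1 mark matrix in the same way. The sketch below uses a small made-up matrix (4 students by 5 items), not the Nursing124 data, and again assumes each pattern's variance is p * q.

```python
# Hypothetical 0/1 mark matrix: one row per student, one column per item.
marks = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]

def pq(pattern):
    """Variance of a 0/1 mark pattern: proportion right times proportion wrong."""
    p = sum(pattern) / len(pattern)
    return p * (1 - p)

columns = list(zip(*marks))                      # item mark patterns
within_items = sum(pq(col) for col in columns)   # summed over all items
within_scores = sum(pq(row) for row in marks)    # summed over all students

print(round(within_items, 4), round(within_scores, 4))
```

The same marks give two different totals because the columns and the rows slice the central cell field in different directions.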

The math model adjusts to fit the data in the marginal cell student score column (variance between scores). The reference point is not a static feature of the model but the average test score (16.77 or 80%). The plot of the variance between scores can be attached to the right side of the math model (Chart 67).

The variance within columns and rows spreads across the static frame of the model. The model then adjusts to fit the variance between scores (rows) to match the spread of the active within rows.

I can see another interpretation of the model variance if the dome is inverted as a depression. Read as a flight instrument on a blimp, with pitch, roll, and yaw standing for the variance within item (2.96), within score (3.31), and between scores (4.10), the blimp would have the nose up, be rolled to the side, and have the rudder hard over.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Wednesday, February 19, 2014

Test Scoring Math Model - Input

The mathematical model (Table 25) in the previous post relates all the parts of a traditional item analysis including the observed score distribution, test reproducibility, and the precision of a score. Factors that influence test scores can be detected and measured by the variation between and within selected columns and rows.

The model is only aware of variation within and between mark patterns (deviations from the mean). The variance (the sum of squared deviations from the mean divided by the number summed or the mean sum of squares or MSS) is the property of the data that relates the mark patterns to the normal distribution. This permits generating useful descriptive and predictive insights.

The deviation of each mark from the mean is obtained by subtracting the mean from the value of the mark (Table 25a). The squared deviation value is then elevated to the upper floor of the model (Step 1, Table 25b). [Un-squared deviations from the mean would add up to zero.]
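A few lines of Python show why the deviations must be squared before they are elevated. The five scores are hypothetical and chosen so the mean is a whole number.

```python
scores = [20, 18, 17, 16, 14]   # hypothetical student scores
mean = sum(scores) / len(scores)

deviations = [s - mean for s in scores]   # 3, 1, 0, -1, -3
squared = [d ** 2 for d in deviations]    # 9, 1, 0, 1, 9

print(sum(deviations))                # 0.0, the raw deviations cancel out
print(sum(squared) / len(squared))    # 4.0, the mean sum of squares (MSS)
```

The raw deviations always cancel, so only the squared values carry information about the spread.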


The model’s operation gains meaning by relating the score and item mark distributions to a normal distribution. It compares observed data to what is expected from chance alone or as I like to call it, the know-nothing mean.

The expected know-nothing mean based on 0-wrong and 1-right with 4-option items (popular on standardized tests) is centered on 25%, 6 right out of 24 questions (Chart 62). This is from luck on test day alone (students only need to mark each item; they do not need to read the test) on a traditional multiple-choice test (TMC). The mean moves to 50% if student ability and item difficulty have equal value. It moves to 80% if students are functioning near the mastery level as seen in the Nursing124 data. The math model will adjust to fit these data.

The know-nothing mean, with Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM), is at 50% for a high quality student or 25% for a low quality student (same as TMC). Scoring is 0-wrong, 1-have yet to learn, and 2-right.  A high quality student accurately, honestly, and fairly reports what is trusted to be useful in further instruction and learning. There are few, if any, wrong marks. A low quality student performs the same on both methods of scoring by marking an answer on all items. Students adjust the test to fit their preparation.

The know-nothing mean for Knowledge Factor (KF) is above 75% (near the mastery level in the Nursing124 data, violet). KF weights knowledge and judgment as 1:3, rather than 1:1 (KJS) or 1:0 (TMC). High-risk examinees do not guess. Test takers are given the same opportunity as teachers and test makers to produce accurate, honest, and fair test scores.

The distribution of scores about the know-nothing mean is the same for TMC (green, Chart 63) and KJS (red, Chart 63). An unprepared student can expect, on average, a score of 25% on a TMC test with 4-option items. Some 2/3 of the time the score will fall within +/- 1 standard deviation of 25%. As a rule of thumb, the standard deviation (SD) on a classroom test tends to be about 10%. The best an unprepared student can hope for is a score over 35% (25 + 10) about 1/6 of the time ((1 - 2/3)/2).
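The rule of thumb can be compared with an exact calculation. The sketch below is an assumption layered on the post, not part of it: it treats every mark as an independent random guess (a binomial model) on the 24-item, 4-option test mentioned earlier. The function names are made up for illustration.

```python
from math import comb, sqrt

def know_nothing(n_items, n_options):
    """Mean and SD (in marks) of a score from marking every item at random."""
    p = 1 / n_options
    mean = n_items * p
    sd = sqrt(n_items * p * (1 - p))
    return mean, sd

def chance_of_at_least(k, n_items, n_options):
    """Exact binomial probability of k or more right marks by luck alone."""
    p = 1 / n_options
    return sum(comb(n_items, r) * p**r * (1 - p)**(n_items - r)
               for r in range(k, n_items + 1))

mean, sd = know_nothing(24, 4)
print(mean, round(sd, 2))   # 6.0 2.12

# 9 right out of 24 is the first whole score above 35%.
print(round(chance_of_at_least(9, 24, 4), 3))
```

Under this model the SD is about 2.12 marks, or roughly 9% of 24 items, which is close to the 10% rule of thumb.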

The know-nothing mean (50%) for KJS and the PCRM is very different from TMC (25%) for low quality students. The observed operational mean at the mastery level (above 80%, violet) is nearly the same for high quality students electing either method of scoring. High quality students have the option of selecting items they can trust they can answer correctly. There are few to no wrong marks. [Totally unprepared high quality students could elect to not mark any item for a score of 50%.]

The mark patterns on the lower floor of the mathematical model have different meanings based on the scoring method. TMC delivers a score that only ranks the student’s performance on the test. KJS and the PCRM deliver an assessment of what a student knows or can do that can be trusted as the basis for further learning and instruction. Quantity (number right) and quality (portion marked that are right) are not linked. Any score below 50% indicates the student has not developed a sense of judgment needed to learn and report at higher levels of thinking.

The score and item mark patterns are fed into the upper floor of the mathematical model as the squared deviation from the mean (d^2). [A positive deviation of 3 and a negative deviation of 3 both yield a squared deviation of 9.] The next step is to make sense of (to visualize, to relate) the distributions of the variance (MSS) from columns and rows.

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, February 5, 2014

Test Scoring Mathematical Model

The seven statistics reviewed in previous posts need to be related to the underlying mathematics. Traditional multiple-choice (TMC) data analysis has been expressed entirely with charts and the Excel spreadsheet VESEngine. I will need a TMC math model to compare TMC with the Rasch model IRT that is the dominant method of data analysis for standardized tests.

A mathematical model contains the relationships and variables listed in the charts and tables. This post applies the advice in learning discussed in the previous post. It starts with the observed variables. The mathematical model then summarizes the relationships in the seven statistics.

The model contains two levels (Table 25). The first floor level contains the observed mark patterns. The second floor level contains the squared deviations from the score and item means; the variation in the mark patterns. The squared values are then averaged to produce the variance. [Variance = Mean sum of squares = MSS]

1. Count

The right marks are counted for each student and each item (question). TMC: 0-wrong, 1-right captures quantity only. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) capture quantity and quality: 0-wrong, 1-have yet to learn this, 2-right.
Hall JR Count = SUM(right marks) = 20   
Item 12 Count = SUM(right marks) = 21  

2. Mean (Average)

The sum is divided by the number of counts (N = 22 students and n = 21 items).
The SUM of scores / N = 16.77; 16.77/n = 0.80 = 80%
The SUM of items / n = 17.57; 17.57/N = 0.80 = 80%
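The count and mean steps can be sketched on a small 0/1 mark matrix. The matrix below is hypothetical (4 students by 5 items); it is not Table 25.

```python
# Hypothetical 0/1 mark matrix: one row per student, one column per item.
marks = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
N = len(marks)       # number of students
n = len(marks[0])    # number of items

scores = [sum(row) for row in marks]               # right marks per student
item_counts = [sum(col) for col in zip(*marks)]    # right marks per item

mean_score = sum(scores) / N        # average student score, in marks
mean_item = sum(item_counts) / n    # average item count, in marks

# Both routes reduce to the same fraction of right marks in the whole table.
print(round(mean_score / n, 2), round(mean_item / N, 2))   # 0.65 0.65
```

Both means collapse to the same percentage because each one divides the same total of right marks by N * n.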

3. Variance

The variation within any column or row is harvested as the deviation between the marks in a student (row) or item (column) mark pattern, or between student scores, with respect to the mean value. The squared deviations are summed and averaged as the variance on the top level of the mathematical model (Table 25).
Variance = SUM(Deviations^2)/(N or n) = SUM of Squares/(N or n) = Mean SS = MSS

4. Standard Deviation

The variation within a score, item, or probability distribution is expressed as a normal value: the mean +/- 1 standard deviation (1SD) includes about 2/3 of a normal, bell-shaped distribution.

SD = Square Root of Variance or MSS = SQRT(MSS) = SQRT(4.08) = 2.02

For small classroom tests the (N-1) SD = SQRT(4.28) = 2.07 marks

The variation in student scores and the distribution of student scores are now expressed on the same normal scale.
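The two SD values can be reproduced from the rounded variance. The sketch assumes the (N-1) figure comes from rescaling the MSS by N/(N-1) with N = 22 students, which gives approximately the 4.28 quoted above; the function names are hypothetical.

```python
from math import sqrt

def sd_population(mss):
    """SD when the sum of squares was divided by N."""
    return sqrt(mss)

def sd_sample(mss, count):
    """Rescale the MSS from a divisor of N to N - 1 before taking the root."""
    return sqrt(mss * count / (count - 1))

print(round(sd_population(4.08), 2))   # 2.02
print(round(sd_sample(4.08, 22), 2))   # 2.07
```

The gap between the two values shrinks as the number of students grows, which is why the adjustment matters mainly for small classroom tests.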

5. Test Reliability

The ratio of the true variance to the score variance estimates the test reliability: the Kuder-Richardson 20 (KR20). The score (marginal column) variance – the error (summed from within Item columns) variance = the true variance.

KR20 = ((score variance – error variance)/score variance) x (n/(n – 1))
KR20 = ((4.08 – 2.96)/4.08) x (21/20) = 0.29

This ratio is returned to the first floor of the model. An acceptable classroom test has a KR20 > 0.7. An acceptable standardized test has a KR20 >0.9.

6. Traditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall is the standard error of measurement (SEM). The traditional SEM is based on the average performance of your class: 16.77 +/- 1SD (+/- 2.07 marks).

SEM = SQRT(1-KR20) * SD = SQRT(1- 0.29) * 2.07 = +/-1.75 marks

On a test that is totally reliable (KR20 = 1), the SEM is zero. You can expect to get the same score on a retest.
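The SEM formula is easy to check in Python. The function name `sem` is made up for illustration, and the inputs are the rounded values from the text, so the last digit can differ from a spreadsheet that carries full precision.

```python
from math import sqrt

def sem(kr20, sd):
    """Traditional standard error of measurement, in marks."""
    return sqrt(1 - kr20) * sd

print(round(sem(0.29, 2.07), 2))   # 1.74 with these rounded inputs
print(sem(1.0, 2.07))              # 0.0 for a totally reliable test
```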

7. Conditional Standard Error of Measurement

The range of error in which 2/3 of the time your retest score may fall based on the rank of your test score alone (conditional on one score rank) is the conditional standard error of measurement (CSEM). The estimate is based (conditional) on your test score rather than on the average class test score.

CSEM = SQRT((Variance within your Score) * n number of questions) = SQRT(MSS * n) = SQRT(SS)
CSEM = SQRT(0.15 * 21) = SQRT(3.15) = 1.80 marks

The average of the CSEM values (1.75) for all of your class (light green) also yields the test SEM. This confirms the above calculation for 6. Traditional Standard Error of Measurement for the test.
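A sketch of the CSEM calculation follows. It assumes the variance within a score is p * q for that student's 0/1 mark pattern. The score of 17 right out of 21 is a hypothetical example chosen because its variance rounds to 0.15; the function name `csem` is also made up.

```python
from math import sqrt

def csem(right, n_items):
    """Conditional SEM, in marks, for a score of `right` out of `n_items`."""
    p = right / n_items
    variance = p * (1 - p)       # variance within this score's mark pattern
    return sqrt(variance * n_items)

print(round(csem(17, 21), 2))    # 1.8
print(csem(21, 21))              # 0.0 for a perfect score
```

A perfect score has no variance within its mark pattern, so its CSEM is zero.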

This mathematical model (Table 25) separates the flat display in the VESEngine into two distinct levels. The lower floor is on a normal scale. The upper floor isolates the variation within the marking patterns on the lower floor. The resulting variance provides insight into the extent that the marking patterns could have occurred by luck on test day and into the performance of teachers, students, questions, and the test makers. Limited predictions can also be made.

Predictions are limited using traditional multiple-choice (TMC) as students have only two options: 0-wrong and 1-right. Quantity and quality are linked into a single ranking. Knowledge and Judgment Scoring (KJS) and the partial credit Rasch model (PCRM) separate quantity and quality: 0-wrong, 1-have yet to learn, and 2-right. Students are free to report what they know and can do accurately, honestly, and fairly.

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, January 29, 2014

Test Scoring Myths for Students

The best test is a test that permits you to accurately, honestly, and fairly report what you know and can do. You know how to question, to get answers, and to verify. You know what you know and what you have yet to learn. This operates at two levels of thinking. It is a myth that a forced choice multiple-choice test measures what you trust you know and can do.

At the beginning of any learning operation, you learn to repeat and to recall. Next you learn to relate the bits you can repeat and recall. By the end of a learning operation you have assembled a web of skills and relationships. You start at lower levels of thinking and progress to higher levels of thinking. Practice takes you from slow conscious operations to fast automatic responses (multiplication or roller skating). It is a myth that learning primarily occurs only by responding to a teacher in a classroom.

Your attitude during learning and testing is important. Your maturity is indicated by your ability to get interested in new topics or activities your teacher recommends (during the course). As a rule of thumb, a positive attitude is worth about one letter grade on a test. It is a myth that you can easily learn when you have a negative attitude.

Your expectations are important. You tend to get what you expect. A nine year study with over 3000 students indicated that students tend to get the grade they expected at the time they enrolled in the class, based on their lack of information, misinformation, and attitude. It is a myth that you cannot do better than your preconceived grade.

Learning and testing are one coordinated event when you can see the result of your practicing directly (target practice or skateboarding). This situation also occurs when you are directly tutored by a person or by a person’s software. It is a myth that you must always take a test separately from learning.

Complex learning operations go through the same sequence of learning steps. The rule of three applies here. Read or practice from one source to get the basic terms or actions. Read or practice from a second set to add any additional terms or actions. Read or practice from a third set to test your understanding, your web of knowledge and skill relationships. It is a myth that you must always have another person test your learning (but another person can be very helpful).

That other person is usually a teacher who cannot teach and test each pupil or student individually. The teacher also selects what is to be learned rather than letting you make the choice. The teacher also selects the test you will take. It is a myth that your teachers have the qualities needed to introduce you to the range of skills and knowledge required for an honest, self-supporting citizen.

Teaching usually takes place during scheduled time periods. In extreme situations, only what is learned in those scheduled time periods will be scored. This is one basis for assessing teacher effectiveness. It is a myth that the primary goal of traditional schools is student learning and development.

Traditional multiple-choice is defective. It was crippled when the option of no response, “do not know”, was eliminated when adapted from its use with animal experiments to make classroom scoring easier. It is a myth that you should not have this option to permit accurate, honest, and fair assessment.
Traditional multiple-choice promotes selecting the best right answer: using the lowest levels of thinking. The minimum requirement is making a mark for each question. It is a myth that such a score measures what you know or can do. The score ranks you on the test.

The average test score describes the test, not you. (Table 15 or Download)
Your score may rank you above or below average. It is a myth that you will always be safe with an above average score (passing).

The normal distribution of multiple-choice test scores is based on your luck on test day. The normal distribution is desired for classes in schools designed for failure. It is a myth that a class should not have an average score of 90%.

Luck on test day will distribute 2/3 of your classmates’ multiple-choice scores within the bubble in the center of a normal distribution; that is one standard deviation (SD) from the average. (Table 15 or Download) [SD = SQRT(Variance) and the Variance = SUM(Deviation from the Average^2)/N = Mean Sum of Squares = MSS]

Your grade (cut score) is set by marking off the distribution of classmate scores in standard deviations: F (< -2); D (-2 to -1); C (-1 to +1); B (> +1); A (> +2). Your raw score grade is the sum of what you know and can do, your luck on test day, and your set of classmates.

Raw scores can be adjusted by shifting their distribution, higher or lower, and by stretching (or shrinking) the distribution to get a distribution that “looks right”. It is a myth that your teacher can only select the right mix of questions to get a raw score distribution that “looks right”.

Some questions perform poorly. They can be deleted and a new, more accurate, scored distribution created. It is a myth that every question must be retained.

Discriminating questions are marked right only by high scoring classmates and marked wrong by low scoring classmates. (Table 15 or Download) It is a myth that all questions should be discriminating.

Discriminating questions produce your class raw score distribution. About 5 to 10 are needed to create the amount of error that yields a range of five letter grades. It is a myth that discriminating questions assess mastery.

The reliability (reproducibility, precision) of your raw score can be predicted, but not your final (adjusted) score. Test reliability (KR20) is based on the ratio of variation (the variance) from between student scores (external column) and within question difficulty mark patterns (internal columns). (Table 15 or Download)

This makes sense: The smaller the amount of error variance within the question difficulty internal columns, with respect to the variance between student scores in the external column, the greater the test reliability. Discriminating, difficult, questions spread out student scores more (yield higher variance) than they increase the error variance within the questions. If there were no error variance, a test would be totally reliable (KR20 = 1).  It is a myth that a good informative test must maximize reliability.

The test reliability can help predict the average test score your class would get if it were to take another test over the same set of skills and knowledge. The Standard Error of Measurement (SEM) of your test is the range of error (from all of the above effects) for the average test score. (Table 15 or Download) The SD of the test and the test reliability are combined to obtain the SEM. The test reliability extracts a portion of the SD. If the test reliability were 1 (totally reliable), the SEM would be 0 (no error), the class would be expected to get the same class test score on a retest.

And finally, what can you expect about the precision of your score and your retest score (providing you have not learned any more)? A retest is of critical importance to students needing to reach a high stakes cut score. If the SEM or CSEM ranges widely enough, you do not need to study. Just retake the test a couple of times and your luck on test day may get you a passing score. It is a myth that the probability of you getting a passing grade 2/3 of the time will ensure you get the passing grade if you need a second trial.

The Conditional [on your raw score] Standard Error of Measurement (CSEM) extracts the variance from only your mark pattern (Table 22). [CSEM = SQRT(Variance within your marks x the number of questions)] Your CSEM will be very small if you have a very high or low score. This limits the prospects of a passing score by retaking a test without studying.

Now to study, to change testing habits, or to trust to luck on test day, before a retest. Get a copy of the blueprint used in designing the test. A blueprint lists in detail what will be covered and the type of questions. Question each topic or skill. It is easier to answer questions other people have written if you have already created and answered your own questions. Use the advice in the first five paragraphs above and work up into higher levels of thinking and meaning making (a web of relationships that makes sense to you; visualize, sketch, and draw every term).

A change in testing habits may also be in order. Many students who do not “test well” are bright, fast memorizers, but lacking in meaningful relationships that make sense to themselves. They are still learning for someone else (the test) and scanning each question for the “one right answer”. With meaningful relationships in mind you have the information in hand to answer a number of related questions. You are not limited to just matching what you recall to the question answers. [Mark out wrong answers and guess from the remaining answers.]

And now for the “Hail Mary” approach. First, as a rule of thumb, your score on a test written by someone other than your teacher (a standardized test for example) will be one to two letter grades below your classroom test scores. If your failing test score is within 1 SEM of the cut score, you can expect a retest score within this range 2/3 of the time. The same prediction is made with your CSEM value that can range above and below the SEM value. If your failing test score is below 1 SEM or 1 CSEM from the cut score, you have no option other than to study. It is a myth that students passing a few points above the cut score will also pass on a retest. [Near passes are safe. Near failures are not.]

Also please keep in mind that all of the math dealing with the variation between and within columns and rows (the variance) can be done on the student and question mark patterns with no knowledge of the test questions or the students. It is a myth that good statistical procedures can improve poor question or student performance. Teacher and psychometrician judgment on the other hand can do wonders!

The standardized test paradox: A good blueprint to guide calibrated question selection for the test is the basis for low scores and a statistically reliable test. Good student preparation is the basis for high scores (mastery) and a statistically unreliable test (it cannot spread student scores out enough for the distribution to “look right”).

The sciences, engineering, and manufacturing use statistics to reduce error to a minimum (low maintenance cars, aircraft, computers, and telephones). Only in traditional institutionalized education (schools designed for failure) is error intentionally introduced to create a score range that “looks right” for setting grades and ranking schools. This is all non-sense for schools designed for mastery (who advance students after they are prepared for the next steps). It is a myth (and an entrenched excuse for failure by the school) that student score distributions must fit a normal, bell-shaped, curve of error.

Mastery schools are now being promoted as the burden of record keeping is easily computerized. The Internet makes mastery schools available everywhere and at anytime. This will have a marked change in traditional schooling in the next few years. This change can be seen in the “flipped” classroom (a modern version of assigned [deep] reading before class discussion). It is a myth that the “flipped” classroom is something new.

Current educational software removes the time lag, in the question-answer-and-verify learning cycle, introduced by grouping students in classes, and then extended with standardized tests. Learning and assessment are again joined to promote mastery of assigned skills and knowledge.  Students advance when they are ready to succeed at the next levels. It is a myth that “formative assessments” are actually functional when test results are not available in an operational time frame (seconds to a few days).

Standardized tests will continue to rank students and schools, as the tests mature to certifying mastery for students who learn and excel anywhere and at anytime. It is a myth that current substantive standardized tests (that do not let students report what they trust they know or can do) can “pin point exactly what a student knows and needs to learn”.

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, November 6, 2013

The Value and Meaning of a Mark

The bet in the title of Catherine Gewertz’s article caught my attention: “One District’s Common-Core Bet: Results Are In”. As I read, I realized that the betting that takes place in traditional multiple-choice (TMC) was being given arbitrary valuations to justify the difference between a test score and a classroom observation. If the two agreed, that was good. If they did not agree, the standardized test score was dismissed.

TMC gives us the choice of a right mark and several wrong marks. Each is traditionally given a value of 1 or 0. This simplification, carried forward from paper and pencil days, hides the true value and the meanings that can be assigned to each mark.

The value and meaning of each mark changes with the degree of completion of the test and the ability of the student. Consider a test with one right answer and three wrong answers. This is now a popular number for standardized tests.

Consider a TMC test of 100 questions. The starting score is 25, on average. Every student knows this. Just mark an answer to each question. Look at the test and change a few marks, that you can trust you know, to right. With good luck on test day, get a score high enough to pass the test.
If a student marked 60 correctly, the final score is 60. But the quality of this passing score is also 60%. 

Part of that 60% represents what a student knows and can do, and part is luck on test day. A passing score can be obtained by a student who knows or can do less than half of what the test is assessing; a quality below 50%. This is traditionally acceptable in the classroom. [TMC ignores quality. A right mark on a test with a score of 100 has the same value, but not the same meaning as a right mark on a test with a score of 50.]

A wrong mark can also be assigned different meanings. As a rule of thumb (based on the analysis of variance, ANOVA; a time honored method of data reduction), if fewer than five students mark a wrong answer to a question, the marks on the question can be ignored. If fewer than five students make the same wrong mark, the marks on that option can be ignored. This is why Power Up Plus (PUP) does not report statistics on wrong marks, but only on right marks. There is no need to clutter up the reports with potentially interesting, but useless and meaningless information.

PUP does include a fitness statistic not found in any other item analysis report that I have examined. This statistic shows how well the test fits student preparation. Students prepare for tests; but test makers also prepare for the abilities of test takers.

The fitness statistic estimates the score a student is expected to get if, on average, as many wrong options are eliminated as are non-functional on the test, before guessing; with NO KNOWLEDGE of the right answer. This is the best guess score. It is always higher than the design score of 25. The estimate ranged from 36% to 53%, with a mean of 44%, on the Nursing124 data.  Half of these students were self-correcting scholars. The test was then a checklist of how they were expected to perform.

With the above in mind, we can understand how a single wrong mark can be devastating to a test score. But a single wrong mark, not shared by the rest of the class can be taken seriously or ignored (just as a right mark, on a difficult question, by a low scoring student).

To make sense of TMC test results requires both a matrix of student marks and a distribution of marks for each question (Break Out Overview). Evaluating only an individual student report gives you no idea whether a student missed a survey question that every student was expected to answer correctly or a question that the class failed to understand.

Are we dealing with a misconception? Or a lack of performance related to different levels of thinking in class and on the test; or related to the limits of rote memory to match an answer option to a question? [“It’s the test-taking.”] When does a right mark also mean a right answer or just luck on test day? [“This guy scored advanced only because he had a lucky day.”]

Mikel Robinson, as an individual, failed the test by 1 point. Mikel Robinson, as one student in a group of students, may not have failed. [We don’t really know.] His score just fell on the low side of a statistical range (the conditional standard error of measurement; see a previous post on CSEM). Within this range, it is not possible to differentiate one student’s performance from another’s using current statistical methods and a TMC test design (students are not asked if they can use the question to report what they can trust they actually know or can do).

We can say that, if he retook the test, his probability of passing may be as high as 50%, or more, depending upon the reliability and other characteristics of the test. [And the probability that those who passed by 1 point would then fail by 1 point on a repeat of the test would be the same.]
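A rough sketch of this retest argument follows. It rests on assumptions the post does not state: measurement error is treated as normal around an assumed true score, the CSEM comes from Lord's binomial formula, and the 50-question test with a cut score of 30 is hypothetical (it is not Mikel Robinson's actual test).

```python
import math


def lord_csem(raw_score, n_items):
    """Lord's binomial CSEM for a raw score on an n_items test (an assumption)."""
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))


def prob_pass_on_retest(assumed_true_score, cut_score, sem):
    """P(retest score >= cut) with normal error around the assumed true score."""
    z = (cut_score - assumed_true_score) / sem
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal curve


# Hypothetical test: 50 questions, cut score 30, observed score 29.
sem = lord_csem(29, 50)
print(round(sem, 2))                            # width of the "can't tell" band
print(round(prob_pass_on_retest(29, 30, sem), 2))  # true score taken as 29
print(prob_pass_on_retest(30, 30, sem))         # true score at the cut: 0.5
```

The answer depends on what we assume his true score to be. If it sits right at the cut score, the chance of passing on a retest is 50%; the further the true score is above or below the cut, relative to the CSEM, the further the chance moves from 50%.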

These problems are minimized with accurate, honest, and fair Knowledge and Judgment Scoring (KJS). You can know when a right mark is a right answer using KJS or the partial credit Rasch model IRT scoring. You can know the extent of a student’s development: the quality score. And, perhaps more important, your students can trust what they know and can do too; during the test, as well as after the test. This is the foundation on which to build further long-lasting learning. This is student empowerment.

Welcome to the KJS Group: Please register at Include something about yourself and your interest in student empowerment (your name, school, classroom environment, LinkedIn, Facebook, email, phone, and etc.).

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS:, 606 KB or, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - - 

Other free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, October 30, 2013

Growth Mindset

The article by Sarah D. Sparks starts with a powerful concept: “It’s one thing to say all students can learn, but making them believe it – and do it – can require a 180-degree shift in student’s and teacher’s sense of themselves and of one another.”

The General Studies Remedial Biology course I taught faced this challenge. The course was scheduled at night for three consecutive hours in a 120-seat lecture room. I refused to teach the course until the following arrangements were made:
  • The entire text was presented as online reading assignments, delivered by cable to each dormitory room and by phone service off campus.
  • One hour was scheduled for my lecture, after any student presentations related to the scheduled topic. 
  • One hour was scheduled for written assessment every other week.
  • One hour was scheduled for 10-minute student oral reports based on library research, actual research, or projects.

After the first few semesters, students requested that the assessment period be placed in the first hour instead of the second hour. This turned the course into a seminar for which students needed to prepare on their own before class.

Only Knowledge and Judgment Scoring (KJS) was used the first few semesters, with ready acceptance by the class. The policy of bussing in students from out of the Northwest Missouri region brought in protestors, “Why do we have to know what we know, when everywhere else on campus, we just mark, and the teacher tells us how many right marks we made?”

Offering both methods of scoring, traditional multiple-choice (TMC) and KJS, on the same test solved that problem. Students could select the method they felt most comfortable with; that matched their preparation the best.

The student presentations and reports were excellent models for the rest of the class. They showed the entire class the interest in the subject and the quality of work these students were doing.

KJS provided the information needed to guide passive pupils along the path to becoming self-correcting scholars. As a generality, that path took the shape of a backward J. First they made fewer wrong marks, next they studied more, and finally they switched from memorizing nonsense to making sense of each assignment.

Over time they learned they were now spending less time studying (reviewing everything) and getting better grades by making sense as they learned; they could actually build new learning on what they could trust they had learned. They could monitor their progress by checking their quality score and their quantity score. Get quality up, interest and motivation increase, and quantity follows.
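The two scores students watched can be illustrated with a small sketch. It assumes that quantity is the percent of all questions marked right and quality is the percent of the marks made that are right; the function name and the sample counts are hypothetical, and the full KJS test score is not computed here.

```python
def quality_and_quantity(right, wrong, total_questions):
    """Return (quality %, quantity %) under the assumptions stated above.

    quality  = right marks as a percent of the marks the student chose to make
    quantity = right marks as a percent of all questions on the test
    """
    marked = right + wrong
    if marked > total_questions:
        raise ValueError("more marks than questions")
    quantity = 100.0 * right / total_questions
    quality = 100.0 * right / marked if marked else 0.0
    return quality, quantity


# Hypothetical 50-question test, early and late in the semester.
print(quality_and_quantity(20, 5, 50))  # (80.0, 40.0): few marks, mostly right
print(quality_and_quantity(30, 5, 50))  # quantity rises as more is reported
```

A student who omits what he or she cannot trust can hold a high quality score while the quantity score is still low, and then watch quantity follow.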

The tradition of students comparing their score with that of the rest of the class to see if they were safe, or needed to study more, or had a higher grade than expected when enrolling in the course (and could take a vacation), was strong in the fall semester with the distraction of social groups, football and homecoming. The results of fall and spring semesters were always different.

There was one dismal failure. With the excellent monitoring of their progress in the course, the idea was advanced to recognize class scholars. These students had, in one combination or another of test scores and presentations, earned a class score that could not be changed by any further assessment. They had demonstrated their ability to make sense of biological literature (the main goal of the course, which, hopefully, would serve them well the rest of their lives, as would the habit of making sense of assignments in their other courses). The next semester all went as planned. Most continued in the class and some conducted study sessions for other students.

The following semester witnessed an outbreak of cheating. Today, Power Up Plus (PUP) gets its name from the original cheat checker added to Power UP. Cheating became manageable by the simple rule that any answer sheet that failed to pass the cheat checker would receive a score of zero. I offered to help any student who wished to protest the rule to the student disciplinary committee. No student ever protested.

[Cheating was handled in class because the administration did not honor any use of the university rules. You must catch individual students in the act. Computer cheat checkers had the same status as red light cameras do now. If more than one student is caught, the problem is with the instructor, not with the student. We cancelled the class scholar idea.]

We need effective tools to manage student “growth mindset”. The tools must be easy to use by students and faculty. Students need to see how other students succeed, to be comfortable in taking part, and be able to easily follow their progress when starting at the low end of academic preparation of knowledge, skills, and judgment (quality, the use of all levels of thinking).

A common thread runs through successful student empowerment programs: Effective instruction is based on what students actually know, can do, and want to do or to take part in. This requires frequent appropriate assessment at each academic level such as, in general, these recent examples:

- - - - - - - - - - - - - - - - - - - - - 

Other free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):