Wednesday, June 26, 2013

Visual Education Statistics - Item Mark Patterns


                                                             16
The last post stated, “Lower individual PBR values result from mixing right and wrong marks in an item pattern. Wider score distributions make possible longer item mark patterns.” I was curious about just how this happens.

I marked Item 30 in Table 19 at five locations. The top location contained four right marks (1s). This location was then changed to wrong marks (0s), the four right marks were moved one count lower, and a visual education statistics engine (VESE) table was developed. The process was then repeated at each of the three remaining lower locations.

The above process took an item with an unmixed mark pattern (14 right and 26 wrong) and mixed wrong marks into four lower locations, each one right-count lower in score than the last. I moved four marks at a time because it took that many to get a measurable result on all six statistics with the standard deviation (SD) set at 4, or 10%, on a test of 40 students and 40 items (Chart 40).

I did the same thing with the SD set at 2, or 5% (Chart 41), where the lowering effect on the item PBR is greater. But an SD of 5% is not a realistic value. The effect of mixing right and wrong marks would be even smaller with the SD set at 8, or 20%, with 40 students and 40 items. My assumption, at this point, is that the mixing of right and wrong marks will be of little concern in large tests such as standardized traditional multiple-choice (TMC) tests.

Chart 42 shows an interesting observation. Mixing just one count makes no change in the individual PBR for Item 30. The reason for this can be seen in Table 19. When a right mark tied to a student raw score of 30 is mixed with the next lower location at 29, the math is 30 - 1 = 29 and 29 + 1 = 30. The student scores do not change; the students getting those scores do.
The deeper the mixing, that is, the further the right marks are moved down the student score scale, the lower the individual PBR. In contrast, the individual PBR increases, up to a point, the further an unmixed mark pattern descends or lengthens.
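Here is a minimal sketch of the effect, not the VESE spreadsheet itself. The "rest" scores (each student's count on the other 39 items) are invented so that adjacent students differ by at most one count; Table 19 packs its marks under a normal curve instead. As in Chart 42, moving the lowest right mark down one count leaves the PBR unchanged, while deeper moves lower it.

```python
import numpy as np

def pbr(item, totals):
    """Pearson correlation between a 0/1 item column and the student total scores."""
    return np.corrcoef(item, totals)[0, 1]

# 40 students' counts on the other 39 items, ranked low to high (invented).
rest = np.repeat(np.arange(17, 38), 2)[1:41]      # 17, 18, 18, 19, 19, ..., 37

item = np.array([0] * 26 + [1] * 14)              # unmixed: top 14 students right
print("unmixed       PBR =", round(pbr(item, rest + item), 2))

# Move the lowest right mark 1, 2, and 4 students further down the ranking.
for depth in (1, 2, 4):
    mixed = item.copy()
    mixed[26] = 0                                 # take the right mark away here...
    mixed[26 - depth] = 1                         # ...and give it to a lower student
    print(f"mixed {depth} down  PBR =", round(pbr(mixed, rest + mixed), 2))
```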

Items 26 to 31 in Table 19 show how this happens. An S-shaped or sigmoid curve is etched into Table 19 with bold 1s. Each item is less difficult as you go from Item 31 to Item 26 (0.25 to 0.75). Each mark pattern lengthens linearly.

[The number of mark patterns was 10 at 5% student score SD and 20 at 10% student score SD.]

The PBR and individual item variance increase to a point and then decrease (Chart 43). That point is the 70% average student score set for the test. The average test score sets the limit for individual item PBRs. In this table, under optimum conditions, that limit is a PBR of 0.73, which leaves plenty of room for classroom tests that generally run from 0.10 to 0.50.

Item 29 shows a difficulty of 0.45 and a variance of 0.25. Item 28 shows a difficulty of 0.55 and a variance also of 0.25. They fall equidistant from the mean item difficulty of 20 counts, or 0.50. The junction of mean student score and mean item difficulty sets the PBR limit.
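A quick check of that symmetry, assuming the item variance here is the usual p(1 - p) for a 0/1-scored item: difficulties equally far above and below 0.50 carry identical variances, and the peak sits at 0.50.

```python
# Item variance for a right-mark proportion p, assuming variance = p(1 - p).
for p in (0.25, 0.45, 0.50, 0.55, 0.75):
    print(p, round(p * (1 - p), 4))   # 0.45 and 0.55 both give 0.2475, about 0.25
```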

This has practical implications. The further away the average student score is from 50%, the lower the limit on item discrimination (PBR).

In Table 19 an unmixed mark pattern can only be 12 counts long before the PBR starts to decrease. If the test score had been 50%, the mark pattern could have been 20 counts long and the PBR 1.00, or 100% (as shown in previous posts).

This all comes back to the need for discriminating items to produce efficient tests: tests using the fewest items to rank students using TMC. The problem is that we do not create discriminating items. We can create items, but it is student performance that develops their PBR. This provides useful descriptive information from classroom tests. The development of PBR values is often distorted on standardized tests, under conditions that range from pure gambling to severe stress.

It does not have to be that way. By offering Knowledge and Judgment Scoring (KJS), or its equivalent, students can report what they actually know and can do; what they trust as the foundation for further learning and instruction. The test then reveals student quantity and quality, misconceptions, the classroom level of thinking, and teacher effectiveness; not just a ranking.

Most students can function with high quality even though the quantity can vary greatly. The quality goal of the CCSS movement can be assessed using current efficient technology once students are permitted to make an individualized, honest and fair report of their knowledge and skills using multiple-choice; just like they do on most other forms of assessment.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


  

Wednesday, June 19, 2013

Visual Education Statistics - Student Development


                                                             15
The visual education statistics engine (VESE) is now capable of producing a statistical signature for a course using traditional multiple-choice (TMC) and Knowledge and Judgment Scoring (KJS). 

I selected two scenarios, each exploring three consecutive tests. All items are set for maximum discrimination (right and wrong marks are not mixed). All student score distributions are normal. Both courses start with an average score of 50% and end with an average score of 70%. A standard deviation of 10% is considered normal and convenient for setting grades.

The first scenario is a class that starts with students of relatively equal abilities (Chart 36). As the course progresses the score distribution widens. This is the natural consequence of the better students doing better and the poorer students lagging behind; a typical result when using TMC, which primarily just ranks students. [A good example of how evolution actually works: the self-empowered survive.]

The second scenario is a class that starts with students spread out widely (Chart 37). As the course progresses the score distribution narrows. This is the natural consequence of good student development; one of the results from switching from TMC to KJS where students are empowered to report what they actually know and trust as the basis for further instruction and learning.
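A hypothetical way to rough out score distributions like those in Charts 36 and 37 is sketched below. The middle test's mean (60%) and the starting and ending SDs (5% and 10%) are my assumptions for illustration; the author fitted the 40 x 40 mark tables by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def test_scores(mean_pct, sd_pct, students=40, items=40):
    """Draw one class's scores (counts out of `items`) from a normal curve."""
    raw = rng.normal(mean_pct / 100 * items, sd_pct / 100 * items, students)
    return np.clip(np.rint(raw), 0, items).astype(int)

# Scenario 1 (Chart 36): equal abilities at the start, spreading out.
for mean, sd in ((50, 5), (60, 7.5), (70, 10)):
    print(f"mean {mean}%  SD {sd}% ->", np.sort(test_scores(mean, sd)))

# Scenario 2 (Chart 37): a wide spread at the start, narrowing.
for mean, sd in ((50, 10), (60, 7.5), (70, 5)):
    print(f"mean {mean}%  SD {sd}% ->", np.sort(test_scores(mean, sd)))
```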

The statistical signatures I found are shown in Charts 38 and 39. In a traditional class the test reliability (KR20), the average item discrimination (PBR), the standard deviation (SD), and the standard error of measurement (SEM) all increased in value. The controlling factor was the spread of student scores.

The SD captures the spread of student scores. In these two scenarios the SD was set to increase or decrease with the average student score, as required by the score distributions in Charts 36 and 37. [The two signatures are not perfect continuations due to rounding errors and my inability to fit the 40 x 40 = 1600 marks under smooth normal curves.]
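For readers who want to reproduce a signature of their own, here is a minimal sketch (mine, not the author's spreadsheet) that computes the four signature statistics from a students x items 0/1 mark matrix. Variance conventions (N versus N - 1) differ between programs, so the numbers will only approximate the VESE output.

```python
import numpy as np

def signature(marks):
    """Return SD, KR20, SEM, and average item PBR for a students x items 0/1 matrix."""
    n_students, k = marks.shape
    totals = marks.sum(axis=1)                    # student raw scores
    sd = totals.std(ddof=1)                       # student score SD (in counts)
    p = marks.mean(axis=0)                        # item difficulties
    kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / totals.var(ddof=0))
    sem = sd * np.sqrt(1 - kr20)                  # standard error of measurement
    pbrs = [np.corrcoef(marks[:, j], totals)[0, 1]
            for j in range(k) if marks[:, j].std() > 0]
    return sd, kr20, sem, float(np.mean(pbrs))

# Illustrative only: each student answers every item with a probability equal
# to an assumed ability, giving a correlated 40 x 40 mark table near 70% right.
rng = np.random.default_rng(0)
ability = np.sort(rng.normal(0.7, 0.1, 40))
marks = (rng.random((40, 40)) < ability[:, None]).astype(int)
print(signature(marks))
```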

Individual item discrimination (PBR) is not the controlling factor as it has been set to the maximum for each item. [A visualization of individual item PBR and average item PBR is needed here. Lower individual PBR values result from mixing right and wrong marks in an item mark pattern. Wider score distributions (larger SDs) make possible longer item mark patterns. An item mark pattern is visualized in the next post.]

These statistical results are interesting. A traditional class ends with increasing test reliability but a decreasing ability to separate student performance with the SEM. A class that ends with most students empowered (to question, to find answers, and to verify) shows lower test reliability and an increasing ability to separate student performance with the SEM. This makes sense.

These two scenarios also shed light on teacher effectiveness. Both classes reached the traditional goal of mastery for schools designed for failure. The first, I would imagine, under the direction of traditional instruction aimed at the center of the class. The second would require either special attention to lower performing students or empowering most students to become self-correcting, high-achieving learners; the goal of the Common Core State Standards (CCSS) movement.

 - - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, June 12, 2013

Visual Education Statistics - Item Number Limits


                                                              14
Adding more items with the same difficulties to a perfect-world test on the VES Engine (Table 18) did not change the average student score (50%), the standard deviation (SD of 15.39%), the standard error (SE of 3.44%), or the average item discrimination (PBR of 0.30). The test reliability (KR20) improved, and the standard error of measurement (SEM) made a marked improvement (Chart 35).

This makes sense. The more items on a test, the greater the test reliability; the greater the test reliability, the smaller the range into which repeated student testing scores can be expected to fall. Doubling the number of items twice, from 20 to 80, dropped the SEM from 5.39% to 2.64%. Doubling twice again, to 320 items, again cut the value in half, to 1.32%.
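A minimal sketch of the arithmetic behind that halving, assuming the usual relations SEM = SD x sqrt(1 - r) and the Spearman-Brown prophecy r_k = k·r / (1 + (k - 1)·r), seeded with the perfect-world 20-item values from Table 18 (SD = 15.39%, KR20 = 0.877). The results approximate, rather than exactly reproduce, Chart 35.

```python
from math import sqrt

sd, r20 = 15.39, 0.877                           # 20-item perfect-world values

for items in (20, 40, 80, 160, 320):
    k = items / 20
    r = k * r20 / (1 + (k - 1) * r20)            # Spearman-Brown prediction
    sem = sd * sqrt(1 - r)                       # SEM = SD * sqrt(1 - reliability)
    print(f"{items:>3} items  KR20 = {r:.3f}  SEM = {sem:.2f}%")
```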

The Common Core State Standards (CCSS) movement is now bringing into practice testing with an average difficulty of 50%. This optimizes test performance, but bullies students.

A class of 20 students, IMHO, can produce usable results if eight 40-item tests are used during the course. With an SEM of 1.32%, scores from the same student would only need to be about three SEMs (3 x 1.32%, or roughly 4%) apart to show acceptable improvement in performance.

Testing companies can then market a single test, with a total of 80 to 160 items, which will rank students and teachers with acceptable precision based on test scores. Each student will have to read every item on paper. Computer adaptive testing (CAT) will generally require fewer items than that, which means CAT students will not take the same test.

Again, testing is optimized for the testing companies, which are only being required to rank students. They can calibrate items on a group of representative students. They can then present different items, comparable only in difficulty, as equivalent items. This only makes sense if every student has the same general background and preparation and is an average student with average luck on test day. The practice reduces individuality and eliminates creativity. It does not have to be that way.

Armed with the above ability to rank students, testing companies are also marketing more tests: formative, summative, and in between “submative” (neither formative nor summative). The same items can be used on all three. The difference is that the formative process takes place in such a timely manner that the student learns (in seconds to minutes at higher levels of thinking and in minutes to days at lower levels of thinking). The summative test measures what has happened, not what is being learned at the moment.

The “submative” test falls in between as a subtest, but again measures the past. IMHO it also hints that buying such a test is better, in the short term, for school administrators than letting a good teacher assess in a normal classroom. Relying on short-term, lower-level-of-thinking tests that only rank students does not promote the development students need to become successful, self-educable, high-quality achievers. (CCSS movement multiple-choice questions may be highly contrived, requiring considerable problem-solving skill, but they are still scored more leniently than a bingo operation: good luck finding the right answer, with 1 chance in 4 free instead of 1 in 25.)

It does not have to be that way. The very same items can be scored to promote student development, function as formative experiences, and provide immediate guidance for teaching. Just because testing companies can deliver high-quality rankings does not mean we should limit the return on the time and money invested (by students, teachers, and taxpayers) to just ranking. This cripples schooling. A decade of NCLB experience presents the evidence here.

As suggested in the previous post, we need more than 20 test items and, IMHO, a test scored for what students trust they actually know and can do, such as Power UP Plus by Nine-Patch Multiple-Choice, the partial-credit Rasch model by Winsteps, and Amplifire by Knowledge Factor.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


[I again checked the test reliability values with the Spearman-Brown prophecy formula (Table 19). At this high end of the range, they closely matched the results from the VES Engine. The test with 20 items made four predictions that were increasingly close to the observed (x1) test reliability.]


Wednesday, June 5, 2013

Visual Education Statistics - Student Number Limits


                                                             13
Adding more tests from students with the same abilities to the VES Engine (Table 18) did not change the average student score, standard deviation (SD), test reliability (KR20 or Pearson r), standard error of measurement (SEM), or average item discrimination (PBR). It did change the stability of the data. A rule of thumb is that data become reasonably stable when the count reaches 300.

Above 300 the count becomes representative of what can be expected if all possible students were tested. But no student or class wants to be representative. All want to be above average. All want their best luck on test day when using traditional multiple-choice (TMC).

Although individual students do not benefit from testing increasing numbers, teachers, schools, and test makers do. The SD divided by the square root of the number of tests yields the standard error of the test score mean (SE).
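A rough sketch of that relation, using the perfect-world Table 18 values (SD = 15.39%, KR20 = 0.877): the SE shrinks with the square root of the number of students, while the SEM, which depends on reliability rather than on the student count, stays essentially flat. The chart values differ slightly because the engine recomputes its statistics at each count.

```python
from math import sqrt

sd, kr20 = 15.39, 0.877                          # perfect-world values (%)
sem = sd * sqrt(1 - kr20)                        # tied to reliability, not to N

for n in (20, 40, 80, 160, 320):
    se = sd / sqrt(n)                            # standard error of the mean
    print(f"{n:>3} students  SEM = {sem:.2f}%  SE = {se:.2f}%")
```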

Chart 34 shows a slight curve for the SD and SEM. This comes from dividing by N – 1 rather than N; the effect disappears once the count passes 100. The SE is smaller than the SD and SEM and shows a marked change for the better as more tests are counted. It easily permits finding differences between groups of students when enough students are tested.

The SD, SEM and the SE have the same predictive distributions. About 2/3 of student scores are expected to fall within plus/minus one SD (15.39% for a test of 20 students) of the mean. If a student could repeat the test, with no learning from previous tests, 2/3 of the repeats would be expected to fall within plus/minus one SEM (5.39% for a test of 20 students) of the mean. These values (expected 2/3 of the time) cover too wide a range (30.78% and 10.78%) to permit separating individual student performance from year to year.

The SE is different. Starting with 20 students, the SEM and SE are fairly close. But with 320 students the SE (0.84%) is more than six times as sensitive in detecting differences between groups of students as the SEM (5.27%) is in detecting differences between individual students.

These values are all from perfect world data (Table 18) where all students earn the same low score or high score. Item discrimination is set at the maximum. The test is performing at its best (average student score and item difficulty of 50%, test reliability at 0.877, and average item discrimination at 0.30). With only 20 items, these data indicate to me that individual student performance cannot be divided into different groupings by a perfect world SEM and therefore cannot be divided with actual classroom data either.

These data also call into question whether the SE can separate group performance for individual classes, individual teachers, and individual schools. The counts are just too small. Teachers with large classes, or with several sections, have an advantage over those with a small class.

Adding more students to a test is of little benefit to individual students. It is of benefit to teachers, schools, and test makers. For students we need more test items and, IMHO, a test scored for what students trust they actually know and can do, such as Power UP Plus by Nine-Patch Multiple-Choice, the partial-credit Rasch model by Winsteps, and Amplifire by Knowledge Factor.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):