Friday, March 2, 2007

Surprisingly, they cheat...

Who cheats?

Well, just about anyone, if the stakes are right. You might say to yourself, I don't cheat, regardless of the stakes. And then you might remember the time you cheated on, say, a board game. Last week. Or the golf ball you nudged out of its bad lie. Or the time you really wanted a bagel in the office break room but couldn't come up with the dollar you were supposed to drop in the coffee can. And then took the bagel anyway. And told yourself you'd pay double the next time. And didn't.

For every clever person who goes to the trouble of creating an incentive scheme, there is an army of people, clever and otherwise, who will inevitably spend even more time trying to beat it. Cheating may or may not be human nature, but it is certainly a prominent feature in just about every human endeavor. Cheating is a primordial economic act: getting more for less. So it isn't just the boldface names- insider-trading CEOs and pill-popping ballplayers and perk-abusing politicians- who cheat. It is the waitress who pockets her tips instead of pooling them. It is the Wal-Mart payroll manager who goes into the computer and shaves his employees' hours to make his own performance look better. It is the third grader who, worried about not making it to the fourth grade, copies test answers from the kid sitting next to him.

Some cheating leaves barely a shadow of evidence. In other cases, the evidence is massive. Consider what happened one spring evening at midnight in 1987: seven million American children suddenly disappeared. The worst kidnapping wave in history? Hardly. It was the night of April 15, and the Internal Revenue Service had just changed a rule. Instead of merely listing each dependent child, tax filers were now required to provide a Social Security number for each child. Suddenly, seven million children- children who had existed only as phantom exemptions on the previous year's 1040 forms- vanished, representing about one in ten of all dependent children in the United States.

The incentive for those cheating taxpayers was quite clear. The same goes for the waitress, the payroll manager, and the third grader. But what about that third grader's teacher? Might she have an incentive to cheat? And if so, how would she do it?

Imagine now that you are running the Chicago Public Schools, a system that educates 400,000 students each year.

The most volatile debate among American school administrators, teachers, parents, and students concerns "high-stakes" testing. The stakes are considered high because instead of simply testing students to measure their progress, schools are increasingly held accountable for the results.

The federal government mandated high-stakes testing as part of the No Child Left Behind law, signed by President Bush in 2002. But even before that law, most states gave annual standardized tests to students in elementary and secondary school. Twenty states rewarded individual schools for good test scores or dramatic improvement; thirty-two states sanctioned the schools that didn't do well.

The Chicago Public School system embraced high-stakes testing in 1996. Under the new policy, a school with low reading scores would be placed on probation and face the threat of being shut down, its staff to be dismissed or reassigned. The CPS also did away with what is known as social promotion. In the past, only a dramatically inept or difficult student was held back a grade. Now, in order to be promoted, every student in third, sixth, and eighth grade had to manage a minimum score on the standardized, multiple-choice exam known as the Iowa Test of Basic Skills.

Advocates of high-stakes testing argue that it raises the standards of learning and gives students more incentive to study. Also, if the test prevents poor students from advancing without merit, they won't clog up the higher grades and slow down good students. Opponents, meanwhile, worry that certain students will be unfairly penalized if they don't happen to test well, and that teachers may concentrate on the test topics to the exclusion of more important lessons.

Schoolchildren, of course, have had incentive to cheat for as long as there have been tests. But high-stakes testing has so radically changed the incentives for teachers that they too now have added reason to cheat. With high-stakes testing, a teacher whose students test poorly can be censured or passed over for a raise or promotion. If the entire school does poorly, federal funding can be withheld; if the school is put on probation, the teacher stands to be fired. High-stakes testing also presents teachers with some positive incentives. If her students do well enough, she might find herself praised, promoted, and even richer: the state of California at one point introduced bonuses of $25,000 for teachers who produced big test-score gains.

And if a teacher were to survey this newly incentivized landscape and consider somehow inflating her students' scores, she just might be persuaded by one final incentive: teacher cheating is rarely looked for, hardly ever detected, and just about never punished.

How might a teacher go about cheating? There are any number of possibilities, from the brazen to the sophisticated. A fifth-grade student in Oakland recently came home from school and gaily told her mother that her super-nice teacher had written the answers to the state exam right there on the chalkboard. Such instances are certainly rare, for placing your fate in the hands of thirty prepubescent witnesses doesn't seem like a risk that even the worst teacher would take. (The Oakland teacher was duly fired.) There are more subtle ways to inflate students' scores. A teacher can simply give students extra time to complete a test. If she obtains a copy of the exam early- that is, illegitimately- she can prepare them for specific questions. More broadly, she can "teach to the test," basing her lesson plans on questions from past years' exams, which isn't considered cheating but certainly violates the spirit of the test. Since these tests all have multiple-choice answers with no penalty for wrong guesses, a teacher might instruct her students to randomly fill in every blank as the clock is winding down, perhaps inserting a long string of Bs or an alternating pattern of Bs and Cs. She might even fill in the blanks for them after they've left the room.

But if a teacher REALLY wanted to cheat- and make it worth her while- she might collect her students' answer sheets and, in the hour or so before turning them in to be read by an electronic scanner, erase the wrong answers and fill in the correct ones. (And you always thought that no. 2 pencil was for the children to change their answers.) If this kind of teacher cheating is truly going on, how might it be detected?

To catch a cheater, it helps to think like one. If you were willing to erase your students' wrong answers and fill in correct ones, you probably wouldn't want to change too many wrong answers. That would clearly be a tip-off. You probably wouldn't even want to change answers on every student's test- another tip-off. Nor, in all likelihood, would you have enough time, because the answer sheets are turned in soon after the test is over. So what you might do is select a string of eight or ten consecutive questions and fill in the correct answers for, say, one-half or two-thirds of your students. You could easily memorize a short pattern of correct answers, and it would be a lot faster to erase and change that pattern than go through each student's answer sheet individually. You might even think to focus your activity toward the end of the test, where the questions tend to be harder than the earlier questions. In that way, you'd be most likely to substitute correct answers for wrong ones.
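The scheme described above- a short block of consecutive answers corrected on a fraction of the sheets- is easy to simulate in a few lines of Python. Everything below is an invented toy (the function, the answer format, the seed), not the study's data or code; it simply makes concrete what a cheating teacher's edit would look like.

```python
import random

def cheat(answer_rows, key, start, length, fraction=0.5, seed=0):
    """Overwrite positions start..start+length with the answer key for
    roughly `fraction` of students, as the hypothetical cheater above
    might do in the hour before the sheets reach the scanner."""
    rng = random.Random(seed)
    out = []
    for row in answer_rows:
        if rng.random() < fraction:
            row = row[:start] + key[start:start + length] + row[start + length:]
        out.append(row)
    return out

# Two toy answer sheets (digits = wrong answers) and a toy answer key.
# With fraction=1.0 every sheet gets the correct block at positions 4-7.
sheets = ["1234123412", "4321432143"]
key = "abcdabcdab"
print(cheat(sheets, key, start=4, length=4, fraction=1.0))
# → ['1234abcd12', '4321abcd43']
```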

If economics is a science primarily concerned with incentives, it is also- fortunately- a science with statistical tools to measure how people respond to those incentives. All you need are some data.

In this case, the Chicago Public School system obliged. It made available a database of the test answers for every CPS student from third grade through seventh grade from 1993 to 2000. This amounts to roughly 30,000 students per grade per year, more than 700,000 sets of test answers, and nearly 100 million individual answers. The data, organized by classroom, included each student's question-by-question answer strings for reading and math tests. (The actual paper answer sheets were not included; they were habitually shredded soon after a test.) The data also included some information about each teacher and demographic information for every student, as well as his or her past and future test scores- which would prove a key element in detecting teacher cheating.

Now it was time to construct an algorithm that could tease some conclusions from this mass of data. What might a cheating teacher's classroom look like?

The first thing to search for would be unusual patterns in a given classroom: blocks of identical answers, for instance, especially among the harder questions. If ten very bright students (as indicated by past and future test scores) gave correct answers to the exam's first five questions (typically the easiest ones), such an identical block shouldn't be considered suspicious. But if ten poor students gave correct answers to the LAST five questions on the exam (the hardest ones), that's worth looking into. Another red flag would be a strange pattern within any one student's exam- such as getting the hard questions right while missing the easy ones- especially when measured against the thousands of students in other classrooms who scored similarly on the same test. Furthermore, the algorithm would seek out a classroom full of students who performed far better than their past scores would have predicted and who then went on to score significantly lower the following year. A dramatic one-year spike in test scores might initially be attributed to a GOOD teacher; but with a dramatic fall to follow, there's a strong likelihood that the spike was brought about by artificial means.
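The first of these red flags- an identical block of answers on the harder questions, shared by many students- can be sketched as a simple search. This Python sketch is illustrative only; the function name, the thresholds, and the toy classroom are assumptions, not the study's actual algorithm.

```python
from collections import Counter

def most_common_block(answers, block_len=6, tail=20):
    """Among the last `tail` positions of each student's answer string,
    find the block of `block_len` consecutive answers shared by the most
    students. Each student counts a given block at most once."""
    counts = Counter()
    for row in answers:
        tail_str = row[-tail:]
        blocks = {tail_str[i:i + block_len]
                  for i in range(len(tail_str) - block_len + 1)}
        counts.update(blocks)
    return counts.most_common(1)[0]

# A toy classroom: three of four students share "dadbcb" near the end,
# on the hard questions - exactly the pattern worth looking into.
classroom = [
    "1b2a34d4acdadbcb41",
    "d4a23acbdadbcb2300",
    "313a3ad1cdadbcb400",
    "a1221acd213acdd210",
]
block, n = most_common_block(classroom, block_len=6, tail=18)
print(block, n)  # → dadbcb 3
```

A real detector would also weigh how likely each shared block is by chance, given the students' past and future scores; this sketch only finds the block.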

Consider now the answer strings from the students in two sixth-grade Chicago classrooms who took the identical math test. Each horizontal row represents one student's answers. The letter a, b, c, or d indicates a correct answer; a number indicates a wrong answer, with 1 corresponding to a, 2 corresponding to b, and so on. A zero represents an answer that was left blank. One of these classrooms almost certainly had a cheating teacher and the other did not. Try to tell the difference- although be forewarned that it's not easy with the naked eye.

Classroom A

112a4a342cb214d0001acd24a3a12dadbcb4a000000
d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb
1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141
dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000
d12443d43232d32323c213c22d2c23234c332db4b300
db2abad1acbdda212b1acd24a3a12dadbcb400000000
d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1
1b33b4d4a2b1dadbc3ca22c00000000000000000000
d43a3a24acb1d32b412acd24a3a12dadbcb422143bc0
313a3ad1ac3d2a23431223c000012dadbcb40000000
db2a33dcacbd32d313c21142323cc30000000000000
d43ab4d1ac3dd43421240d24a3a12dadbcb40000000
db223a24acb11a3b24cacd12a241cdadbcb4adb4b300
db4abadcacb1dad3141ac212a3a1c3a1144ba2db41b43
1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4
214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000
1423b4d4a23d241314123a243a2413a214413430000
32badc2300adc3211caa321acd24a3a12dadbcb400cda
2211cddadc3400ad34221d224a24a3a12dadbcb2de11
342adbbc3da32110033acd24a3a12dadbcb400adcc12
1acdd2332aabb2321230000dc1212dadbcb4acc2331a
acc223122c00bcba3241a123a3a12dadbcb400a231a3


Classroom B

a1221acd213acdd21a3bbc3aacdd122100231000000
12213aa4abc44000231accbac231ca4bac0000000000
bcb23acd4421ba12cca432a23ab23acca21230000a00
bac312bb0032a211acddba121accaba3200a0000000
acaa3212220ad0cadbb21ab2b1acaa2101000000000
a213321000cad3112ad23bca21100123acad2100000
ad231231a1a3bacadad00acad0a2123acd3a21a00000
adcad00ad213ad3131ad33aca144ad000acad002123a
acad2212bca3b3bab1aca1200ad1ad24123aca13100a
123ac3a2123ac44100acabba12300aca114300012300
accebba212300a44123acaba2130123aca3310ca0000
122100134baac31bb34ac21100acba21aca210000010
12bbac2b31b21b101bacadaad212344acab1001ca000
12cacade123ba00a12aca12341bccada00123c000010
adacaar1200a0daccabba1230daad012aca2200a0a410
acaa0cadadaba12312b2b2cabaac1200acad000000000
acababaa00123aca2312aaccbbbaca001231a24412300
a4121323acaabbacaaa000123aacabcbbcabcadd00000
aca1b1b131aca4412babac00aaca11adaa33123000000
00aca1baba2311200aca1101abaca231230012310001
a1231caca212bbcaca21230001231001bacaca32aca00
123100adcacaddbbaca123000aca00001321bbcacca00

If you guessed that classroom A was the cheating classroom, congratulations. Here again are the answer strings from classroom A, now reordered by a computer that has been asked to apply the cheating algorithm and seek out suspicious patterns.

Classroom A (With cheating algorithm applied)

112a4a342cb214d0001ACD24A3A12DADBCB4A000000
d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb
1b2a34d4ac42d23b141ACD24A3A12DADBCB4A2134141
dbaab3dcacb1dadbc42ac2cc31012DADBCB4Adb40000
d12443d43232d32323c213c22d2c23234c332db4b300
db2abad1acbdda212b1ACD24A3A12DADBCB400000000
d4aab2124cbddadbcb1a42cca3412DADBCB423134bc1
1b33b4d4a2b1dadbc3ca22c00000000000000000000
d43a3a24acb1d32b412ACD24A3A12DADBCB422143bc0
313a3ad1ac3d2a23431223c000012DADBCB40000000
db2a33dcacbd32d313c21142323cc30000000000000
d43ab4d1ac3dd43421240d24A3A12DADBCB40000000
db223a24acb11a3b24cACD12A241DADBCB4Adb4b300
db4abadcacb1dad3141ac212a3a1c3a1144ba2db41b43
1142340c2cbddadb4b1ACD24A3A12DADBCB43d133bc4
214ab4dc4cbdd31b1b2213c4aA412DADBCB4Adb00000
1423b4d4a23d241314123a243a2413a214413430000
32badc2300adc3211caa321ACD24A3A12DADBCB400cda
2211cddadc3400ad34221d224A24A3A12DADBCB2de11
342adbbc3da32110033ACD24A3A12DADBCB400adcc12
1acdd2332aabb2321230000dc1212DADBCB4Acc2331a
acc223122c00bcba3241a123A3A12DADBCB400a231a3

Take a look at the capitalized answers. Did sixteen out of twenty-two students somehow manage to reel off the same six consecutive correct answers (the d-a-d-b-c-b string) all by themselves?

There are at least four reasons why this is unlikely. One: those questions, coming near the end of the test, were harder than the earlier questions. Two: these were mainly subpar students to begin with, few of whom got six consecutive right answers elsewhere on the test, making it all the more unlikely they would get right the same six hard questions. Three: up to this point in the test, the sixteen students' answers were virtually uncorrelated. Four: three of the students (numbers 1, 9, and 12) left at least one answer blank BEFORE the suspicious string and then ended the test with another string of blanks. This suggests that a long, unbroken string of blank answers was broken not by the student but by the teacher.

There is another oddity about the suspicious answer string. On nine of the sixteen tests, the six correct answers are preceded by another identical string, 3-a-1-2, which includes three of four INCORRECT answers. And on all sixteen tests, the six correct answers are followed by the same incorrect answer, a 4. Why on earth would a cheating teacher go to the trouble of erasing a student's answer sheet and then filling in the WRONG answer?

Perhaps she is merely being strategic. In case she is caught and hauled into the principal's office, she could point to the wrong answers as proof that she didn't cheat. Or perhaps- and this is a less charitable but likely answer- she doesn't know the right answer herself. (With standardized tests, the teacher is typically not given an answer key.) If this is the case, then we have a pretty good clue as to why her students are in need of inflated grades in the first place: they have a bad teacher.

Another indication of teacher cheating in classroom A is the class's overall performance. As sixth graders who were taking the test in the eighth month of the academic year, these students needed to achieve an average score of 6.8 to be considered up to national standards. (Fifth graders taking the test in the eighth month of the year needed to score 5.8, seventh graders 7.8, and so on.) The students in classroom A averaged 5.8 on their sixth-grade tests, a full grade level below where they should have been. So plainly these are poor students. A year earlier, however, these students did even worse, averaging just 4.1 on the fifth-grade tests. Instead of improving by one full point between fifth and sixth grade, as would be expected, they improved by 1.7 points, nearly two grades' worth. But this miraculous improvement was short-lived. When these sixth-grade students reached seventh grade, they averaged 5.5- more than two grade levels below standard and even WORSE than they did in sixth grade. Consider the erratic year-to-year scores of three particular students from classroom A:

             5th Grade Score   6th Grade Score   7th Grade Score
Student 3          3.0               6.5               5.1
Student 6          3.6               6.3               4.9
Student 14         3.8               7.1               5.6

The three-year scores from classroom B, meanwhile, are also poor but at least indicate an honest effort: 4.2, 5.1, 6.0. So either an entire roomful of children in classroom A suddenly got very smart one year and very dim the next, or, more likely, their sixth-grade teacher worked some magic with a no. 2 pencil.
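The spike-then-fall pattern in those scores can be checked mechanically. The sketch below is an illustration, not the study's method: the one-point expected annual gain comes from the text above, while the half-point slack threshold and the function name are assumptions.

```python
def spike_then_drop(scores, expected_gain=1.0, slack=0.5):
    """Flag a student whose one-year gain far exceeds the expected single
    grade-equivalent and who then falls back the following year."""
    y1, y2, y3 = scores
    big_spike = (y2 - y1) > expected_gain + slack  # gained far too much
    fall_back = y3 < y2 - slack                    # then lost ground
    return big_spike and fall_back

# Two classroom A students from the table, plus classroom B's averages:
print(spike_then_drop((3.0, 6.5, 5.1)))  # student 3 → True
print(spike_then_drop((3.8, 7.1, 5.6)))  # student 14 → True
print(spike_then_drop((4.2, 5.1, 6.0)))  # classroom B: honest gains → False
```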

There are two noteworthy points to be made about the children in classroom A, tangential to cheating itself. The first is that they are obviously in terrible academic shape, which makes them the very children whom high-stakes testing is promoted as helping the most. The second point is that these students would be in for a terrible shock once they reached the seventh grade. All they knew was that they had been successfully promoted due to their test scores. (No child left behind, indeed.) THEY weren't the ones who artificially jacked up their scores; they probably expected to do great in the seventh grade- and then they failed miserably. This may be the cruelest twist yet in high-stakes testing. A cheating teacher may tell herself that she is helping her students, but the fact is that she would appear far more concerned with helping herself.

An analysis of the entire Chicago data reveals evidence of teacher cheating in more than two hundred classrooms per year, roughly 5 percent of the total. This is a conservative estimate, since the algorithm was able to identify only the most egregious form of cheating- in which teachers systematically changed students' answers- and not the many subtler ways a teacher might cheat. In a recent study among North Carolina schoolteachers, some 35 percent of the respondents said they had witnessed their colleagues cheating in some fashion, whether by giving students extra time, suggesting answers, or manually changing students' answers.

What are the characteristics of a cheating teacher? The Chicago data shows that male and female teachers are about equally prone to cheating. A cheating teacher tends to be younger and less qualified than average. She is also more likely to cheat after her incentives change. Because the Chicago data ran from 1993 to 2000, it bracketed the introduction of high-stakes testing in 1996. Sure enough, there was a pronounced spike in cheating in 1996. Nor was the cheating random. It was the teachers in the lowest-scoring classrooms who were most likely to cheat. It should also be noted that the $25,000 bonus for California teachers was eventually revoked, in part because of suspicions that too much of the money was going to cheaters.

Not every result of the Chicago cheating analysis was so dour. In addition to detecting cheaters, the algorithm could also identify the best teachers in the school system. A good teacher's impact was nearly as distinctive as a cheater's. Instead of getting random answers correct, her students would show real improvement on the easier types of questions they had previously missed, an indication of actual learning. And a good teacher's students carried over all their gains into the next grade.

Most academic analyses of this sort tend to languish, unread, on a dusty library shelf. But in early 2002, the new CEO of the CPS, Arne Duncan, contacted the study's authors. He didn't want to protest or hush up their findings. Rather, he wanted to make sure that the teachers identified by the algorithm as cheaters were truly cheating- and then do something about it.

Duncan was an unlikely candidate to hold such a powerful job. He was only thirty-six when appointed, a onetime academic all-American at Harvard who later played pro basketball in Australia. He had spent just three years with the CPS- and never in a job important enough to have his own secretary- before becoming its CEO. It didn't hurt that Duncan had grown up in Chicago. His father taught psychology at the University of Chicago; his mother ran an afterschool program for forty years, without pay, in a poor neighborhood. When Duncan was a boy, his afterschool playmates were the underprivileged kids his mother cared for. So when he took over the public schools, his allegiance lay more with the schoolchildren and their families than with the teachers and their union.

The best way to get rid of cheating teachers, Duncan had decided, was to readminister the standardized exam. He only had the resources to retest 120 classrooms, however, so he asked the creators of the cheating algorithm to help choose which classrooms to test.

How could those 120 retests be used most efficiently? It might have seemed sensible to retest only the classrooms that likely had a cheating teacher. But even if their retest scores were lower, the teachers could argue that the students did worse merely because they were told that the scores wouldn't count in their official record- which, in fact, all retested students would be told. To make the retest results convincing, some non-cheaters were needed as a control group. The best control group? The classrooms shown by the algorithm to have the best teachers, in which big gains were thought to have been legitimately attained. If those classrooms held their gains while the classrooms with suspected cheaters lost ground, the cheating teachers could hardly argue that their students did worse only because the scores wouldn't count.

So a blend was settled upon. More than half of the 120 retested classrooms were those suspected of having a cheating teacher. The remainder were divided between the supposedly excellent teachers (high scores but no suspicious answer patterns) and, as a further control, classrooms with mediocre scores and no suspicious answers.

The retest was given a few weeks after the original exam. The children were not told the reason for the retest. Neither were the teachers. But they may have gotten the idea when it was announced that CPS officials, not the teachers, would administer the test. The teachers were asked to stay in the classroom with the students, but they would not be allowed to even touch the answer sheets.

The results were as compelling as the cheating algorithm had predicted. In the classrooms chosen as controls, where no cheating was suspected, scores stayed about the same or even rose. In contrast, the students with the teachers identified as cheaters scored far worse, by an average of more than a full grade level.

As a result, the Chicago Public School system began to fire its cheating teachers. The evidence was only strong enough to get rid of a dozen of them, but the many other cheaters had been duly warned. The final outcome of the Chicago study is further testament to the power of incentives: the following year, cheating by teachers fell more than 30 percent.