FLUNKING THE TEST
Dayton Daily News
The computer grader gave a top score to my banana-eating purple imaginary friend, even though the description made no sense in a middle school-level essay. But according to the computer, the gibberish-filled essay displayed better writing than an on-topic, sentimental ode to summers past - an essay for which I gave my best effort.
As the testing industry hurtles toward more computer scoring of student essays, a Dayton Daily News examination of one of the leading computer scoring programs revealed serious flaws. To control costs, testing companies are spending millions to develop essay exams that do not need human scorers.
Ohio has begun exploring the technology for possible use on state-sponsored tests, asking its testing contractors to experiment with automated essay scoring programs as part of their contracts. "It's clearly the future," said Mitchell Chester, Ohio's assistant state school superintendent who oversees the testing program. "We've asked our contractors to pilot it so we can get a better handle on it."
Such programs are far from perfect, as the Dayton Daily News test showed. Major test companies like CTB/McGraw Hill and the Educational Testing Service offer local school districts - including some in the Miami Valley - the option to have some of their students take diagnostic tests like the Stanford 9 over the Internet, where writing can be scored by computer.
To test one such program, the Daily News submitted an essay intentionally filled with nonsense. The Educational Testing Service's Criterion computer program scored it a 6 out of 6. Another essay, written with earnest effort, was scored a 5 out of 6 by the computer.
Critics said the Dayton Daily News' comparisons demonstrate the testing industry may be pushing too hard to cut costs.
James Popham, a retired UCLA professor, former head of a testing company and author of books about standardized testing, finds it "troubling" that major companies use what he said are flawed computer scoring programs even on a limited basis. "What you have demonstrated . . . is that this particular scoring system stinks," he said. "It suggests that the profit motive has overridden good sense and professional integrity."
Defenders of the technology argue that computers can judge the quality of writing, but further advances are needed to make the scoring more reliable. Randy Bennett, an Educational Testing Service researcher, cautioned that electronic testing and scoring is still in its infancy. "I don't doubt that some can be fooled quite easily," he said of today's essay scoring programs. "Some essay scoring software does this reasonably well, though not as well as carefully done human grading."
Computer: Nonsense essay perfect
In the newspaper's test, a professional writer with 14 years' experience - me - first tried to write an essay that would receive a top score. The computer wasn't overly impressed. The scoring report said my writing was a "strong essay" but graded it just a 5 out of 6, saying I needed to work on using language more effectively to signal direction and tie my argument together. The computer also said better sentences could improve my effectiveness.
Meanwhile, the computer gave a 6 to the essay I intentionally filled with nonsense but wrote in a format similar to that of an Educational Testing Service essay that earned a perfect 6. My nonsense essay was long - 342 words, much like the 318-word ETS example - and it included the attributes the computer is programmed to look for: long paragraphs, transitional words, and a vocabulary a bureaucrat would love. The computer was unfazed by bizarre and illogical references throughout the essay to workout video king Richard Simmons, a shoe-eating television interloper, alien beings and green swamp toads.
Rachael Murdock, an English teacher at Dayton's Stivers School for the Arts, apparently prefers essays that make a point. She scored my best-effort essay a 6+ on a scale of 1-6. The nonsense essay received a 1.
The newspaper asked Murdock to grade three essays - my best try, an example of a top-scoring essay from ETS and the essay I intentionally filled with nonsense. Murdock is one of about 2,000 Ohio teachers to earn the rigorous certification of the National Board for Professional Teaching Standards, which tests content knowledge and reviews classroom practices of applicants.
Murdock praised my serious essay for its "sophisticated rhetoric that builds a persuasive argument" and for its "strong individual voice." She cited it for "consistent control over the elements of composition" with good, specific examples that "act as imbedded transitions, connecting the parts of the essay."
The ETS example - the one the computer scored a 6 - received a 4 from Murdock. She said it was "organized," but with "no meaningful connections between parts." She said the essay's "transitions are formulaic" and the essay is "very general and lacks examples and original insights."
"Clearly, essays are rewarded for technical conformity but punished for creative thought and original composition," she said. "An essay with a variety of sentence structures and some transition words is going to score well, whether it makes sense or not."
As for the nonsense essay, Murdock said it "lacks any coherent organization, with no clear transitions from point to point, paragraph to paragraph, or sentence to sentence." The nonsensical phrases, she said, "are extremely distracting."
Scoring system sticks to basics
The computer's failure to recognize nonsense in an essay is less of a failing of technology than it is an indictment of the simplistic methods used by testing companies to score hundreds of essays a day by hand, said Alfie Kohn, a critic of standardized testing and author of a new book, What Does It Mean To Be Well Educated?
The North Carolina-based workers who score Ohio writing tests are hired seasonally and paid $9.50 an hour. They are required only to have a college degree by Measurement Inc., the Durham, N.C., company that hires them.
Like the computer, human scorers make judgments about essay quality that don't require the careful reading needed to consider the creativity or impact of the student's argument, critics said. In writing software to score essays, companies attempt to mimic human scorers, many of whom have never been teachers. The computer scoring system sticks to the basics of writing.
"The complexity of even an 8-year-old's thoughts are beyond them," Kohn said. "There's more concern for punctuation than ideas."
But defenders of the technology point out that the computer programs compare well with human graders in studies. In fact, the Graduate Management Admission Test, used to qualify students for business school, already uses computerized scoring, with essays graded by both computer and human scorers.
Rich Swartz, ETS' head of technology products, said computer programming begins with a study of up to 1,000 essays scored by human readers. The essays are examined for patterns of language to find those that are common to essays in each scoring category. From about 70 common essay attributes, an analysis yields about 12 that are the best predictors of essays that would be scored best by humans.
Vocabulary words are especially important. "A lot of the prediction is based on the words students use to answer the question," he said. "Students at the top of the scale are using a different set of words than students at the bottom of the scale."
Human consequences
Computerized essay scoring is inevitable as the standards movement grows, said Tom Lasley, dean of the University of Dayton's college of education. "The problem we've got in the U.S. is we've created an assessment mania and good assessment is expensive," Lasley said. In Indiana, the first state to incorporate ETS' computer essay scoring, a study estimated a 25 percent savings in test scoring costs the first year.
Lasley said the basic essay format encouraged by standardized tests doesn't recognize creativity, but has value because students learn to organize their writing. But he said high-stakes tests - those used for major decisions like whether a student should graduate - must be scored by people rather than machines. "When you make a decision about a young person's life with significant consequences, we've got an obligation to have human eyes look at a human product that has human consequences," he said.
Swartz said the technology is quickly improving. The latest version of Criterion, due out at summer's end, kicked out my nonsense essay for human evaluation because it suspected it was gibberish, he said. Such programs are now designed to look for attributes common to phony essays, he said.
"None of these computer programs can read," Swartz said. "They cannot distinguish sense from nonsense. It will get better, but we're a long way from computers actually reading stuff."
Staff writer Mark Fisher contributed to this story.
Copyright © 2011 Cox Media Group Ohio, Dayton, Ohio, USA. All rights reserved.