PROOF POINTS: A smarter robo-grader

The Hechinger Report is a national nonprofit newsroom that reports on one topic: education.

The best kind of expertise might be personal experience.

When the research arm of the U.S. Department of Education wanted to learn more about the latest advances in robo-grading, it decided to hold a competition. In the fall of 2021, 23 teams, many of them Ph.D. computer scientists from universities and corporate research laboratories, competed to see who could build the best automatic scoring model.

One of the six finalists was a team of just two 2021 graduates from the Georgia Institute of Technology. Prathic Sundararajan, 21, and Suraj Rajendran, 22, met during an introductory biomedical engineering class freshman year; both had studied artificial intelligence. To ward off boredom and isolation during the pandemic, they entered a half-dozen hackathons and competitions, using their know-how in machine learning to solve problems in prisons, medicine and auto sales. They kept winning.

“We hadn’t done anything in the space of education,” said Sundararajan, who noticed an education competition on the Challenge.Gov website. “And we’ve all suffered through SATs and those standardized tests. So we were like, Okay, this will be fun. We’ll see what’s under the hood, how do they actually do it on the other side?”

The Institute of Education Sciences gave contestants 20 question items from the 2017 National Assessment of Educational Progress (NAEP), a test administered to fourth and eighth graders to track student achievement across the nation. About half the questions on the reading test were open-response rather than multiple choice, and those written answers were scored by humans.

Rajendran, now a Ph.D. student at Weill Cornell Medicine in New York, thought he might be able to re-use a model he had built for medical records that used natural language processing to decipher doctors’ notes and predict patient diagnoses. That model relied on a natural language processing model called GloVe, developed by scientists at Stanford University.

Together the duo built 20 separate models, one for each open-response question. First they trained their models by having them digest the scores that humans had given to thousands of student responses on these exact same questions. One, for example, was:  “Describe two ways that people care for the violin at the museum that show the violin is valuable.”
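The one-model-per-question setup can be sketched in a few lines. This is a minimal illustration, not the team's actual pipeline (they started from GloVe embeddings): it fits a separate classifier for a single question on toy, invented responses and human scores, using scikit-learn's TF-IDF and logistic regression as stand-ins.

```python
# Sketch of one-model-per-question training: each open-response
# question gets its own classifier, fit on human-assigned scores.
# Data and function names are illustrative, not from the competition.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical human-scored responses to one question, e.g.
# "Describe two ways that people care for the violin at the museum..."
responses = [
    "They keep it in a glass case and control the humidity.",
    "People clean it gently and a guard watches it all day.",
    "It is old.",
    "The violin is brown.",
]
human_scores = [2, 2, 0, 0]  # scores assigned by human graders

def train_question_model(texts, scores):
    """Fit one scoring model for one open-response question."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, scores)
    return model

model = train_question_model(responses, human_scores)
print(model.predict(["They dust it and keep it behind glass."])[0])
```

In the real competition each of the 20 models was trained on thousands of human scores for its question, not four; the structure, though, is the same one model per item.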

When they tested their robo-graders, the accuracy was poor.

“The education context is different,” said Sundararajan. Words can have different meanings in different contexts, and the algorithms weren’t picking that up.

Sundararajan and Rajendran went back to the drawing board to look for other language models. They happened upon BERT.

BERT is a natural language processing model developed at Google in 2018 (yes, they found it by Googling). It’s what Google uses to interpret search queries, but the company shares the model as free, open-source code. Sundararajan and Rajendran also found another model called RoBERTa, a modified version of BERT that they thought might work even better, but they ran out of time before submissions were due on Nov. 28, 2021.

When the prizes were announced on Jan. 21, it turned out that all the winners had selected BERT, too. The technology is a sea change in natural language processing. Think of it like the new mRNA technology that has revolutionized vaccines.  Much the way Moderna and Pfizer achieved similar efficacy rates with their COVID vaccines, the robo-grading results of the BERT users rose to the top.

“We got extremely high levels of accuracy,” said John Whitmer, a senior fellow with the Federation of American Scientists serving at the Institute of Education Sciences. “With the top three, we had this very nice problem that they were so close, as one of our analysts said, you couldn’t really fit a piece of paper between them.”

Essays and open responses are notoriously difficult to score because there are infinite ways to write, and even great writers might prefer one style over another. Two well-trained humans typically agreed on a student’s writing score on the NAEP test 90.5 percent of the time. The best robo-grader in this competition, produced by a team from the Durham, N.C.-based testing company Measurement Inc., agreed with human judgment 88.8 percent of the time, a discrepancy only 1.7 percentage points wider than between the humans themselves.

Sundararajan and Rajendran’s model agreed with the humans 86.1 percent of the time, 4.4 percentage points shy of the human-to-human agreement rate. That earned them a runner-up prize of $1,250. The top three winners each received $15,000.
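The agreement figures above are simple to compute: the share of responses where two graders assign the identical score. A small illustration, with invented toy numbers rather than competition data:

```python
def exact_agreement(scores_a, scores_b):
    """Fraction of items where two graders assign the same score."""
    assert len(scores_a) == len(scores_b)
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Toy example: a human grader vs. a model on ten responses
human = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
model = [2, 1, 0, 1, 1, 1, 0, 2, 2, 0]
print(f"{exact_agreement(human, model):.1%}")  # prints 80.0%
```

The competition's 90.5 percent human-to-human figure is this same calculation applied between two human scorers; the robo-graders were measured against the human scores the same way.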

The older generation of robo-grading models tended to focus on specific “features” that we value in writing, such as coherence, vocabulary, punctuation or sentence length. It was easy to game these systems by writing gibberish that happened to hit the criteria the robo-grader was looking for. Still, a 2014 study found that these feature-based models worked reasonably well.
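A sketch of that older approach, with invented features and weights (no vendor's actual feature set), also shows why it was gameable: wordy, well-punctuated gibberish can outscore a plain, on-topic answer.

```python
import re

def handcrafted_features(essay):
    """Surface features of the kind older graders relied on
    (illustrative choices, not any real grader's feature set)."""
    words = re.findall(r"[A-Za-z']+", essay)
    n = max(len(words), 1)
    return {
        "word_count": len(words),
        "unique_word_ratio": len(set(w.lower() for w in words)) / n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "punctuation_marks": len(re.findall(r"[.,;:!?]", essay)),
    }

def feature_score(essay, weights):
    """Weighted sum of surface features -- easy to game."""
    feats = handcrafted_features(essay)
    return sum(weights[k] * v for k, v in feats.items())

weights = {"word_count": 0.05, "unique_word_ratio": 2.0,
           "avg_word_length": 0.3, "punctuation_marks": 0.2}
real = "The museum keeps the violin in a case and cleans it carefully."
gibberish = "Magnificent perspicacious violin; extraordinary, venerable museum!"
print(feature_score(real, weights), feature_score(gibberish, weights))
```

With these toy weights the gibberish scores higher than the sensible answer, because long rare words and extra punctuation are exactly what the features reward.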

BERT is much more accurate. Its drawback is that it’s a black box. With the feature models, you could see that an essay scored lower because, for example, it lacked good punctuation. With BERT models, there’s no explanation of why an essay scored the way it did.

“If you try to understand how that works, you’ve got to go back and look at the billions of relationships that are made and the billions of inputs in these neural networks,” said Whitmer.

That makes the model useful for scoring an exam, but not useful for teachers in grading school assignments because it cannot give students any concrete feedback on how to improve their writing.

BERT models also fared poorly at grading more than a single question. As part of the competition, contestants were asked to build a “generic” model that could score open responses to any question. But the best of these generic models replicated human scoring only about half the time. It was not a success.

The upside is that humans are not going away. At least 2,000 to 5,000 human scores are needed to train an automated scoring model for each open-response question, according to Pearson, which has been using automated scoring since 1998. In this competition, contestants had 20,000 human scores to train their models. The time and cost savings kick in when test questions are re-used in subsequent years. The Department of Education currently requires humans to score student writing and it held this competition to help decide whether to adopt automated scoring on future administrations of the NAEP test.

Bias remains a concern with all machine learning models. The Institute of Education Sciences confirmed that Black and Hispanic students weren’t faring any worse under the algorithms than under human scorers in this competition. But the goal was to replicate human scores, which can themselves be influenced by bias. Scorers on standardized exams aren’t told a student’s race, ethnicity or gender, but it’s certainly possible to make assumptions based on word choice and syntax. By training the computerized models on scores from fallible humans, we could be baking those biases into the robo-graders.

Sundararajan graduated in December 2021 and is now working on blood pressure waveforms at a medical technology startup in California. After conquering educational assessment, he and Rajendran turned their attention to other timely challenges. This month, they won first place in a competition run by the Centers for Disease Control. They analyzed millions of tweets to identify people who had suffered past trauma and to gauge whether their Twitter communities were serving as helpful support groups or destructive influences.