When prompted to give feedback on students’ essays, AI chatbots respond differently depending on the students’ race, according to a new study led by researchers in the Graduate School of Education at Stanford University in California.
The research team asked four different large language models (LLMs) to review 600 essays from middle school students and give writing feedback. Next, they submitted each essay to the LLMs 12 more times, each time adding information about the writer’s race, gender, and motivation level, as well as whether the writer had a learning disability.
Across all four models, essays by Black students were given more praise, including comments that validated students’ personal experiences and encouraged connecting arguments to lived realities. The LLMs were also more likely to encourage Black students to connect their essays to historical and systemic contexts.
Essays by Hispanic students received feedback emphasizing English language corrections, while essays by Asian students were given feedback invoking cultural stereotypes, such as an emphasis on respect and education.
In contrast, essays by White students were more likely to receive constructive criticism that could enhance students’ writing skills, such as comments that discouraged first-person writing, emphasized objectivity, and provided insight into argument structure and evidence.
“Taken together, these patterns suggest that LLMs enact Marked Pedagogies guided by stereotypes rather than pedagogical best practices,” the authors conclude. “Instead of providing feedback at multiple levels of writing or consistently signaling constructive trust in students’ capacity to revise, LLMs differentially calibrate feedback based on presumed identities, judging not only students’ current ability, but also their capability — placing lower ceilings on growth for students of marked attributes and, by implication, constraining their educational futures.”

