LLMs become more covertly racist with human intervention

MIT Technology Review 

Even when the two sentences had the same meaning, the models were more likely to apply adjectives like "dirty," "lazy," and "stupid" to speakers of AAE than to speakers of Standard American English (SAE). The models associated speakers of AAE with less prestigious jobs (or didn't associate them with having a job at all), and when asked to pass judgment on a hypothetical criminal defendant, they were more likely to recommend the death penalty.

An even more notable finding may be a flaw the study pinpoints in the way researchers try to correct such biases. To purge models of hateful views, companies like OpenAI, Meta, and Google use feedback training, in which human workers manually adjust the way the model responds to certain prompts. This process, often called "alignment," aims to recalibrate the millions of connections in the neural network and get the model to align more closely with desired values. The method works well to combat overt stereotypes, and leading companies have employed it for nearly a decade.
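The feedback training described here is preference-based: human raters compare model responses, and those comparisons steer further fine-tuning. Below is a minimal, self-contained sketch of that idea in PyTorch, fitting a toy reward model to pairwise human preferences with a Bradley-Terry loss. The featurizer, the tiny network, and the example data are illustrative assumptions, not any company's actual pipeline.

```python
# Minimal sketch of preference-based feedback training: human raters mark
# which of two responses they prefer, and a reward model is fit to those
# preferences with a pairwise (Bradley-Terry) loss. All components here are
# toy stand-ins chosen to keep the example self-contained and runnable.
import torch
import torch.nn as nn

VOCAB = ["helpful", "polite", "dirty", "lazy", "stupid", "thank", "you"]

def featurize(text: str) -> torch.Tensor:
    """Toy bag-of-words featurizer standing in for a real model's representation."""
    words = text.lower().split()
    return torch.tensor([float(words.count(w)) for w in VOCAB])

# (preferred response, rejected response) pairs, as a human rater might label them.
preference_pairs = [
    ("thank you for asking, happy to help", "that is a stupid question"),
    ("they sound like a helpful and polite speaker", "they sound dirty and lazy"),
]

reward_model = nn.Sequential(nn.Linear(len(VOCAB), 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for epoch in range(200):
    loss = torch.tensor(0.0)
    for chosen, rejected in preference_pairs:
        r_chosen = reward_model(featurize(chosen))
        r_rejected = reward_model(featurize(rejected))
        # Pairwise loss: push the preferred response's reward above the rejected one's.
        loss = loss - torch.nn.functional.logsigmoid(r_chosen - r_rejected).squeeze()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model scores new responses; higher is "more preferred."
print(reward_model(featurize("happy to help, thank you")).item())
print(reward_model(featurize("lazy and stupid")).item())
```

In a full feedback-training pipeline, the learned reward would then guide further fine-tuning of the language model itself; the sketch stops at the reward-model step to stay small.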
