Forcing LLMs to be evil during training can make them nicer in the long run

Aug-1-2025, 16:00:00 GMT–MIT Technology Review

For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs' behavior, from whether they are talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Each pattern can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior. Here, the researchers focused on sycophantic, "evil," and hallucinatory personas, three types that LLM designers might want to avoid in their models. To identify the corresponding patterns, the team devised a fully automated pipeline that can map out a persona's pattern given a brief text description of that persona.
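
To make the "long string of numbers" idea concrete, here is a minimal sketch in Python. The excerpt does not describe how the team's pipeline works internally, so everything here is an assumption for illustration: the four "neurons," the activation values, and the difference-of-means recipe (average the activations recorded while a trait is expressed, then subtract the average recorded while it is not). The function names `mean_vector`, `pattern_vector`, and `projection` are hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Average a list of equal-length activation vectors, neuron by neuron."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pattern_vector(with_trait, without_trait):
    """Difference of mean activations: one number per (toy) neuron.

    Assumption: this is a generic recipe, not necessarily the study's method.
    """
    mean_on = mean_vector(with_trait)
    mean_off = mean_vector(without_trait)
    return [a - b for a, b in zip(mean_on, mean_off)]


def projection(activation, pattern):
    """How strongly a single activation vector points along the pattern."""
    norm = sqrt(sum(p * p for p in pattern))
    return sum(a * p for a, p in zip(activation, pattern)) / norm


# Made-up activations for four hypothetical neurons.
sycophantic = [[0.9, 0.1, 0.8, 0.2], [1.1, 0.3, 0.6, 0.2]]
neutral = [[0.1, 0.2, 0.7, 0.2], [0.3, 0.2, 0.7, 0.2]]

pattern = pattern_vector(sycophantic, neutral)
print("pattern:", pattern)
print("sycophantic score:", projection(sycophantic[0], pattern))
print("neutral score:", projection(neutral[0], pattern))
```

In this toy data only the first neuron differs between the two sets, so the pattern is concentrated there, and a response recorded under the trait scores higher along the pattern than a neutral one. Real models have many thousands of such numbers per layer.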



