Forcing LLMs to be evil during training can make them nicer in the long run

MIT Technology Review 

For this study, Lindsey and his colleagues worked to lay down some of that groundwork. Previous research has shown that various dimensions of LLMs' behavior, from whether they are talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the simulated neurons that constitute LLMs. Each pattern can be written down as a long string of numbers, in which each number represents how active a specific neuron is when the model is expressing that behavior. Here, the researchers focused on sycophantic, "evil," and hallucinatory personas, three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out the pattern for a persona given a brief text description of it.
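To make the "long string of numbers" idea concrete, here is a minimal toy sketch in Python. The article does not describe how the pattern is computed, so the approach shown here is an assumption for illustration only: subtract the average neuron activity recorded without the trait from the average recorded with it. The function names, the four-neuron "model," and all of the numbers are hypothetical.

```python
from statistics import fmean

def mean_activation(samples):
    """Average each neuron's activity across samples (one list of floats per sample)."""
    return [fmean(neuron) for neuron in zip(*samples)]

def persona_pattern(with_trait, without_trait):
    """Assumed method: difference of mean activations.
    Positive entries mark neurons that are more active when the trait is expressed."""
    mu_on = mean_activation(with_trait)
    mu_off = mean_activation(without_trait)
    return [a - b for a, b in zip(mu_on, mu_off)]

def projection(activation, pattern):
    """Dot product of one activation with the pattern: a rough trait score."""
    return sum(a * p for a, p in zip(activation, pattern))

# Made-up data: 4 neurons, 2 recorded samples per condition.
sycophantic = [[1.0, 0.25, 0.5, 0.0], [0.5, 0.25, 0.5, 0.25]]
neutral = [[0.25, 0.25, 0.5, 0.0], [0.25, 0.25, 0.5, 0.25]]

pattern = persona_pattern(sycophantic, neutral)
print(pattern)  # [0.5, 0.0, 0.0, 0.0]
```

In this toy data only the first neuron differs between the two conditions, so it is the only nonzero entry in the pattern. Scoring a new activation with `projection` then gives a higher number the more that neuron fires. A real model has many thousands of neurons per layer, but the bookkeeping is the same.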