Interpreting Learned Feedback Patterns in Large Language Models Luke Marks Amir Abdullah Clement Neo Rauno Arike David Krueger Philip T orr

Neural Information Processing Systems 

Learned Feedback Pattern (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found