Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
