Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
