Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
