Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
