Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
