Does Self-Evaluation Enable Wireheading in Language Models?
