Dishonesty in Helpful and Harmless Alignment

Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn

arXiv.org Artificial Intelligence 

Humans tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, in which they receive rewards for satisfying human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies while generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can become harmful when their honesty is increased, and analyze these phenomena at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that this dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Experimental results, including GPT-4-evaluated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.
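The abstract does not specify the form of the representation regularization; the following is a minimal, hypothetical sketch only, assuming a penalty that pulls response representations toward reference "honest" representations and is added to a standard reward-seeking alignment loss. The function name, the MSE penalty, and the reg_weight value are illustrative assumptions, not the paper's method.

    import torch

    def augmented_alignment_loss(alignment_loss, hidden_states, honest_reference, reg_weight=0.1):
        """Combine a reward-seeking alignment objective with a representation penalty.

        alignment_loss: scalar tensor from the alignment objective (e.g. an RLHF-style loss).
        hidden_states: (batch, dim) representations of the model's generated responses.
        honest_reference: (batch, dim) reference representations associated with honesty
                          (hypothetical; how these are obtained is not described in the abstract).
        reg_weight: trade-off coefficient between alignment and regularization (assumed value).
        """
        # L2 penalty between current and reference representations (illustrative choice).
        reg = torch.nn.functional.mse_loss(hidden_states, honest_reference)
        return alignment_loss + reg_weight * reg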
