CBF-LLM: Safe Control for LLM Alignment
arXiv.org Artificial Intelligence
While large language models (LLMs) are known to have strong language understanding and generation abilities, they can also generate harmful, biased, and toxic content [1][2]. Alignment of LLMs ensures that they generate content that is "desirable" for the user, typically meaning content that is safe and ethical. Various approaches to LLM alignment have been presented ([1], [2], [3] and references therein). The dominant approach is reinforcement learning from human feedback (RLHF) [4], in which a reward model is constructed from human feedback and then used to train the LLM. Variants of the RLHF architecture have also been proposed, such as Safe-RLHF [5], SENSEI [6], and f-DPG [7]. Implementations have been presented as well, including the training of pre-trained LLMs [8][9] and applications such as information-seeking chatbots [10].
Aug-28-2024
- Country:
- North America > United States
- Washington > King County > Seattle (0.04)
- Europe > Ireland
- Leinster > County Dublin > Dublin (0.04)
- Genre:
- Research Report (0.64)
- Technology: