Control Barrier Function for Aligning Large Language Models

arXiv.org Artificial Intelligence

Abstract: This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the token predicted by the baseline LLM, intervening in the generated text. The safety filter has two significant advantages: it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM, and if an evaluation model for the desired alignment exists, that model can be used directly in the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.

I. Introduction

While large language models (LLMs) are known to have strong language understanding, reasoning, and writing abilities, they can also generate harmful, biased, toxic, or unethical content [1], [2]. Alignment of LLMs ensures that they generate content that is "desirable" for the user, meaning that the content is ethical and safe. Various approaches to LLM alignment have been presented (see [1], [2], [3] and references therein). The major approach to LLM alignment is reinforcement learning from human feedback (RLHF) [4], where a reward model is constructed from human feedback and then used for training LLMs. Variants of RLHF have also been proposed, such as Safe-RLHF [5], SENSEI [6], and f-DPG [7], and implementations have been presented, such as training pre-trained LLMs [8], [9]. Collecting human feedback data is time-consuming and expensive.
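
As a rough illustration of the add-on safety filter described in the abstract, the sketch below screens candidate next tokens with a barrier condition. The abstract does not state the exact filter rule, so this assumes the standard discrete-time CBF inequality h(x_{k+1}) - h(x_k) >= -alpha * h(x_k) with 0 < alpha <= 1. The names `cbf_filter` and `toy_h` and the word lists are hypothetical; in a real system `h` would come from an evaluation model such as a sentiment classifier, and the candidates from the baseline LLM.

```python
from typing import Callable, List, Sequence

# Hypothetical word lists used only by the toy constraint function below.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad"}


def toy_h(text: List[str]) -> float:
    """Toy constraint function: h >= 0 means the text counts as 'positive'.

    Assumption for illustration only; stands in for an evaluation model.
    """
    score = 1.0
    for word in text:
        if word in POSITIVE:
            score += 1.0
        elif word in NEGATIVE:
            score -= 1.0
    return score


def cbf_filter(
    text: List[str],
    candidates: Sequence[str],
    h: Callable[[List[str]], float],
    alpha: float = 0.5,
) -> List[str]:
    """Keep the candidate tokens that satisfy the assumed CBF condition."""
    h_now = h(text)
    allowed = []
    for token in candidates:
        h_next = h(text + [token])
        # Assumed discrete-time CBF condition:
        # h(x_{k+1}) - h(x_k) >= -alpha * h(x_k)
        if h_next - h_now >= -alpha * h_now:
            allowed.append(token)
    return allowed


near_boundary = cbf_filter(["the", "movie", "was"], ["great", "awful", "long"], toy_h)
far_from_boundary = cbf_filter(["good", "good"], ["awful"], toy_h)
```

In this toy setting, `near_boundary` keeps "great" and "long" but rejects "awful", while `far_from_boundary` still admits "awful" because the text is well inside the allowed region. That is the usual behavior of a barrier condition: it restricts a step only as the constraint boundary is approached, and the baseline model itself is never modified.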


CBF-LLM: Safe Control for LLM Alignment

arXiv.org Artificial Intelligence

While large language models (LLMs) are known to have strong language understanding and generation abilities, they can also generate harmful, biased, and toxic content [1], [2]. Alignment of LLMs ensures that they generate content that is "desirable" for the user, typically meaning content that is safe and ethical. Various approaches to LLM alignment have been presented ([1], [2], [3] and references therein). The major approach to alignment is reinforcement learning from human feedback (RLHF) [4], where a reward model is constructed from human feedback and used for training LLMs. Variants of RLHF architectures have also been proposed, such as Safe-RLHF [5], SENSEI [6], and f-DPG [7], along with implementations, such as training pre-trained LLMs [8], [9], and applications, such as an information-seeking chatbot [10].