Control Barrier Function for Aligning Large Language Models

Miyaoka, Yuya, Inoue, Masaki

arXiv.org Artificial Intelligence 

Abstract--This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies a CBF safety filter to the token predicted by the baseline LLM in order to intervene in the generated text. The safety filter has two significant advantages: it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM, and if an evaluation model for the desired alignment is available, that model can be applied directly to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.

I. Introduction

While large language models (LLMs) are known to have strong language understanding, reasoning, and writing abilities, they can also generate harmful, biased, toxic, or unethical content [1], [2]. Alignment of LLMs ensures that they generate content that is "desirable" for the user, meaning that the content is ethical and safe. Various approaches to LLM alignment have been presented (see [1], [2], [3] and references therein). The major approach to LLM alignment is reinforcement learning from human feedback (RLHF) [4], in which a reward model is constructed from human feedback and then used to train the LLM. Variants of RLHF have also been proposed, such as Safe-RLHF [5], SENSEI [6], and f-DPG [7], and they have been implemented, for example, to train pre-trained LLMs [8], [9]. However, collecting human feedback data is time-consuming and expensive.