CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses

Neural Information Processing Systems 

The rapid progress in Large Language Models (LLMs) poses potential risks such as generating unethical content. Assessing the values embedded in LLMs' generated responses can help expose their misalignment, but this relies on reference-free value evaluators, e.g. Nevertheless, two key challenges emerge in open-ended value evaluation: the evaluator should adapt to changing human value definitions with minimal annotation, against their own bias (adaptability); and remain robust across varying value expressions and scenarios (generalizability). To handle these challenges, we introduce CLAVE, a novel framework that integrates two complementary LLMs: a large model to extract high-level value concepts from diverse responses, leveraging its extensive knowledge and generalizability, and a small model fine-tuned on these concepts to adapt to human value annotations. This dual-model framework enables adaptation to any value system using 100 human-labeled samples per value type. We benchmark the performance of 15 popular LLM evaluators and fully analyze their strengths and weaknesses.