An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner
As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Our test compares generations from a baseline model to those of the model under scrutiny, and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach in two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test can detect meaningful changes in behavior distributions using just hundreds of examples.

Language models (LMs) can now achieve human-level performance on a wide range of tasks, including text summarization, machine translation, coding, and even acting as AI scientists: generating hypotheses and designing experiments (Achiam et al., 2023; Katz et al., 2024; Lu et al., 2024; Zhang et al., 2024). Because of this, many sectors are looking for ways to use them to improve existing systems (Kasneci et al., 2023; Felten et al., 2023).
arXiv.org Artificial Intelligence
Oct-25-2024
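The abstract does not spell out the test statistic, but the ingredients it names (a baseline model, a tolerance parameter, false-positive control over a stream of generations) fit a sequential, anytime-valid testing scheme. Below is a minimal illustrative sketch, not the authors' actual test: it assumes each generation's behavior is summarized by a bounded score in [0, 1] (e.g. a toxicity classifier's probability), and it flags a shift once a testing-by-betting wealth process crosses 1/alpha. The function name, the fixed bet size `lam`, the one-sided null, and the Beta-distributed demo scores are all assumptions made for illustration.

```python
import numpy as np

def sequential_shift_audit(score_pairs, epsilon=0.05, alpha=0.05, lam=0.5):
    """Illustrative anytime-valid shift audit via testing by betting
    (a sketch under the assumptions above, not the paper's test).

    score_pairs yields (s_base, s_new): behavior scores in [0, 1] for the
    same prompt sent to the baseline model and the model under scrutiny.
    H0: the new model's mean score exceeds the baseline's by at most
    `epsilon`, i.e. any shift is within tolerance. Under H0 the wealth is a
    nonnegative supermartingale, so by Ville's inequality it crosses
    1/alpha with probability at most alpha: the false-positive rate is
    controlled no matter when monitoring stops.
    """
    # Keep wealth strictly positive: 1 + lam*(d - epsilon) > 0 for d >= -1.
    assert 0.0 < lam < 1.0 / (1.0 + epsilon)
    wealth = 1.0
    for t, (s_base, s_new) in enumerate(score_pairs, start=1):
        d = s_new - s_base                  # per-prompt score difference
        wealth *= 1.0 + lam * (d - epsilon)
        if wealth >= 1.0 / alpha:
            return t                        # shift flagged after t examples
    return None                             # no shift detected on this stream

# Hypothetical demo: baseline scores ~ Beta(2, 8) (mean 0.2), audited model
# ~ Beta(3, 7) (mean 0.3), a true shift of 0.1 > epsilon = 0.05.
rng = np.random.default_rng(0)
pairs = ((rng.beta(2, 8), rng.beta(3, 7)) for _ in range(2000))
print(sequential_shift_audit(pairs))
```

Rejecting only when the wealth crosses 1/alpha makes the guarantee hold at any stopping time, which matches the continual-auditing setting; the tolerance epsilon sets how large a shift counts as a behavioral change, mirroring the configurable tolerance parameter the abstract describes. On the demo above, the test typically fires within a few hundred pairs, consistent with the "hundreds of examples" scale the abstract reports.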