Forecasting Rare Language Model Behaviors
Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma
arXiv.org Artificial Intelligence
Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
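The abstract's forecasting idea can be made concrete with a short sketch. Below is a minimal Python illustration, assuming (a) each query's elicitation probability is estimated by Monte Carlo over repeated model samples, and (b) the largest observed probabilities are extrapolated to deployment-scale query volumes via a Gumbel-tail (extreme-value) fit. The function names, the `sample_response`/`shows_behavior` callables, and the tail-fitting choice are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def estimate_elicitation_probs(queries, sample_response, shows_behavior, k=100):
    """Monte Carlo estimate of each query's elicitation probability: the
    fraction of k sampled responses that exhibit the target behavior.
    `sample_response` and `shows_behavior` are hypothetical callables
    standing in for the model API and a behavior classifier."""
    probs = []
    for q in queries:
        hits = sum(shows_behavior(sample_response(q)) for _ in range(k))
        probs.append(hits / k)
    return np.array(probs)

def forecast_max_elicitation_prob(probs, n_eval, n_deploy, top_frac=0.05):
    """Forecast the largest elicitation probability expected among n_deploy
    deployment queries from n_eval evaluated queries, by fitting a line to
    the upper tail of the empirical quantiles on a Gumbel scale (an
    extreme-value-style assumption, not necessarily the paper's recipe)."""
    tail = np.sort(probs)[-max(2, int(top_frac * len(probs))):]
    ranks = np.arange(len(tail), 0, -1)   # j-th largest ~ quantile 1 - j/n_eval
    log_volume = np.log(n_eval / ranks)   # effective query volume per rank
    g = -np.log(-np.log(np.clip(tail, 1e-12, 1 - 1e-12)))  # Gumbel transform
    slope, intercept = np.polyfit(log_volume, g, 1)
    g_deploy = slope * np.log(n_deploy) + intercept
    return float(np.exp(-np.exp(-g_deploy)))  # invert the Gumbel transform
```

For instance, `forecast_max_elicitation_prob(probs, n_eval=10_000, n_deploy=10_000_000)` would estimate how likely the worst-case query is to elicit the behavior at a 1000x larger deployment, which is the kind of extrapolation across orders of magnitude the abstract describes.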
Feb-23-2025
- Country:
- North America > United States (0.14)
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Government > Commerce (0.92)
- Health & Medicine > Therapeutic Area
- Law (1.00)
- Materials > Chemicals (1.00)