Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
Neural Information Processing Systems
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; and (2) to navigate long streams of evolving topics and tasks, approximating the complexity and dynamism of contexts in real-world scenarios. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively.
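The evaluation protocol described above can be sketched in a few lines: concatenate in-context demonstrations from a stream of tasks (the "haystack"), query the model on one earlier task (the "needle"), and compare accuracy against a Single-task ICL baseline. The sketch below is a hypothetical illustration; the prompt format, task fields, and the pass-criterion margin are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a Task Haystack-style evaluation prompt.
# Task names, prompt formatting, and the `margin` threshold are
# illustrative assumptions, not the paper's actual configuration.

def format_demos(task_name, instruction, examples):
    """Render one task's in-context demonstrations as a text block."""
    lines = [f"Task: {task_name}", f"Instruction: {instruction}"]
    for x, y in examples:
        lines.append(f"Input: {x}\nOutput: {y}")
    return "\n".join(lines)

def lifelong_icl_prompt(task_stream, target_task, test_input):
    """Concatenate demos from a sequence of tasks (the haystack),
    then query the model about one of those tasks (the needle)."""
    blocks = [format_demos(t["name"], t["instruction"], t["examples"])
              for t in task_stream]
    context = "\n\n".join(blocks)
    query = (f"Task: {target_task['name']}\n"
             f"Instruction: {target_task['instruction']}\n"
             f"Input: {test_input}\nOutput:")
    return context + "\n\n" + query

def passes_task_haystack(lifelong_acc, single_task_acc, margin=0.05):
    """A model 'passes' when its Lifelong ICL accuracy is not
    significantly worse than its Single-task ICL baseline
    (the 5-point margin here is an assumed placeholder)."""
    return lifelong_acc >= single_task_acc - margin

# Toy usage: a two-task stream, tested on the first task.
tasks = [
    {"name": "sentiment", "instruction": "Classify the sentiment.",
     "examples": [("great movie", "positive"), ("dull plot", "negative")]},
    {"name": "topic", "instruction": "Classify the topic.",
     "examples": [("stocks fell", "finance"), ("rain expected", "weather")]},
]
prompt = lifelong_icl_prompt(tasks, tasks[0], "loved the acting")
```

Because the needle task's demonstrations sit somewhere inside a long stream of other tasks' demonstrations, the model must retrieve and apply the right task format rather than copy a literal string, which is what distinguishes this from plain NIAH retrieval.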
May-28-2025, 16:22:00 GMT