Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
Neural Information Processing Systems
We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely adopted "needle-in-a-haystack" (NIAH) evaluation, but presents distinct new challenges. It requires models (1) to utilize the contexts at a deeper level, rather than resorting to simple copying and pasting; and (2) to navigate long streams of evolving topics and tasks, approximating the complexity and dynamism of contexts in real-world scenarios. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively.
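The evaluation protocol described above can be sketched in a few lines: concatenate in-context demonstrations from a stream of tasks (the "haystack"), query the model on one earlier task (the "needle"), and compare accuracy against a Single-task ICL baseline. The sketch below is a hypothetical illustration; the prompt format, task fields, and the pass-criterion margin are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a Task Haystack-style evaluation prompt.
# Task names, prompt formatting, and the `margin` threshold are
# illustrative assumptions, not the paper's actual configuration.

def format_demos(task_name, instruction, examples):
    """Render one task's in-context demonstrations as a text block."""
    lines = [f"Task: {task_name}", f"Instruction: {instruction}"]
    for x, y in examples:
        lines.append(f"Input: {x}\nOutput: {y}")
    return "\n".join(lines)

def lifelong_icl_prompt(task_stream, target_task, test_input):
    """Concatenate demos from a sequence of tasks (the haystack),
    then query the model about one of those tasks (the needle)."""
    blocks = [format_demos(t["name"], t["instruction"], t["examples"])
              for t in task_stream]
    context = "\n\n".join(blocks)
    query = (f"Task: {target_task['name']}\n"
             f"Instruction: {target_task['instruction']}\n"
             f"Input: {test_input}\nOutput:")
    return context + "\n\n" + query

def passes_task_haystack(lifelong_acc, single_task_acc, margin=0.05):
    """A model 'passes' when its Lifelong ICL accuracy is not
    significantly worse than its Single-task ICL baseline
    (the 5-point margin here is an assumed placeholder)."""
    return lifelong_acc >= single_task_acc - margin

# Toy usage: a two-task stream, tested on the first task.
tasks = [
    {"name": "sentiment", "instruction": "Classify the sentiment.",
     "examples": [("great movie", "positive"), ("dull plot", "negative")]},
    {"name": "topic", "instruction": "Classify the topic.",
     "examples": [("stocks fell", "finance"), ("rain expected", "weather")]},
]
prompt = lifelong_icl_prompt(tasks, tasks[0], "loved the acting")
```

Because the needle task's demonstrations sit somewhere inside a long stream of other tasks' demonstrations, the model must retrieve and apply the right task format rather than copy a literal string, which is what distinguishes this from plain NIAH retrieval.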
May-28-2025, 16:22:00 GMT