ACER: Automatic Language Model Context Extension via Retrieval

Luyu Gao, Yunyi Zhang, Jamie Callan

Long-context modeling is a critical capability of language AI for digesting and reasoning over complex bodies of information. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that current open-weight generalist long-context models still fall short on practical long-context processing tasks. This suggests that fully effective long-context modeling demands task-specific data, but collecting such data can be prohibitively expensive. In this paper, we draw inspiration from how humans process a large body of information: a lossy retrieval stage ranks a large set of documents, and the reader ends up reading deeply only the top candidates. We build an automatic data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are then tuned on this self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieve-and-read pipeline used to synthesize the training data on real-world tasks such as long-context retrieval-augmented generation.

The fields of Artificial Intelligence (AI) and Natural Language Processing (NLP) have made substantial progress in building and teaching neural language models (LMs) to understand and generate language (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023; Anthropic, 2023; 2024; Touvron et al., 2023a;b; MetaAI et al., 2024). Large-scale deep learning has enabled large LMs to learn from massive amounts of human-generated text (Radford et al., 2019; Brown et al., 2020).
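To make the retrieve-then-read synthesis recipe from the abstract concrete, below is a minimal sketch of one pipeline iteration. It is an illustration under assumed interfaces, not the paper's implementation: `retriever.score`, `short_lm.generate`, and the prompt template are hypothetical stand-ins for any embedding-based ranker, any short-context LM, and any task format.

    import numpy as np

    def synthesize_example(question, corpus, retriever, short_lm, k=5):
        """Build one long-context training pair by mimicking retrieve-then-read.

        `retriever` and `short_lm` are hypothetical objects standing in for
        an embedding-based ranker and a short-context LM; their methods
        below are assumed, not ACER's actual interface.
        """
        # Lossy retrieval stage: score and rank every document in the corpus.
        scores = retriever.score(question, corpus)   # one relevance score per doc
        ranked = [corpus[i] for i in np.argsort(scores)[::-1]]

        # Deep-read stage: the short-context LM reads only the top-k candidates
        # and produces an (imperfect) answer.
        evidence = "\n\n".join(ranked[:k])
        answer = short_lm.generate(
            f"Context:\n{evidence}\n\nQuestion: {question}\nAnswer:"
        )

        # Training pair: the full ranked context becomes the long input; the
        # pipeline's answer becomes the target. Fine-tuning the short-context
        # LM on many such pairs is what extends its context capability.
        long_input = "\n\n".join(ranked)
        return {
            "input": f"Context:\n{long_input}\n\nQuestion: {question}\nAnswer:",
            "target": answer,
        }

The point the abstract stresses is that such pairs are imperfect, retrieval-biased supervision; the claim is that a short-context model fine-tuned on them can nonetheless bootstrap past the very retrieve-and-read pipeline that produced them.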