Focused Transformer: Contrastive Training for Context Scaling

Dec-26-2025, 06:51:16 GMT–Neural Information Processing Systems

Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often limited by the effective context length. One solution to this issue is to endow an attention layer with access to an additional context, which comprises (key, value) pairs. Yet, as the number of documents increases, the ratio of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish.
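The dilution effect described in the abstract can be illustrated with a toy sketch. This is not the paper's method or code: it is a minimal, single-query scaled dot-product attention in plain Python, and the names `attend` and `relevant_mass`, the dimension of 16, and the assumption that the one relevant key equals the query are all hypothetical choices made for illustration. It shows only that the softmax weight on a fixed relevant key shrinks as more irrelevant keys are added to the additional context.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-query scaled dot-product attention over (key, value) pairs."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def relevant_mass(num_irrelevant, dim=16, seed=0):
    """Attention weight on one relevant key among `num_irrelevant` random keys."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    # Assumption for illustration: the relevant key matches the query exactly.
    relevant_key = list(query)
    # Irrelevant keys are random; with a fixed seed, larger sets extend smaller ones.
    noise_keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_irrelevant)]
    keys = [relevant_key] + noise_keys
    values = keys  # placeholder values; only the attention weights matter here
    _, weights = attend(query, keys, values)
    return weights[0]


# Weight on the relevant key as the additional context grows.
masses = [relevant_mass(n) for n in (1, 16, 256)]
```

Because the fixed seed makes each larger set of irrelevant keys a superset of the smaller one, the softmax denominator can only grow, so the entries of `masses` are strictly decreasing. The sketch does not model the overlap between keys of different semantic values that the abstract calls the distraction issue; it covers only the simpler dilution of attention.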

contrastive training, focused transformer, name change

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (0.60)
  - Machine Learning (0.37)