LongAttn: Selecting Long-context Training Data via Token-level Attention
Wu, Longyun, Zhu, Dawei, Zhao, Guangxiang, Yu, Zhuocheng, Ran, Junfeng, Wong, Xiangyu, Sun, Lin, Li, Sujian
With the development of large language models (LLMs), there has been a growing demand for stronger long-context capabilities. Enhancing these capabilities requires high-quality training data with long-range dependencies. Existing methods for selecting long-context data typically rely on sentence-level analysis, leaving considerable room for improvement in both performance and efficiency. In this paper, we propose LongAttn, a novel token-level framework that leverages the self-attention mechanism of LLMs to measure long-range dependencies in data. By computing token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Comprehensive experiments demonstrate the effectiveness, scalability, and efficiency of LongAttn. To facilitate future research on long-context data, we release our code and the high-quality long-context training set LongABC-32K.
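The scoring idea lends itself to a short sketch. Below is a minimal, illustrative PyTorch version of token-level long-range dependency scoring in the spirit of the abstract; the 2048-token distance threshold, the entropy-based reading of "distribution uniformity", and the product combination of the two signals are our assumptions, not details from the paper.

```python
# Minimal, illustrative PyTorch sketch of token-level long-range scoring.
# The distance threshold, the entropy-based uniformity measure, and the
# product combination are assumptions, not LongAttn's exact recipe.
import torch

def long_range_scores(attn: torch.Tensor, min_distance: int = 2048):
    """attn: (heads, seq_len, seq_len) attention weights from one layer.

    Returns per-token dependency strength (attention mass on keys at least
    `min_distance` positions back) and a rough uniformity score (normalized
    entropy of the long-range attention distribution)."""
    heads, seq_len, _ = attn.shape
    q = torch.arange(seq_len).unsqueeze(1)            # query positions
    k = torch.arange(seq_len).unsqueeze(0)            # key positions
    far = (q - k) >= min_distance                     # long-range causal pairs
    strength = (attn * far).sum(dim=-1).mean(dim=0)   # (seq_len,)

    p = attn * far                                    # long-range mass only
    p = p / p.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)
    uniformity = (entropy / torch.log(torch.tensor(float(seq_len)))).mean(dim=0)
    return strength, uniformity

def document_score(attn: torch.Tensor, min_distance: int = 2048) -> float:
    """One plausible way to combine the two signals into a ranking score."""
    strength, uniformity = long_range_scores(attn, min_distance)
    return (strength * uniformity).mean().item()
```

Documents would then be ranked by `document_score` and the top fraction kept for long-context training.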
How to Train Long-Context Language Models (Effectively)
Gao, Tianyu, Wettig, Alexander, Yen, Howard, Chen, Danqi
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development: instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data, as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
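Finding (1), blending long sources with high-quality short data, can be pictured as a simple mixture sampler. In the toy sketch below, the source names, the 60/40 long/short split, and the packing length are illustrative assumptions, not ProLong's published data mix.

```python
# Toy mixture sampler illustrating finding (1). The source names, the
# 60/40 long/short split, and the packing length in the comment are
# assumed values for illustration, not ProLong's published data mix.
import random

LONG_SOURCES = ["code_repos", "books"]         # packed into long sequences
SHORT_SOURCES = ["high_quality_web", "wiki"]   # assumed short-data pools
LONG_FRACTION = 0.6                            # assumed; tune per budget

def sample_batch_sources(batch_size: int, rng: random.Random) -> list[str]:
    """Pick a data source for each sequence slot in a training batch."""
    return [
        rng.choice(LONG_SOURCES if rng.random() < LONG_FRACTION
                   else SHORT_SOURCES)
        for _ in range(batch_size)
    ]

rng = random.Random(0)
print(sample_batch_sources(8, rng))
# Per finding (2), long sequences would be packed to a length beyond the
# evaluation length (e.g., train at 64K to evaluate at 32K).
```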
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models
Chen, Longze, Liu, Ziqiang, He, Wanwei, Li, Yunshui, Luo, Run, Yang, Min
Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability, since some training samples do not exhibit strong semantic dependencies across long contexts. In this study, we propose a data mining framework, ProLong, that assigns each training sample a long-dependency score, which can be used to rank and filter samples that are more advantageous for enhancing long-context modeling abilities in LLM training. Specifically, we first use delta perplexity scores to measure the Dependency Strength between text segments in a given document. We then refine this metric based on the Dependency Distance of these segments to incorporate spatial relationships across long contexts. Final results are calibrated with a Dependency Specificity metric to prevent trivial dependencies introduced by repetitive patterns. Moreover, a random sampling approach is proposed to optimize the computational efficiency of ProLong. Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies, and LLMs trained on these documents exhibit significantly enhanced long-context modeling capabilities.
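The Dependency Strength signal is easy to sketch: compare a segment's loss with and without its long-range context. The sketch below uses GPT-2 as a small stand-in scorer (a real run would use a long-context model and long segments); the function names are ours, and the Dependency Distance refinement and Dependency Specificity calibration are omitted.

```python
# Sketch of the delta-perplexity idea behind Dependency Strength, with
# GPT-2 as a stand-in scorer. Function names and segment handling are
# ours; the Distance and Specificity steps are omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def nll(ids: torch.Tensor, target_start: int) -> float:
    """Mean negative log-likelihood of ids[target_start:] given the prefix."""
    logits = lm(ids.unsqueeze(0)).logits[0, :-1]      # token t predicts t+1
    losses = torch.nn.functional.cross_entropy(
        logits, ids[1:], reduction="none")
    return losses[target_start - 1:].mean().item()

def dependency_strength(ctx_ids: torch.Tensor, tgt_ids: torch.Tensor) -> float:
    """NLL(target alone) - NLL(target | context): positive values mean the
    earlier segment genuinely helps predict the later one."""
    alone = nll(tgt_ids, target_start=1)
    joint = nll(torch.cat([ctx_ids, tgt_ids]), target_start=len(ctx_ids))
    return alone - joint

ctx = tok("The treaty was signed in Vienna in 1815.", return_tensors="pt")
tgt = tok(" The signing took place in Vienna.", return_tensors="pt")
print(dependency_strength(ctx.input_ids[0], tgt.input_ids[0]))
```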
Catch and Prolong: recurrent neural network for seeking track-candidates
Baranov, Dmitriy, Ososkov, Gennady, Goncharov, Pavel, Tsytrinov, Andrei
One of the most important problems in data processing for high-energy and nuclear physics is event reconstruction. Its main component is track reconstruction, which consists in finding, among a huge number of points (so-called hits, produced when flying particles fire detector coordinate planes), all tracks that elementary particles leave as they pass through a detector. Unfortunately, tracking is seriously impeded by a well-known shortcoming of multiwire, strip, and GEM detectors: they produce many fake hits caused by spurious extra crossings of fired strips. Since the number of these fakes is several orders of magnitude greater than the number of true hits, unraveling possible track-candidates from true hits while ignoring fakes is quite difficult. We introduce a renewed method that significantly improves our previous two-stage approach, which was based on hit preprocessing using a directed K-d tree search followed by a deep neural classifier. We combine these two stages into one by applying a recurrent neural network that simultaneously determines whether a set of points belongs to a true track and predicts where to look for the next track point on the next coordinate plane of the detector. We show that the proposed deep network is more accurate, faster, and does not require any special preprocessing stage. Preliminary results of our approach on simulated events from the BM@N GEM detector are presented.
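The joint formulation, one recurrent network that both classifies a candidate and predicts where to search next, can be sketched in a few lines of PyTorch. The layer sizes, the 2-D hit encoding, and the two-head design below are generic assumptions, not the paper's exact architecture.

```python
# Generic sketch of the joint catch-and-prolong idea: one recurrent net
# that classifies a track-candidate as true/fake and predicts where to
# look for its hit on the next coordinate plane. Layer sizes and the
# 2-D hit encoding are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class TrackSeeker(nn.Module):
    def __init__(self, hit_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(hit_dim, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, 1)        # true-track vs. fake
        self.predict = nn.Linear(hidden, hit_dim)   # next-plane hit position

    def forward(self, hits: torch.Tensor):
        """hits: (batch, planes, hit_dim) candidate hit coordinates."""
        states, _ = self.gru(hits)
        last = states[:, -1]                        # state after final plane
        return self.classify(last), self.predict(last)

model = TrackSeeker()
hits = torch.randn(4, 6, 2)          # 4 candidates seen over 6 planes
logit, next_hit = model(hits)
print(logit.shape, next_hit.shape)   # torch.Size([4, 1]) torch.Size([4, 2])
```

Training such a model would combine a classification loss on the logit with a regression loss on the predicted next-hit position, which is one natural reading of the "simultaneously determines and predicts" formulation.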