AITopics | short sequence

Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Neural Information Processing Systems | Jun-21-2026, 09:37:02 GMT

Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia (0.67)
North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Neural Information Processing Systems | Jun-13-2026, 17:23:36 GMT

Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT.

artificial intelligence, large language model, natural language, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.85)


Meta Learning with Relational Information for Short Sequences

Yujia Xie, Haoming Jiang, Feng Liu, Tuo Zhao, Hongyuan Zha

Neural Information Processing Systems | Feb-12-2026, 13:36:47 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, harmless, machine learning, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
North America > United States > Pennsylvania > Montgomery County (0.04)
North America > United States > New York (0.04)
(3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.98)


Meta Learning with Relational Information for Short Sequences

Neural Information Processing Systems | Dec-25-2025, 13:27:26 GMT

This paper proposes a new meta-learning method, named HARMLESS (HAwkes Relational Meta Learning method for Short Sequences), for learning heterogeneous point process models from a collection of short event sequence data along with a relational network. Specifically, we propose a hierarchical Bayesian mixture Hawkes process model, which naturally incorporates the relational information among sequences into point process modeling. Compared with existing methods, our model can capture the underlying mixed-community patterns of the relational network, which simultaneously encourages knowledge sharing among sequences and facilitates adaptive learning for each individual sequence. We further propose an efficient stochastic variational meta-EM algorithm, which can scale to large problems. Numerical experiments on both synthetic and real data show that HARMLESS outperforms existing methods in predicting future events.

meta learning, relational information, short sequence, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Meta Learning with Relational Information for Short Sequences

Yujia Xie, Haoming Jiang, Feng Liu, Tuo Zhao, Hongyuan Zha

Neural Information Processing Systems | Oct-2-2025, 23:32:10 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, sequence, (15 more...)

Neural Information Processing Systems

Country:

North America (0.28)
Asia > China (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)


ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs

Ge, Hao, Feng, Junda, Huang, Qi, Fu, Fangcheng, Nie, Xiaonan, Zuo, Lei, Lin, Haibin, Cui, Bin, Liu, Xin

arXiv.org Artificial Intelligence | Feb-28-2025

Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences for LLM training typically vary in length, whether for text, multimodal data, or reinforcement learning. The mismatch between data heterogeneity and the static mesh causes redundant communication and imbalanced computation, degrading training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. In addition, we develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with model sizes ranging from 7B to 141B, context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experimental results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
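To make the imbalanced-computation problem concrete, here is a minimal sketch of length-aware data assignment. It is not ByteScale's balance scheduler: the function name `assign_sequences`, the greedy longest-first heuristic, and the quadratic cost model (attention cost growing roughly with the square of sequence length) are all assumptions chosen for illustration.

```python
import heapq


def assign_sequences(lengths, num_devices):
    """Greedily assign sequences to devices, longest first.

    Assumption: per-sequence cost is length**2, a rough stand-in for
    attention compute. Each sequence goes to the least-loaded device.
    """
    # Min-heap of (accumulated cost, device id).
    heap = [(0, d) for d in range(num_devices)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_devices)]
    for n in sorted(lengths, reverse=True):
        load, d = heapq.heappop(heap)
        buckets[d].append(n)
        heapq.heappush(heap, (load + n * n, d))
    return buckets


# One long sequence and four short ones on two devices.
print(assign_sequences([8, 1, 1, 1, 1], 2))
```

Under this cost model the long sequence occupies one device alone while the short sequences share the other, which is the kind of imbalance a static, length-unaware split would not account for.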

communication, computation, sequence, (15 more...)

arXiv.org Artificial Intelligence

2502.21231

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States (0.04)
Asia > Middle East > Jordan (0.04)
Europe > Italy > Lombardy > Milan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Reviews: Meta Learning with Relational Information for Short Sequences

Neural Information Processing Systems | Jan-24-2025, 15:43:31 GMT

In this paper, the authors propose a hierarchical Bayesian mixture of Hawkes processes with a parameter adaptation mechanism based on a meta-learning technique for modeling multiple short event sequences with graph-like side information. In the proposed model, each sequence is modeled by a mixture of Hawkes processes, whose mixture ratio is related to the adjacency of the sequence to the other sequences. Moreover, the parameters of the component Hawkes processes are slightly varied among sequences using the mechanism of the model-agnostic meta-learning framework. The authors provide experimental results on synthetic and real-world datasets, which show the superiority of the proposed method. Overall, the paper is very well written.

hawkes process, relational information, sequence, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Reviews: Meta Learning with Relational Information for Short Sequences

Neural Information Processing Systems | Jan-24-2025, 15:43:20 GMT

The paper proposes a hierarchical model for multivariate point process data with known network information. It uses a mixture of Hawkes processes for the point process observations, and treats the observed network as a mixed membership stochastic block model sharing the same mixture weights. The main technical novelty is to use model-agnostic meta-learning (MAML) to implement the hierarchical prior on the Hawkes process parameters. However, this technical contribution (MAML Hawkes/network models) is not compared to standard hierarchical Bayesian techniques. Specifically, the parameters \theta_{k}^{(i)} are only three dimensional (background rate \mu, scale \delta, and time constant \omega).
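To illustrate what a three-parameter Hawkes component might look like, the sketch below evaluates a conditional intensity from a background rate \mu, a scale \delta, and a time constant \omega. The exponential kernel, \lambda(t) = \mu + \delta \sum_{t_i < t} \exp(-\omega (t - t_i)), is an assumption for illustration; the review does not give the paper's exact parameterization, and the name `hawkes_intensity` is hypothetical.

```python
import math


def hawkes_intensity(t, history, mu, delta, omega):
    """Conditional intensity at time t under an assumed exponential kernel.

    mu:    background rate
    delta: scale of the excitation from each past event
    omega: decay rate (time constant) of the excitation
    """
    # Only events strictly before t contribute to the intensity.
    excitation = sum(math.exp(-omega * (t - s)) for s in history if s < t)
    return mu + delta * excitation


# With no past events the intensity equals the background rate.
print(hawkes_intensity(1.0, [], mu=0.5, delta=0.8, omega=2.0))
# A single past event at time 0 adds a decayed contribution.
print(hawkes_intensity(1.0, [0.0], mu=0.5, delta=0.8, omega=2.0))
```

With only three scalars per component, the per-sequence parameter vector is small, which is the context for the reviewer's remark about comparing against standard hierarchical Bayesian techniques.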

meta learning, relational information, short sequence, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Meta Learning with Relational Information for Short Sequences

Neural Information Processing Systems | Oct-10-2024, 06:44:01 GMT

This paper proposes a new meta-learning method, named HARMLESS (HAwkes Relational Meta Learning method for Short Sequences), for learning heterogeneous point process models from a collection of short event sequence data along with a relational network. Specifically, we propose a hierarchical Bayesian mixture Hawkes process model, which naturally incorporates the relational information among sequences into point process modeling. Compared with existing methods, our model can capture the underlying mixed-community patterns of the relational network, which simultaneously encourages knowledge sharing among sequences and facilitates adaptive learning for each individual sequence. We further propose an efficient stochastic variational meta-EM algorithm, which can scale to large problems. Numerical experiments on both synthetic and real data show that HARMLESS outperforms existing methods in predicting future events.

meta learning, relational information, short sequence, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Frohmann, Markus, Sterner, Igor, Vulić, Ivan, Minixhofer, Benjamin, Schedl, Markus

arXiv.org Artificial Intelligence | Jun-24-2024

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.

computational linguistic, proceedings, segmentation, (14 more...)

arXiv.org Artificial Intelligence

2406.16678

Country: