short sequence
- Asia > China > Guangdong Province > Shenzhen (0.05)
- North America > United States > Pennsylvania > Montgomery County (0.04)
- North America > United States > New York (0.04)
- (3 more...)
Meta Learning with Relational Information for Short Sequences
This paper proposes a new meta-learning method, HARMLESS (HAwkes Relational Meta Learning method for Short Sequences), for learning heterogeneous point process models from a collection of short event sequences along with a relational network. Specifically, we propose a hierarchical Bayesian mixture Hawkes process model, which naturally incorporates the relational information among sequences into point process modeling. Compared with existing methods, our model can capture the underlying mixed-community patterns of the relational network, which simultaneously encourages knowledge sharing among sequences and facilitates adaptive learning for each individual sequence. We further propose an efficient stochastic variational meta-EM algorithm that scales to large problems. Numerical experiments on both synthetic and real data show that HARMLESS outperforms existing methods at predicting future events.
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)
ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
Ge, Hao, Feng, Junda, Huang, Qi, Fu, Fangcheng, Nie, Xiaonan, Zuo, Lei, Lin, Haibin, Cui, Bin, Liu, Xin
Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal and establish static communication groups that organize the devices into a static mesh (e.g., a 2D mesh). However, the sequences used for LLM training typically vary in length, whether for text, multimodal data, or reinforcement learning. The mismatch between this data heterogeneity and the static mesh causes redundant communication and imbalanced computation, degrading training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences via data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences via selective offloading. Besides, we develop a balance scheduler to mitigate imbalanced computation through parallelism-aware data assignment. We evaluate ByteScale with model sizes ranging from 7B to 141B and context lengths from 256K to 2048K on a production cluster with more than 12,000 GPUs. Experimental results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
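ByteScale's balance scheduler is not specified in detail here, but the general idea of parallelism-aware data assignment — routing variable-length sequences so that each data-parallel rank receives roughly equal compute — can be sketched as greedy length balancing. The function name and the longest-first heuristic below are illustrative assumptions, not ByteScale's actual algorithm:

```python
import heapq

def balance_sequences(seq_lens, num_ranks):
    """Greedy longest-first assignment: give each sequence to the
    currently least-loaded rank, approximating equal token counts."""
    # Min-heap of (total_tokens_assigned, rank_id).
    heap = [(0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_ranks)}
    # Assigning longest sequences first tightens the balance bound.
    for length in sorted(seq_lens, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + length, rank))
    return assignment

ranks = balance_sequences([9, 7, 6, 5, 4, 3, 2], num_ranks=2)
print({r: sum(ls) for r, ls in ranks.items()})  # per-rank token loads
```

In a real system the "length" would be a compute-cost estimate (attention is quadratic in sequence length), and the assignment would additionally respect the context-parallel sharding of very long sequences.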
Reviews: Meta Learning with Relational Information for Short Sequences
In this paper, the authors propose a hierarchical Bayesian mixture of Hawkes processes with a parameter adaptation mechanism based on a meta-learning technique for modeling multiple short event sequences with graph-like side information. In the proposed model, each sequence is modeled by a mixture of Hawkes processes whose mixture weights are tied to the sequence's adjacency to the other sequences. Moreover, the parameters of the component Hawkes processes are slightly varied across sequences via the mechanism of the model-agnostic meta-learning framework. The authors provide experimental results on synthetic and real-world datasets, which show the superiority of the proposed method. Overall, the paper is very well written.
Reviews: Meta Learning with Relational Information for Short Sequences
The paper proposes a hierarchical model for multivariate point process data with known network information. It uses a mixture of Hawkes processes for the point process observations, and treats the observed network as a mixed-membership stochastic block model sharing the same mixture weights. The main technical novelty is to use model-agnostic meta-learning (MAML) to implement the hierarchical prior on the Hawkes process parameters. However, this technical contribution (MAML for Hawkes/network models) is not compared to standard hierarchical Bayesian techniques. Specifically, the parameters \theta_k^{(i)} are only three-dimensional (background rate \mu, scale \delta, and time constant \omega).
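For reference, the three parameters the review names correspond to a univariate Hawkes process with an exponential excitation kernel. A minimal sketch of the conditional intensity (the paper's exact parameterization, e.g. how \delta and \omega enter the kernel, may differ):

```python
import math

def hawkes_intensity(t, events, mu, delta, omega):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel: lambda(t) = mu + sum_{t_i < t} delta * exp(-omega * (t - t_i))."""
    return mu + sum(delta * math.exp(-omega * (t - t_i))
                    for t_i in events if t_i < t)

# With no past events the intensity is just the background rate mu.
print(hawkes_intensity(1.0, [], mu=0.5, delta=0.8, omega=1.0))  # 0.5
# Each past event adds a bump of height delta that decays at rate omega.
print(hawkes_intensity(1.0, [0.0, 0.9], mu=0.5, delta=0.8, omega=1.0))
```

The reviewer's point is that with only these three scalars per component, a standard hierarchical Bayesian prior over (\mu, \delta, \omega) is a natural baseline against which the MAML-based adaptation should be compared.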
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Frohmann, Markus, Sterner, Igor, Vulić, Ivan, Minixhofer, Benjamin, Schedl, Markus
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
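To see why criterion (i) matters, consider the punctuation-reliant baseline the abstract contrasts with: a rule-based splitter works only while lexical cues are present. The toy segmenter below is an illustrative stand-in for such methods, not SaT itself:

```python
import re

def naive_split(text):
    """Toy rule-based segmenter: split after ., !, or ? followed by
    whitespace. This is the lexical-feature approach SaT improves on."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

well_formed = "It works. It splits here! Done?"
print(naive_split(well_formed))   # three sentences, as expected

no_punct = "it works it should split here done"
print(naive_split(no_punct))      # one undivided blob: the failure SaT targets
```

On well-formatted text the rule does fine; strip the punctuation (transcripts, lyrics, OCR output) and it returns the whole input as a single "sentence", which is exactly the poorly formatted setting the paper evaluates.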
- North America > Dominican Republic (0.04)
- Asia > Singapore (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (26 more...)
- Overview (1.00)
- Research Report > New Finding (0.45)
- Leisure & Entertainment (1.00)
- Information Technology (0.67)
- Media > Music (0.46)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.45)
Meta-Learning for Neural Network-based Temporal Point Processes
Takimoto, Yoshiaki, Tanaka, Yusuke, Iwata, Tomoharu, Okawa, Maya, Kim, Hideaki, Toda, Hiroyuki, Kurashima, Takeshi
Human activities generate various event sequences such as taxi trip records, bike-sharing pick-ups, crime occurrences, and infectious disease transmissions. Point processes are widely used in many applications to predict such events. However, point processes present two problems when predicting events related to human activities. First, recent high-performance point process models require sufficient numbers of events collected over a long period (i.e., long sequences) for training, which are often unavailable in realistic situations. Second, the long-term predictions required in real-world applications are difficult to make. To tackle these problems, we propose a novel meta-learning approach for periodicity-aware prediction of future events given short sequences. The proposed method first embeds short sequences into hidden representations (i.e., task representations) via recurrent neural networks, enabling predictions from short sequences. It then models the intensity of the point process with monotonic neural networks (MNNs) that take the task representations as input. We transfer prior knowledge learned from related tasks to improve event prediction given short sequences of target tasks. We design the MNNs to explicitly take temporal periodic patterns into account, contributing to improved long-term prediction performance. Experiments on multiple real-world datasets demonstrate that the proposed method achieves higher prediction performance than existing alternatives.
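The paper's exact MNN architecture is not given here, but the generic construction behind monotonic neural networks — constrain the weights to be positive and use monotone activations, so the output is non-decreasing in the input — can be sketched as follows. The layer sizes, the softplus reparameterization, and the parameter values are illustrative assumptions:

```python
import math

def softplus(x):
    """Smooth map from any real to a strictly positive value."""
    return math.log1p(math.exp(x))

class MonotonicMLP:
    """One-hidden-layer MLP whose output is non-decreasing in its input:
    weights are forced positive via softplus, and tanh is monotone."""
    def __init__(self, raw_w1, raw_b1, raw_w2, raw_b2):
        self.w1 = [softplus(w) for w in raw_w1]   # positive input->hidden weights
        self.b1 = raw_b1
        self.w2 = [softplus(w) for w in raw_w2]   # positive hidden->output weights
        self.b2 = raw_b2

    def __call__(self, t):
        hidden = [math.tanh(w * t + b) for w, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2

net = MonotonicMLP([0.3, -0.2], [0.0, 0.5], [0.1, 0.4], 0.0)
values = [net(t) for t in (0.0, 1.0, 2.0, 3.0)]
print(values)  # non-decreasing in t by construction
```

Monotonicity is what makes such a network suitable for modeling the cumulative intensity of a point process: if the network represents a non-decreasing \Lambda(t), its derivative yields a valid non-negative intensity.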
- North America > United States > New York (0.04)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
- Transportation (0.68)
- Health & Medicine (0.48)
Contrastive Self-supervised Sequential Recommendation with Robust Augmentation
Liu, Zhiwei, Chen, Yongjun, Li, Jia, Yu, Philip S., McAuley, Julian, Xiong, Caiming
Sequential recommendation describes a set of techniques that model dynamic user behavior in order to predict future interactions in sequential user data. At their core, such approaches model transition probabilities between items in a sequence, whether through Markov chains, recurrent networks, or, more recently, Transformers. However, both old and new issues remain, including data sparsity and noisy data; such issues can impair performance, especially in complex, parameter-hungry models. In this paper, we investigate the application of contrastive Self-Supervised Learning (SSL) to sequential recommendation as a way to alleviate some of these issues. Contrastive SSL constructs augmentations from unlabelled instances, where agreements among positive pairs are maximized. Devising a contrastive SSL framework for sequential recommendation is challenging due to its discrete nature, correlations among items, and skewness of length distributions. To this end, we propose a novel framework, Contrastive Self-supervised Learning for sequential Recommendation (CoSeRec). We introduce two informative augmentation operators that leverage item correlations to create high-quality views for contrastive learning. Experimental results on three real-world datasets demonstrate the effectiveness of the proposed method in improving model performance and robustness against sparse and noisy data. Our implementation is available online at \url{https://github.com/YChen1993/CoSeRec}
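CoSeRec's informative operators exploit item correlations; as a simpler illustration of how contrastive "views" of a user sequence are formed, the crop and mask operators below follow the generic recipe from sequential contrastive learning. The operators and ratios here are assumptions for illustration, not CoSeRec's correlation-aware ones:

```python
import random

def crop(seq, ratio=0.6, rng=random):
    """Keep a random contiguous sub-sequence as one contrastive view."""
    n = max(1, int(len(seq) * ratio))
    start = rng.randrange(len(seq) - n + 1)
    return seq[start:start + n]

def mask(seq, ratio=0.3, mask_token=0, rng=random):
    """Replace a random subset of items with a mask token."""
    k = int(len(seq) * ratio)
    idx = set(rng.sample(range(len(seq)), k))
    return [mask_token if i in idx else item for i, item in enumerate(seq)]

rng = random.Random(0)
seq = [101, 102, 103, 104, 105]          # a user's item-interaction history
view_a, view_b = crop(seq, rng=rng), mask(seq, rng=rng)
print(view_a, view_b)  # two stochastic views of the same history
```

In a contrastive framework the two views of the same sequence form a positive pair whose representations are pulled together (e.g., with an InfoNCE loss), while views of different users' sequences are pushed apart.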
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (2 more...)