sequence data
Supplementary Material Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation Yingyi Chen
Comments on Theorem 3.2 With the primal problem in (6) in the paper, Theorem 3.2 provides Additionally, [27] presents the optimization w.r.t. a single projection direction in Therefore, our KSVD is more general in the data setups. Remark 3.3, we show that the values can be regarded as playing the role of the dual variables Using data-dependent projection weights does not affect the derivation of the shifted eigenvalue problem in the dual. With the derivations of the primal-dual optimization problems above, the primal-dual model representation of our KSVD problem can be provided correspondingly. Lemma 4.2 evaluates the objective value Moreover, as in the proof of Theorem 3.2, we note that the regularization coefficient This section provides the implementation details of all experiments included in the paper. This will be illustrated in details in the following.Algorithm 1 Learning with Primal-AttentionRequire: X:= [ x UEA Time Series The UEA time series benchmark [31] consists of 30 datasets. Following the setup in [11], we select 10 datasets for evaluation.
Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints
There has been significant recent progress designing deep generative models that generate realistic sequence data such as text or music. Nevertheless, it remains difficult to incorporate high-level structure to guide the generative process, and many such models perform well on local coherence, but less so on global coherence. We propose a novel approach for incorporating global structure in the form of relational constraints between different subcomponents of an example (e.g., lines of a poem or measures of music). Our generative model has two parts: (i) one model to generate a realistic set of relational constraints, and (ii) a second model to generate realistic data satisfying these constraints. For model (i), we propose a constrained optimization algorithm that infers the relational constraints present in the training data, and then learn a generative model based on the resulting constraint data. In our experiments, we show that our approach significantly improves over state-of-the-art in terms of capturing high-level structure in the data, while performing comparably or better in terms of low-level structure. We also show that using constrained optimization for part (ii) as well leads to increased controllability with little decrease in quality compared to pure learning-based models.
A Probabilistic Framework for Imputing Genetic Distances in Spatiotemporal Pathogen Models
Stone, Haley, Du, Jing, Xue, Hao, Scotch, Matthew, Heslop, David, Züfle, Andreas, MacIntyre, Chandini Raina, Salim, Flora
Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known sequences within defined transmission chains, using time-aware evolutionary distance modeling. The method estimates pairwise divergence from collection dates and observed genetic distances, enabling biologically plausible imputation grounded in observed divergence patterns, without requiring sequence alignment or known transmission chains. Applied to highly pathogenic avian influenza A/H5 cases in wild birds in the United States, this approach supports scalable, uncertainty-aware augmentation of genomic datasets and enhances the integration of evolutionary information into spatiotemporal modeling workflows.
Pre-training Generative Recommender with Multi-Identifier Item Tokenization
Zheng, Bowen, Liu, Enze, Chen, Zhongfu, Ma, Zhongrui, Wang, Yue, Zhao, Wayne Xin, Wen, Ji-Rong
Generative recommendation autoregressively generates item identifiers to recommend potential items. Existing methods typically adopt a one-to-one mapping strategy, where each item is represented by a single identifier. However, this scheme poses issues, such as suboptimal semantic modeling for low-frequency items and limited diversity in token sequence data. To overcome these limitations, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Our approach involves two key innovations: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we leverage the RQ-VAE as the tokenizer backbone and treat model checkpoints from adjacent training epochs as semantically relevant tokenizers. This allows each item to be associated with multiple identifiers, enabling a single user interaction sequence to be converted into several token sequences as different data groups. For curriculum recommender pre-training, we introduce a curriculum learning scheme guided by data influence estimation, dynamically adjusting the sampling probability of each data group during recommender pre-training. After pre-training, we fine-tune the model using a single tokenizer to ensure accurate item identification for recommendation. Extensive experiments on three public benchmark datasets demonstrate that MTGRec significantly outperforms both traditional and generative recommendation baselines in terms of effectiveness and scalability.
Synthetic User Behavior Sequence Generation with Large Language Models for Smart Homes
Xu, Zhiyao, Zhao, Dan, Zou, Qingsong, Xiao, Jingyu, Jiang, Yong, Yuan, Zhenhui, Li, Qing
In recent years, as smart home systems have become more widespread, security concerns within these environments have become a growing threat. Currently, most smart home security solutions, such as anomaly detection and behavior prediction models, are trained using fixed datasets that are precollected. However, the process of dataset collection is time-consuming and lacks the flexibility needed to adapt to the constantly evolving smart home environment. Additionally, the collection of personal data raises significant privacy concerns for users. Lately, large language models (LLMs) have emerged as a powerful tool for a wide range of tasks across diverse application domains, thanks to their strong capabilities in natural language processing, reasoning, and problem-solving. In this paper, we propose an LLM-based synthetic dataset generation IoTGen framework to enhance the generalization of downstream smart home intelligent models. By generating new synthetic datasets that reflect changes in the environment, smart home intelligent models can be retrained to overcome the limitations of fixed and outdated data, allowing them to better align with the dynamic nature of real-world home environments. Specifically, we first propose a Structure Pattern Perception Compression (SPPC) method tailored for IoT behavior data, which preserves the most informative content in the data while significantly reducing token consumption. Then, we propose a systematic approach to create prompts and implement data generation to automatically generate IoT synthetic data with normative and reasonable properties, assisting task models in adaptive training to improve generalization and real-world performance.
Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints
There has been significant recent progress designing deep generative models that generate realistic sequence data such as text or music. Nevertheless, it remains difficult to incorporate high-level structure to guide the generative process, and many such models perform well on local coherence, but less so on global coherence. We propose a novel approach for incorporating global structure in the form of relational constraints between different subcomponents of an example (e.g., lines of a poem or measures of music). Our generative model has two parts: (i) one model to generate a realistic set of relational constraints, and (ii) a second model to generate realistic data satisfying these constraints. For model (i), we propose a constrained optimization algorithm that infers the relational constraints present in the training data, and then learn a generative model based on the resulting constraint data.
Hilbert Curve Based Molecular Sequence Analysis
Ali, Sarwan, Ali, Tamkanat E, Khan, Imdad Ullah, Patterson, Murray
Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed in the case of image-based data. To find a solution to this problem and to make Deep Learning (DL) models function to their maximum potential while capturing the important spatial information in the sequence data, we propose a universal Hibert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences. Our method can be globally applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of $94.5$\% and an F1 score of $93.9\%$ when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.
STKDRec: Spatial-Temporal Knowledge Distillation for Takeaway Recommendation
Zhao, Shuyuan, Chen, Wei, Shi, Boyan, Zhou, Liyong, Lin, Shuohao, Wan, Huaiyu
The takeaway recommendation system is designed to recommend users' future takeaway purchases based on their historical purchase behaviors, thereby improving user satisfaction and increasing merchant sales. Existing methods focus on incorporating auxiliary information or leveraging knowledge graphs to alleviate the sparsity issue of user purchase sequence data. However, two main challenges limit the performance of these approaches: (1) how to capture dynamic user preferences on complex geospatial information and (2) how to efficiently integrate spatial-temporal knowledge from graphs and sequence data with low calculation costs. In this paper, we propose a novel spatial-temporal knowledge distillation for takeaway recommendation model (STKDRec) based on the two-stage training process. Specifically, during the first pre-training stage, a spatial-temporal knowledge graph (STKG) encoder is pre-trained to extract the high-order spatial-temporal and collaborative associations within the STKG. During the second STKD stage, a spatial-temporal Transformer is employed to comprehensively model dynamic user preferences on various types of fine-grained geospatial information from a sequence perspective. Furthermore, the STKD strategy is introduced to adaptively fuse the rich spatial-temporal knowledge from the pre-trained STKG encoder and the spatial-temporal transformer while reducing the cost of model training. Extensive experiments on three real-world datasets show that our STKDRec significantly outperforms the state-of-the-art baselines. Our code is available at:https://github.com/Zhaoshuyuan0246/STKDRec.