Saini, Uday Singh
Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers
Sun, Jiarui, Yeh, Chin-Chia Michael, Fan, Yujie, Dai, Xin, Fan, Xiran, Jiang, Zhimeng, Saini, Uday Singh, Lai, Vivian, Wang, Junpeng, Chen, Huiyuan, Zhuang, Zhongfang, Zheng, Yan, Chowdhary, Girish
Time series forecasting at scale presents significant challenges for modern prediction systems, particularly when dealing with large sets of synchronized series, such as in a global payment network. In such systems, three key challenges must be overcome for accurate and scalable predictions: 1) emergence of new entities, 2) disappearance of existing entities, and 3) the large number of entities present in the data. The recently proposed Inverted Transformer (iTransformer) architecture has shown promising results by effectively handling variable entities. However, its practical application in large-scale settings is limited by quadratic time and space complexity ($O(N^2)$) with respect to the number of entities $N$. In this paper, we introduce EiFormer, an improved inverted transformer architecture that maintains the adaptive capabilities of iTransformer while reducing computational complexity to linear scale ($O(N)$). Our key innovation lies in restructuring the attention mechanism to eliminate redundant computations without sacrificing model expressiveness. Additionally, we incorporate a random projection mechanism that not only enhances efficiency but also improves prediction accuracy through better feature representation. Extensive experiments on the public LargeST benchmark dataset and a proprietary large-scale time series dataset demonstrate that EiFormer significantly outperforms existing methods in both computational efficiency and forecasting accuracy. Our approach enables practical deployment of transformer-based forecasting in industrial applications where handling time series at scale is essential.
Visual Attention Exploration in Vision-Based Mamba Models
Wang, Junpeng, Yeh, Chin-Chia Michael, Saini, Uday Singh, Das, Mahashweta
State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering linear complexity that scales better than transformers. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial location. Additionally, the order used to arrange the patches into a sequence also significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. This tool enables a deeper understanding of how attention is distributed across patches in different Mamba blocks and how it evolves throughout a Mamba model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model's behavior.
A Compact Model for Large-Scale Time Series Forecasting
Yeh, Chin-Chia Michael, Fan, Xiran, Jiang, Zhimeng, Fan, Yujie, Chen, Huiyuan, Saini, Uday Singh, Lai, Vivian, Dai, Xin, Wang, Junpeng, Zhuang, Zhongfang, Wang, Liang, Zheng, Yan
Spatio-temporal data, which commonly arise in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represent a special category of multivariate time series. They exhibit two distinct characteristics: high dimensionality and commensurability across spatial locations. These attributes call for computationally efficient modeling approaches and facilitate the use of univariate forecasting models in a channel-independent fashion. SparseTSF, a recently introduced competitive univariate forecasting model, harnesses periodicity to achieve compactness by concentrating on cross-period dynamics, thereby extending the Pareto frontier with respect to model size and predictive performance. Nonetheless, it underperforms on spatio-temporal data due to an inadequate capture of intra-period temporal dependencies. To address this shortcoming, we propose UltraSTF, which integrates a cross-period forecasting module with an ultra-compact shape bank component. Our model effectively detects recurring patterns in time series through the attention mechanism of the shape bank component, thereby strengthening its ability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while employing fewer than 0.2% of the parameters required by the second-best approaches, thus further extending the Pareto frontier of existing methods.
RPMixer: Shaking Up Time Series Forecasting with Random Projections for Large Spatial-Temporal Data
Yeh, Chin-Chia Michael, Fan, Yujie, Dai, Xin, Saini, Uday Singh, Lai, Vivian, Aboagye, Prince Osei, Wang, Junpeng, Chen, Huiyuan, Zheng, Yan, Zhuang, Zhongfang, Wang, Liang, Zhang, Wei
Spatial-temporal forecasting systems play a crucial role in addressing numerous real-world challenges. In this paper, we investigate the potential of addressing spatial-temporal forecasting problems using general time series forecasting models, i.e., models that do not leverage the spatial relationships among the nodes. We propose a all-Multi-Layer Perceptron (all-MLP) time series forecasting architecture called RPMixer. The all-MLP architecture was chosen due to its recent success in time series forecasting benchmarks. Furthermore, our method capitalizes on the ensemble-like behavior of deep neural networks, where each individual block within the network behaves like a base learner in an ensemble model, particularly when identity mapping residual connections are incorporated. By integrating random projection layers into our model, we increase the diversity among the blocks' outputs, thereby improving the overall performance of the network. Extensive experiments conducted on the largest spatial-temporal forecasting benchmark datasets demonstrate that the proposed method outperforms alternative methods, including both spatial-temporal graph models and general forecasting models.
Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach
Aboagye, Prince, Zheng, Yan, Wang, Junpeng, Saini, Uday Singh, Dai, Xin, Yeh, Michael, Fan, Yujie, Zhuang, Zhongfang, Jain, Shubham, Wang, Liang, Zhang, Wei
The emergence of pre-trained models has significantly impacted Natural Language Processing (NLP) and Computer Vision to relational datasets. Traditionally, these models are assessed through fine-tuned downstream tasks. However, this raises the question of how to evaluate these models more efficiently and effectively. In this study, we explore a novel approach where we leverage the metafeatures associated with each entity as a source of worldly knowledge and employ entity representations from the models. We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models. Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models, and image models. Pre-training on large models is becoming increasingly common in various machine learning applications, thanks to the growing amount of user-generated content. This is evident in areas such as Natural Language Processing (NLP) with models like GPT (Generative Pretrained Transformer), and in the vision-language domain with models like CLIP. Typically, the effectiveness of these models is evaluated using downstream tasks. However, these can be relatively costly if all tasks need to be performed.
CARL-G: Clustering-Accelerated Representation Learning on Graphs
Shiao, William, Saini, Uday Singh, Liu, Yozen, Zhao, Tong, Shah, Neil, Papalexakis, Evangelos E.
Self-supervised learning on graphs has made large strides in achieving great performance in various downstream tasks. However, many state-of-the-art methods suffer from a number of impediments, which prevent them from realizing their full potential. For instance, contrastive methods typically require negative sampling, which is often computationally costly. While non-contrastive methods avoid this expensive step, most existing methods either rely on overly complex architectures or dataset-specific augmentations. In this paper, we ask: Can we borrow from classical unsupervised machine learning literature in order to overcome those obstacles? Guided by our key insight that the goal of distance-based clustering closely resembles that of contrastive learning: both attempt to pull representations of similar items together and dissimilar items apart. As a result, we propose CARL-G - a novel clustering-based framework for graph representation learning that uses a loss inspired by Cluster Validation Indices (CVIs), i.e., internal measures of cluster quality (no ground truth required). CARL-G is adaptable to different clustering methods and CVIs, and we show that with the right choice of clustering method and CVI, CARL-G outperforms node classification baselines on 4/5 datasets with up to a 79x training speedup compared to the best-performing baseline. CARL-G also performs at par or better than baselines in node clustering and similarity search tasks, training up to 1,500x faster than the best-performing baseline. Finally, we also provide theoretical foundations for the use of CVI-inspired losses in graph representation learning.
A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens
Saini, Uday Singh, Papalexakis, Evangelos E.
Despite their increasing popularity and success in a variety of supervised learning problems, deep neural networks are extremely hard to interpret and debug: Given and already trained Deep Neural Net, and a set of test inputs, how can we gain insight into how those inputs interact with different layers of the neural network? Furthermore, can we characterize a given deep neural network based on it's observed behavior on different inputs? In this paper we propose a novel factorization based approach on understanding how different deep neural networks operate. In our preliminary results, we identify fascinating patterns that link the factorization rank (typically used as a measure of interestingness in unsupervised data analysis) with how well or poorly the deep network has been trained. Finally, our proposed approach can help provide visual insights on how high-level. interpretable patterns of the network's input behave inside the hidden layers of the deep network.