
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

arXiv.org Artificial Intelligence

The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, and few works consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves the model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance, and it can also achieve at least 1.3 times generation acceleration.

Figure 1: Previous methods primarily focus on discarding Keys and Values within layers. In contrast, we share KV caches across layers based on their dissimilarity.

Although the KV cache greatly helps improve inference speed, it also significantly [...] During the LLM inference phase, the KV cache typically accounts for [...] Recent research has seen a proliferation of methods aimed at compressing KV caches to reduce memory consumption (Zandieh et al., 2024; Xu et al., 2024; Yang et al., 2024b; Zhang et al., 2024b;a; Dong et al., 2024). However, these efforts have predominantly focused on intra-layer KV cache compression within individual Transformer layers of LLMs.
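As a rough illustration of ranking layer pairs by dissimilarity, the sketch below compares flattened per-layer KV cache vectors and lists the most dissimilar pairs first. The function names, the toy vectors, and the choice of Euclidean distance are assumptions made for illustration; the abstract does not state which metric or selection procedure KVSharer uses.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_layer_pairs(layer_caches):
    """Return (distance, i, j) for every layer pair, most dissimilar first.

    layer_caches[i] is a hypothetical flattened KV cache for layer i.
    """
    pairs = [
        (euclidean(layer_caches[i], layer_caches[j]), i, j)
        for i, j in combinations(range(len(layer_caches)), 2)
    ]
    return sorted(pairs, reverse=True)


# Toy stand-ins for per-layer KV caches (assumed values, not real model data).
toy_caches = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [1.0, 1.0, 0.0],
]

ranked = rank_layer_pairs(toy_caches)
distance, layer_a, layer_b = ranked[0]
print(f"most dissimilar pair: layers {layer_a} and {layer_b}")
```

A real sharing strategy would also need to check that replacing one layer's cache with another's keeps the model output acceptable; that step is omitted here.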


Context is Key: A Benchmark for Forecasting with Essential Textual Information

arXiv.org Machine Learning

Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0/ .


On the potential of Optimal Transport in Geospatial Data Science

arXiv.org Artificial Intelligence

Prediction problems in geographic information science and transportation are often motivated by the possibility of enhancing operational efficiency and thereby reducing emissions. Examples range from predicting car sharing demand for relocation planning to forecasting traffic congestion for navigation purposes. However, conventional accuracy metrics ignore the spatial distribution of the errors, despite its relevance for operations. Here, we put forward a spatially aware evaluation metric and loss function based on Optimal Transport (OT). Our framework leverages partial OT and can minimize relocation costs in any spatial prediction problem. We showcase the advantages of OT-based evaluation over conventional metrics and further demonstrate the application of an OT loss function for improving forecasts of bike sharing demand and charging station occupancy.
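The point about conventional metrics ignoring where errors occur can be seen in a small example. The sketch below is a simplification, not the paper's framework: it assumes stations on a line with unit spacing and uses balanced one-dimensional OT (equal total mass), whereas the abstract describes partial OT for general spatial problems. All names and numbers are hypothetical.

```python
def mse(pred, true):
    """Mean squared error, blind to where the errors are located."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def ot_cost_1d(pred, true):
    """Balanced OT cost between two histograms on a unit-spaced line.

    In one dimension the optimal cost equals the sum of absolute
    cumulative differences between the two histograms.
    """
    if abs(sum(pred) - sum(true)) > 1e-9:
        raise ValueError("balanced OT requires equal total mass")
    cost, carried = 0.0, 0.0
    for p, t in zip(pred, true):
        carried += p - t      # mass that still has to move past this station
        cost += abs(carried)  # moving it one unit further costs |carried|
    return cost


true_demand = [10, 0, 0, 0, 0]
near_miss = [0, 10, 0, 0, 0]   # demand predicted one station away
far_miss = [0, 0, 0, 0, 10]    # demand predicted four stations away

print(mse(near_miss, true_demand), mse(far_miss, true_demand))
print(ot_cost_1d(near_miss, true_demand), ot_cost_1d(far_miss, true_demand))
```

Both forecasts have the same mean squared error, but the transport cost of the far miss is four times larger, which matches the relocation effort an operator would face.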


Anthropic Wants Its AI Agent to Control Your Computer

WIRED

It took a while for people to adjust to the idea of chatbots that seem to have minds of their own. The next leap into the unknown may involve trusting artificial intelligence to take over our computers, too. Anthropic, a high-flying competitor to OpenAI, announced today that it has taught its AI model Claude to do a range of things on a computer, including search the web, open applications, and input text using the mouse and keyboard. "I think we're going to enter into a new era where a model can use all of the tools that you use as a person to get tasks done," says Jared Kaplan, chief science officer at Anthropic and an associate professor at Johns Hopkins University. Kaplan showed WIRED a prerecorded demo in which an "agentic"--or tool-using--version of Claude had been asked to help plan an outing to see the sunrise at the Golden Gate Bridge with a friend.


The Persian Rug: solving toy models of superposition using large-scale symmetries

arXiv.org Artificial Intelligence

We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in arXiv:2209.10652, compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged) large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model's performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results can be found at https://github.com/KfirD/PersianRug .
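The architecture described above (a linear compression, then a linear decompression followed by a ReLU) can be written as a short forward pass. This is a minimal sketch: the bias term, the hand-picked weights, and the function names are assumptions for illustration and are not taken from the paper or its repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Elementwise ReLU."""
    return [max(0.0, v) for v in vector]


def autoencoder_forward(w_in, w_out, bias, x):
    """Compress x with w_in, decompress with w_out, add bias, apply ReLU."""
    hidden = matvec(w_in, x)
    pre_activation = matvec(w_out, hidden)
    return relu([p + b for p, b in zip(pre_activation, bias)])


# Two input features squeezed into one hidden dimension using opposite
# directions (hand-picked weights, assumed for illustration).
w_in = [[1.0, -1.0]]        # shape: 1 hidden x 2 inputs
w_out = [[1.0], [-1.0]]     # shape: 2 outputs x 1 hidden
bias = [0.0, 0.0]

print(autoencoder_forward(w_in, w_out, bias, [1.0, 0.0]))
print(autoencoder_forward(w_in, w_out, bias, [0.0, 1.0]))
print(autoencoder_forward(w_in, w_out, bias, [1.0, 1.0]))
```

With these weights each feature is recovered exactly when it is active alone, because the ReLU removes the negative interference on the other output. When both features are active they cancel in the hidden layer, which is why such encodings rely on sparse inputs.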


Less yet robust: crucial region selection for scene recognition

arXiv.org Artificial Intelligence

Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works that focus on convolutional neural networks have been shown to be able to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features. Thus, the model can perform learning via these regions to avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
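To make the idea of a mask that assigns weights to regions of a feature matrix concrete, the sketch below scales each feature value by a weight in (0, 1). The sigmoid gating, the function names, and the toy values are assumptions; the abstract does not specify how the mask is parameterized or trained, and the regularization term is not shown.

```python
import math


def sigmoid(z):
    """Map a real-valued mask logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_region_mask(features, mask_logits):
    """Scale each feature-map entry by the sigmoid of its mask logit.

    In a trained network mask_logits would be learnable parameters;
    here they are fixed toy values.
    """
    return [
        [f * sigmoid(m) for f, m in zip(feature_row, mask_row)]
        for feature_row, mask_row in zip(features, mask_logits)
    ]


features = [[2.0, 4.0], [6.0, 8.0]]
mask_logits = [[0.0, 0.0], [-50.0, 50.0]]  # neutral, neutral, suppress, keep

masked = apply_region_mask(features, mask_logits)
print(masked)
```

A logit of zero halves the feature, a strongly negative logit suppresses the region almost entirely, and a strongly positive logit passes it through nearly unchanged.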


FastSTI: A Fast Conditional Pseudo Numerical Diffusion Model for Spatio-temporal Traffic Data Imputation

arXiv.org Artificial Intelligence

High-quality spatiotemporal traffic data is crucial for intelligent transportation systems (ITS) and their data-driven applications. Inevitably, the issue of missing data caused by various disturbances threatens the reliability of data acquisition. Recent studies of diffusion probabilistic models have demonstrated the superiority of deep generative models in imputation tasks by precisely capturing the spatio-temporal correlation of traffic data. One drawback of diffusion models is their slow sampling/denoising process. In this work, we aim to accelerate the imputation process while retaining the performance. We propose a fast conditional diffusion model for spatiotemporal traffic data imputation (FastSTI). To speed up the process yet obtain better performance, we propose the application of a high-order pseudo-numerical solver. Our method further accelerates the imputation by introducing a predefined alignment strategy of variance schedule during the sampling process. Evaluating FastSTI on two types of real-world traffic datasets (traffic speed and flow) with different missing data scenarios proves its ability to impute higher-quality samples in only six sampling steps, especially under high missing rates (60% to 90%). The experimental results illustrate a speed-up of 8.3× over the current state-of-the-art model while achieving better performance.


ANT: Adaptive Noise Schedule for Time Series Diffusion Models

arXiv.org Machine Learning

Advances in diffusion models for generative artificial intelligence have recently propagated to the time series (TS) domain, demonstrating state-of-the-art performance on various tasks. However, prior works on TS diffusion models often borrow the framework of existing works proposed in other domains without considering the characteristics of TS data, leading to suboptimal performance. In this work, we propose Adaptive Noise schedule for Time series diffusion models (ANT), which automatically predetermines proper noise schedules for given TS datasets based on their statistics representing non-stationarity. Our intuition is that an optimal noise schedule should satisfy the following desiderata: 1) It linearly reduces the non-stationarity of TS data so that all diffusion steps are equally meaningful, 2) the data is corrupted to the random noise at the final step, and 3) the number of steps is sufficiently large. The proposed method is practical for use in that it eliminates the necessity of finding the optimal noise schedule with a small additional cost to compute the statistics for given datasets, which can be done offline before training. We validate the effectiveness of our method across various tasks, including TS forecasting, refinement, and generation, on datasets from diverse domains. Code is available at this repository: https://github.com/seunghan96/ANT.
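Desiderata 2 and 3 above can be checked numerically for any candidate schedule. The sketch below does not implement ANT; it uses a plain linear beta schedule with endpoints 1e-4 and 0.02, a common default in the diffusion literature, and tracks how much of the original signal survives at each step. The function names and endpoint values are assumptions for illustration.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances (requires num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """Cumulative product of (1 - beta): the fraction of signal variance left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


long_schedule = cumulative_signal(linear_betas(1000))
short_schedule = cumulative_signal(linear_betas(10))

print(f"signal left after 1000 steps: {long_schedule[-1]:.2e}")
print(f"signal left after 10 steps:   {short_schedule[-1]:.2f}")
```

With 1000 steps almost no signal remains at the final step, so the data is effectively pure noise. With only 10 steps most of the signal survives, which illustrates why the number of steps has to be large enough for a given schedule.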


Analyzing Deep Transformer Models for Time Series Forecasting via Manifold Learning

arXiv.org Artificial Intelligence

Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision. However, despite ongoing research efforts to better understand these models, the field still lacks a comprehensive understanding. This is particularly true for deep time series forecasting methods, where analysis and understanding work is relatively limited. Time series data, unlike image and text information, can be more challenging to interpret and analyze. To address this, we approach the problem from a manifold learning perspective, assuming that the latent representations of time series forecasting models lie near a low-dimensional manifold. In our study, we focus on analyzing the geometric features of these latent data manifolds, including intrinsic dimension and principal curvatures. Our findings reveal that deep transformer models exhibit similar geometric behavior across layers, and these geometric features are correlated with model performance. Additionally, we observe that untrained models initially have different structures, but they rapidly converge during training. By leveraging our geometric analysis and differentiable tools, we can potentially design new and improved deep forecasting neural networks. This approach complements existing analysis studies and contributes to a better understanding of transformer models in the context of time series forecasting. Code is released at https://github.com/azencot-group/GATLM.


Prompt Compression for Large Language Models: A Survey

arXiv.org Artificial Intelligence

Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompt methods, and leveraging insights from multimodality.