Investigating Hallucinations of Time Series Foundation Models through Signal Subspace Analysis

Neural Information Processing Systems

Time series foundation models (TSFMs) have emerged as a promising paradigm for time series analysis and forecasting, showing remarkable generalization performance across different domains. Despite efforts to study hallucinations in foundation models, hallucinations of TSFMs remain underexplored in the existing literature. In this paper, we formally define TSFM hallucinations in the zero-shot forecasting setting by examining whether a generated forecast exhibits different dynamics from those of the context. Our study reveals that TSFM hallucinations are associated with the loss of context information in hidden states during forward propagation. As such, we propose a methodology to identify signal subspaces of TSFMs and magnify the information they carry through intervention. Experiments demonstrate that our proposed intervention approach effectively mitigates hallucinations and improves forecasting performance. The signal strength measure computed from signal subspaces is strongly predictive of both hallucinations and the forecasting performance of the model. Our work contributes to a deeper understanding of TSFM trustworthiness that could foster future research in this direction.


Does Stochastic Gradient really succeed for bandits?

Neural Information Processing Systems

Recent works of Mei et al. (2023, 2024) have deepened the theoretical understanding of the *Stochastic Gradient Bandit* (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently *small* learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret *regimes* of SGB as a function of its learning rate. For two-armed bandits, we identify a sharp threshold, scaling with the sub-optimality gap $\Delta$, below which SGB achieves *logarithmic* regret on all instances, and above which it can incur *polynomial* regret on some instances. This result highlights the necessity of knowing (or estimating) $\Delta$ to ensure logarithmic regret with a constant learning rate. For general $K$-armed bandits, we further show the learning rate must scale inversely with $K$ to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.
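
For readers unfamiliar with SGB, the following is a minimal sketch of a softmax gradient bandit with a constant learning rate, in the commonly stated form without a baseline. It is an illustration under assumptions, not the authors' code: the function names `softmax` and `run_sgb`, the Bernoulli reward model, and the parameter values are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over arm preferences."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()


def run_sgb(means, eta, horizon, seed=0):
    """Run a softmax gradient bandit with constant learning rate `eta`.

    `means` are Bernoulli reward means (an assumption of this sketch).
    Returns the final policy and the cumulative pseudo-regret.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    theta = np.zeros(len(means))  # one preference per arm
    best = means.max()
    regret = 0.0
    for _ in range(horizon):
        pi = softmax(theta)
        arm = rng.choice(len(means), p=pi)
        reward = float(rng.random() < means[arm])
        # Stochastic gradient of the expected reward w.r.t. theta:
        # reward * (one_hot(arm) - pi)
        grad = -pi * reward
        grad[arm] += reward
        theta += eta * grad
        regret += best - means[arm]
    return softmax(theta), regret
```

Calling `run_sgb([0.9, 0.1], eta=0.5, horizon=500)` returns the final policy and the pseudo-regret for a two-armed instance with gap $\Delta = 0.8$; sweeping `eta` across instances with different gaps is one way to probe the regimes the abstract describes.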


Titans: Learning to Memorize at Test Time

Neural Information Processing Systems

For more than a decade, there has been extensive research on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called the hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a neural long-term memory module that learns to memorize historical context and helps attention attend to the current context while utilizing long-past information. We show that this neural memory has the advantage of fast, parallelizable training. From a memory perspective, we argue that attention, due to its limited context but accurate dependency modeling, performs as a short-term memory, while neural memory, due to its ability to memorize the data, acts as a long-term, more persistent memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, and time series tasks show that Titans are effective compared to baselines, and that they scale effectively to larger context windows in needle-in-haystack tasks.


Graph-Smoothed Bayesian Black-Box Shift Estimator and Its Information Geometry

Neural Information Processing Systems

Label shift adaptation aims to recover target class priors when the labelled source distribution $P$ and the unlabelled target distribution $Q$ share $P(X \mid Y) = Q(X \mid Y)$ but $P(Y) \neq Q(Y)$. Classical black box shift estimators invert an empirical confusion matrix of a frozen classifier, producing a brittle point estimate that ignores sampling noise and similarity among classes.
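
The classical estimator mentioned above can be sketched as follows. This is a minimal illustration of the confusion-matrix inversion only, assuming hard predictions from a frozen classifier; the function name `bbse_target_prior` and its arguments are hypothetical, and the graph-smoothed Bayesian estimator of the paper is not implemented here.

```python
import numpy as np


def bbse_target_prior(src_pred, src_label, tgt_pred, num_classes):
    """Classical black-box shift estimate from hard predictions.

    Solves C w = mu, where C[i, j] estimates P(pred = i, y = j) on the
    source and mu[i] estimates Q(pred = i) on the target. The weights
    w[j] estimate Q(y = j) / P(y = j).
    """
    src_pred = np.asarray(src_pred)
    src_label = np.asarray(src_label)
    tgt_pred = np.asarray(tgt_pred)

    # Empirical joint confusion matrix on the labelled source sample.
    confusion = np.zeros((num_classes, num_classes))
    np.add.at(confusion, (src_pred, src_label), 1.0)
    confusion /= len(src_label)

    # Empirical distribution of predictions on the unlabelled target.
    mu = np.bincount(tgt_pred, minlength=num_classes) / len(tgt_pred)

    # Plain inversion: no clipping or smoothing, so the estimate can be
    # noisy (or fail outright) when the matrix is ill-conditioned.
    weights = np.linalg.solve(confusion, mu)

    src_prior = np.bincount(src_label, minlength=num_classes) / len(src_label)
    return weights, weights * src_prior
```

Because the result comes from a single linear solve on empirical counts, small samples or near-singular confusion matrices can produce negative or wildly varying weights, which is the brittleness the abstract refers to.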


ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS

Neural Information Processing Systems

Feed-forward 3D Gaussian Splatting (3DGS) models have recently emerged as a promising solution for novel view synthesis, enabling one-pass inference without the need for per-scene 3DGS optimization. However, their scalability is fundamentally constrained by the limited capacity of their encoders, leading to degraded performance or excessive memory consumption as the number of input views increases. In this work, we analyze feed-forward 3DGS frameworks through the lens of the Information Bottleneck principle and introduce ZPressor, a lightweight, architecture-agnostic module that efficiently compresses multi-view inputs into a compact latent state $Z$ that retains essential scene information while discarding redundancy. Concretely, ZPressor enables existing feed-forward 3DGS models to scale to over 100 input views at 480P resolution on an 80GB GPU by partitioning the views into anchor and support sets and using cross attention to compress the information from the support views into the anchor views, forming the compressed latent state $Z$. We show that integrating ZPressor into several state-of-the-art feed-forward 3DGS models consistently improves performance under moderate numbers of input views and enhances robustness under dense view settings on two large-scale benchmarks, DL3DV-10K and RealEstate10K.
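
The anchor/support compression step can be pictured with a small single-head cross-attention sketch. Everything below is an assumption made for illustration: the function name `compress_views`, the caller-supplied anchor indices, the projection matrices, the residual connection, and the choice to let every anchor attend to all support tokens are not taken from the paper, which may assign support views to anchors differently.

```python
import numpy as np


def compress_views(features, anchor_idx, w_q, w_k, w_v):
    """Fuse support-view tokens into anchor-view tokens.

    features:   array of shape (num_views, tokens, dim)
    anchor_idx: indices of the views kept as anchors
    w_q/w_k/w_v: (dim, dim) projection matrices
    Returns an array of shape (num_anchors, tokens, dim).
    """
    num_views, _, dim = features.shape
    is_anchor = np.zeros(num_views, dtype=bool)
    is_anchor[anchor_idx] = True

    anchors = features[is_anchor]                    # (a, t, d)
    support = features[~is_anchor].reshape(-1, dim)  # (s * t, d)

    queries = anchors @ w_q   # anchors ask the questions
    keys = support @ w_k      # support views provide the content
    values = support @ w_v

    # Scaled dot-product attention with a stable softmax.
    scores = queries @ keys.T / np.sqrt(dim)         # (a, t, s * t)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Residual update: anchors keep their own tokens plus fused support.
    return anchors + attn @ values
```

The output has one token grid per anchor regardless of how many support views were supplied, which is the sense in which downstream cost depends on the number of anchors rather than the number of input views.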


CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Neural Information Processing Systems

Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory: structured schemata, flexible assimilation, and dynamic accommodation.


Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Neural Information Processing Systems

The Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose $\texttt{Mozart}$, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, $\texttt{Mozart}$ exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, $\texttt{Mozart}$ adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.


Wasserstein Convergence of Critically Damped Langevin Diffusions

Neural Information Processing Systems

Score-based Generative Models (SGMs) have achieved impressive performance in data generation across a wide range of applications and benefit from strong theoretical guarantees. Recently, methods inspired by statistical mechanics, in particular, Hamiltonian dynamics, have introduced Critically-damped Langevin Diffusions (CLDs), which define diffusion processes on extended spaces by coupling the data with auxiliary variables. These approaches, along with their associated score-matching and sampling procedures, have been shown to outperform standard diffusion-based samplers numerically. In this paper, we analyze a generalized dynamic that extends classical CLDs by introducing an additional hyperparameter controlling the noise applied to the data coordinate, thereby better exploiting the extended space. We further derive a novel upper bound on the sampling error of CLD-based generative models in the Wasserstein metric. This additional hyperparameter influences the smoothness of sample paths, and our discretization error analysis provides practical guidance for its tuning, leading to improved sampling performance.


Variance-Reduced Long-Term Rehearsal Learning with Quadratic Programming Reformulation

Neural Information Processing Systems

In machine learning, a critical class of decision-making problems involves *Avoiding Undesired Future* (AUF): given a predicted undesired outcome, how can one decide on actions to prevent it? Recently, the *rehearsal learning* framework has been proposed to address the AUF problem. While existing methods offer reliable decisions for single-round success, this paper considers long-term settings that involve coordinating multiple future outcomes, which is often required in real-world tasks. Specifically, we generalize the AUF objective to characterize a long-term decision target that incorporates cross-temporal relations among variables. As directly optimizing the *AUF probability* $\mathbb{P}_{\operatorname{AUF}}$ over this objective remains challenging, we derive an explicit expression for the objective and further propose a quadratic programming (QP) reformulation that transforms the intractable probabilistic AUF optimization into a tractable one. Under mild assumptions, we show that solutions to the QP reformulation are equivalent to those of the original AUF optimization, based on which we develop two novel rehearsal learning methods for long-term decision-making: (i) a *greedy* method that maximizes the single-round $\mathbb{P}_{\operatorname{AUF}}$ at each step, and (ii) a *far-sighted* method that accounts for future consequences in each decision, yielding a higher overall $\mathbb{P}_{\operatorname{AUF}}$ through an $L/(L+1)$ variance reduction in the AUF objective. We further establish an $\mathcal{O}(1/\sqrt{N})$ excess risk bound for decisions based on estimated parameters, ensuring reliable practical applicability with finite data.


Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Neural Information Processing Systems

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons are computationally expensive to conduct rigorously at scale and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets.