Improving Progressive Generation with Decomposable Flow Matching

Neural Information Processing Systems

Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media.


Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Neural Information Processing Systems

A key challenge in offline multi-agent reinforcement learning (MARL) is achieving effective many-agent multi-step coordination in complex environments. In this work, we propose Oryx, a novel algorithm for offline cooperative MARL to directly address this challenge. Oryx adapts the recently proposed retention-based architecture Sable and combines it with a sequential form of implicit constraint Q-learning (ICQ), to develop a novel offline autoregressive policy update scheme. This allows Oryx to solve complex coordination challenges while maintaining temporal coherence over long trajectories. We evaluate Oryx across a diverse set of benchmarks from prior works--SMAC, RWARE, and Multi-Agent MuJoCo--covering tasks of both discrete and continuous control, varying in scale and difficulty. Oryx achieves state-of-the-art performance on more than 80% of the 65 tested datasets, outperforming prior offline MARL methods and demonstrating robust generalisation across domains with many agents and long horizons. Finally, we introduce new datasets to push the limits of many-agent coordination in offline MARL, and demonstrate Oryx's superior ability to scale effectively in such settings.


DINO-Foresight: Looking into the Future with DINO

Neural Information Processing Systems

Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework.


Brain-like Variational Inference

Neural Information Processing Systems

Inference in both brains and machines can be formalized by optimizing a shared objective: maximizing the evidence lower bound (ELBO) in machine learning, or minimizing variational free energy ($\mathcal{F}$) in neuroscience (ELBO = $-\mathcal{F}$). While this equivalence suggests a unifying framework, it leaves open how inference is implemented in neural systems. Here, we introduce FOND (*Free energy Online Natural-gradient Dynamics*), a framework that derives neural inference dynamics from three principles: (1) natural gradients on $\mathcal{F}$, (2) online belief updating, and (3) iterative refinement. We apply FOND to derive iP-VAE (*iterative Poisson variational autoencoder*), a recurrent spiking neural network that performs variational inference through membrane potential dynamics, replacing amortized encoders with iterative inference updates.
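
For reference, the identity ELBO = $-\mathcal{F}$ can be written out under the standard variational setup. The symbols here are assumptions, since the abstract does not define its notation: $x$ denotes observations, $z$ latent variables, $p(x, z)$ the generative model, and $q(z)$ the approximate posterior.

$$
\mathcal{F}(q) = \mathbb{E}_{q(z)}\left[\log q(z) - \log p(x, z)\right] = \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right) - \log p(x)
$$

$$
\mathrm{ELBO}(q) = -\mathcal{F}(q) = \log p(x) - \mathrm{KL}\left(q(z) \,\|\, p(z \mid x)\right) \le \log p(x)
$$

Because the KL term is non-negative, minimizing $\mathcal{F}$ and maximizing the ELBO are the same optimization, and both tighten a lower bound on the log evidence $\log p(x)$.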


Joint Relational Database Generation via Graph-Conditional Diffusion Models

Neural Information Processing Systems

Building generative models for relational databases (RDBs) is important for many applications, such as privacy-preserving data release and augmenting real datasets. However, most prior works either focus on single-table generation or adapt single-table models to the multi-table setting by relying on autoregressive factorizations and sequential generation. These approaches limit parallelism, restrict flexibility in downstream applications, and compound errors due to commonly made conditional independence assumptions. In this paper, we propose a fundamentally different approach: jointly modeling all tables in an RDB without imposing any table order. By using a natural graph representation of RDBs, we propose the Graph-Conditional Relational Diffusion Model (GRDM), which leverages a graph neural network to jointly denoise row attributes and capture complex inter-table dependencies. Extensive experiments on six real-world RDBs demonstrate that our approach substantially outperforms autoregressive baselines in modeling multi-hop inter-table correlations and achieves state-of-the-art performance on single-table fidelity metrics.


PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

Neural Information Processing Systems

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action--defined by multi-dimensional attributes such as time, context, and transaction type--constitutes a behavioral token. Modeling these high-cardinality, sparse, and irregular sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text.


Dynamic Shadow Unveils Invisible Semantics for Video Outpainting

Neural Information Processing Systems

Conventional video outpainting methods primarily focus on maintaining coherent textures and visual consistency across frames. However, they often fail at handling dynamic scenes due to the complex motion of objects or camera movement, leading to temporal incoherence and visible flickering artifacts across frames. This is primarily because they lack instance-aware modeling to accurately separate and track individual object motions throughout the video. In this paper, we propose a novel video outpainting framework that explicitly takes shadow-object pairs into consideration to enhance the temporal and spatial consistency of instances, even when they are temporarily invisible. Specifically, we first track the shadow-object pairs across frames and predict the instances in the scene to unveil the spatial regions of invisible instances. Then, these predictions guide the instance-aware optical flow completion to unveil the temporal motion of invisible instances. Next, this spatiotemporal instance guidance is used to steer the video outpainting process. Finally, a video-aware discriminator is implemented to enhance alignment among dynamic shadows and the extended semantics in the scene. Comprehensive experiments underscore the superiority of our approach, outperforming existing state-of-the-art methods in widely recognized benchmarks.


Towards Irreversible Attack: Fooling Scene Text Recognition via Multi-Population Coevolution Search

Neural Information Processing Systems

Recent work has shown that scene text recognition (STR) models are vulnerable to adversarial examples. Unlike in non-sequential vision tasks, the output sequence of STR models contains rich information. However, existing adversarial attacks against STR models can only lead to a few incorrect characters in the predicted text. These attack results still carry partial information about the original prediction and could be easily corrected by an external dictionary or a language model. Therefore, we propose the Multi-Population Coevolution Search (MPCS) method to attack each character in the image. We first decompose the global optimization objective into sub-objectives to solve the attack pixel concentration problem existing in previous attack methods. Because this distributed optimization paradigm introduces a new joint perturbation shift problem, we propose a novel coevolution energy function to solve it. Experiments on recent STR models show the superiority of our method.


Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

Neural Information Processing Systems

Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11% for future event prediction and around 7% for current activity understanding, compared to the corresponding baseline models trained without gaze regularization.


Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Neural Information Processing Systems

Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling.
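
The temperature scaling mentioned above can be illustrated with a minimal sketch. This shows only generic post-hoc temperature scaling of logits, not the DACA procedure itself, which the abstract does not specify; the function names `softmax` and `confidence` and the example logits are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: probabilities of logits divided by tau."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits, tau=1.0):
    """Confidence of the predicted class: the largest softmax probability."""
    return max(softmax(logits, tau))


# Hypothetical logits for a three-class prediction (illustrative values only).
example_logits = [4.0, 1.0, 0.5]
raw_confidence = confidence(example_logits, tau=1.0)
softened_confidence = confidence(example_logits, tau=2.0)
```

With `tau > 1` the distribution flattens and the reported confidence drops, while the predicted class (the argmax) stays the same. That is why a single scalar temperature can correct over-confidence without changing accuracy.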