Goto

Collaborating Authors

 justification


Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Neural Information Processing Systems

Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious "hallucination" issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize "robust" hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD).


Reward-Instruct: AReward-Centric Approach to Fast Photo-Realistic Image Generation

Neural Information Processing Systems

This paper addresses the challenge of achieving high-quality and fast image generation that aligns with complex human preferences. While recent advancements in diffusion models and distillation have enabled rapid generation, the effective integration of reward feedback for improved abilities like controllability and preference alignment remains a key open problem. Existing reward-guided post-training approaches targeting accelerated few-step generation often deem diffusion distillation losses indispensable. However, in this paper, we identify an interesting yet fundamental paradigm shift: as conditions become more specific, well-designed reward functions emerge as the primary driving force in training strong, few-step image generative models. Motivated by this insight, we introduce Reward-Instruct, a novel and surprisingly simple reward-centric approach for converting pre-trained base diffusion models into reward-enhanced few-step generators. Unlike existing methods, Reward-Instruct does not rely on expensive yet tricky diffusion distillation losses.


Machine Unlearning viaTask Simplex Arithmetic

Neural Information Processing Systems

As foundation Vision-Language Models (VLMs) unlock fine-tuning on smaller datasets while leveraging large-scale pre-training data, machine unlearning becomes critical in addressing privacy concerns and regulatory compliance. Task vector, representing the difference between parameters of models fine-tuned with and without specific data, is a popular retraining-free unlearning strategy. However, we observe that task vectors exhibit substantial sensitivity to various fine-tuning configurations, resulting in unstable unlearning effectiveness that correlates negatively with the prediction-level variance. While aggregating multiple functions (e.g., VLM with classifier) whose parameters are represented by different task vectors reduces function variance and improves unlearning, the computational cost of obtaining numerous task vectors and aggregating functions is computationally high. Thus, in order to capture the space of task vectors induced by diverse fine-tuning strategies, we propose modeling it within the convex hull of (Q 1)-simplex whose vertices represent Q task vectors. Although a function ensemble can be formed by sampling numerous task vectors from such a simplex, we derive a closed-form ensemble of an infinite number of functions whose parameters are uniformly sampled from the simplex, enabling efficient function-level task vector ensembling with enhanced unlearning performance. Extensive experiments and analyses across diverse datasets and scenarios demonstrate the efficacy of our method.


Track3R: Joint Point Map and Trajectory Prior for Spatiotemporal 3DUnderstanding

Neural Information Processing Systems

Understanding the 3D world from 2D monocular videos is a crucial ability for AI. Recently, to tackle this underdetermined task, end-to-end 3D geometry priors have been sought after, such as pre-trained point map models at scale. These models enable robust 3D understanding from casually taken videos, providing accurate object shapes disentangled from uncertain camera parameters. However, they still struggle when affected by object deformation and dynamics, failing to establish consistent correspondence over the frames. Furthermore, their architectures are typically limited to pairwise frame processing, which is insufficient for capturing complex motion dynamics over extended sequences. To address these limitations, we introduce Track3R, a novel framework that integrates a new architecture and task to jointly predict point map and motion trajectories across multiple frames from video input. Specifically, our key idea is modeling two disentangled trajectories for each point: one representing object motion and the other camera poses. This design not only can enable understanding of the 3D object dynamics, but also facilitates the learning of more robust priors for 3D shapes in dynamic scenes. In our experiments, Track3R demonstrates significant improvements in a joint point mapping and 3D motion estimation task for dynamic scenes, such as 25.8% improvements in the motion estimation, and 15.7% in the point mapping accuracy.


Sample-Adaptivity Tradeoff in On-Demand Sampling

Neural Information Processing Systems

We study the tradeoff between sample complexity and round complexity in ondemand sampling, where the learning algorithm adaptively samples from k distributions over a limited number of rounds. In the realizable setting of MultiDistribution Learning (MDL), we show that the optimal sample complexity of an r-round algorithm scales approximately as dkฮ˜(1/r)/ฮต. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of eO((d + k)/ฮต2) within eO( k) rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the eO( k)-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.


Bridging Equivariant GNNs and Spherical CNNs for Structured Physical Domains

Neural Information Processing Systems

Many modeling tasks from disparate domains can be framed in the same way, computing spherical signals from geometric inputs, for example, computing the radar response of different objects or navigating through an environment. This paper introduces G2Sphere, a general method for mapping object geometries to spherical signals. G2Sphere operates entirely in Fourier space, encoding geometric structure into latent Fourier features using equivariant neural networks and outputting the Fourier coefficients of the continuous target signal, which can be evaluated at any resolution. By utilizing a hybrid GNN-spherical CNN architecture, our method achieves a much higher frequency output signal than comparable equivariant GNNs and avoids hand-engineered geometry features used previously by purely spherical methods. We perform experiments on various challenging domains, including radar response modeling, aerodynamic drag prediction, and policy learning for manipulation and navigation. We find that G2Sphere outperforms competitive baselines in terms of accuracy and inference time, and we demonstrate that equivariance and Fourier features lead to improved sample efficiency and generalization.


Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

Neural Information Processing Systems

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such as learning rate grafting and stale preconditioning to achieve performance at-scale. These heuristics increase algorithmic complexity, necessitate further hyperparameter tuning, and lack theoretical justification. This paper investigates these heuristics from the angle of Frobenius norm approximation to full-matrix Adam and decouples the preconditioner's eigenvalues and eigenbasis updates. We show that grafting from Adam mitigates the staleness and mis-scaling of the preconditioner's eigenvalues and how correcting the eigenvalues directly eliminates the need for learning rate grafting. To manage the error induced by infrequent eigenbasis computations, we propose an adaptive criterion for determining the eigenbasis computation frequency motivated by terminating a warm-started QR algorithm. This criterion decouples the update frequency of different preconditioner matrices and enables us to investigate the impact of approximation error on convergence. These practical techniques offer a principled angle towards removing Shampoo's heuristics and developing improved Kronecker-factorization-based training algorithms.


Color Conditional Generation with Sliced Wasserstein Guidance

Neural Information Processing Systems

We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-ofthe-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt.


Energy Loss Functions for Physical Systems

Neural Information Processing Systems

Effectively leveraging prior knowledge of a system's physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions.


SMARTraj2: AStable Multi-City Adaptive Method for Multi-View Spatio-Temporal Trajectory Representation Learning

Neural Information Processing Systems

Spatio-temporal trajectory representation learning plays a crucial role in various urban applications such as transportation systems, urban planning, and environmental monitoring. Existing methods can be divided into single-view and multi-view approaches, with the latter offering richer representations by integrating multiple sources of spatio-temporal data. However, these methods often struggle to generalize across diverse urban scenes due to multi-city structural heterogeneity, which arises from the disparities in road networks, grid layouts, and traffic regulations across cities, and the amplified seesaw phenomenon, where optimizing for one city, view, or task can degrade performance in others. These challenges hinder the deployment of trajectory learning models across multiple cities, limiting their realworld applicability. In this work, we propose SMARTraj2, a novel stable multi-city adaptive method for multi-view spatio-temporal trajectory representation learning. Specifically, we introduce a feature disentanglement module to separate domaininvariant and domain-specific features, and a personalized gating mechanism to dynamically stabilize the contributions of different views and tasks. Our approach achieves superior generalization across heterogeneous urban scenes while maintaining robust performance across multiple downstream tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of SMARTraj2 in enhancing cross-city generalization and outperforming state-of-the-art methods.