Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Neural Information Processing Systems

This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a Co-Reinforcement Learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefits of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
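The abstract does not spell out the training objective. As a rough sketch of the group-relative idea behind group relative policy optimization, the rewards of several responses sampled for the same prompt can be normalized within their group. The function name, the reward values, and the `eps` term below are illustrative assumptions, not code from ULM-R1.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean()
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return centered / (rewards.std() + eps)

# Hypothetical rewards for four responses sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.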


Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Neural Information Processing Systems

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve the reasoning of Vision-Language Models (VLMs) via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills under domain shift and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, the first two-stage reinforcement fine-tuning framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs, followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing the capability to address ubiquitous domain shift in visual reasoning tasks. To evaluate the visual reasoning capabilities of Reason-RFT, we reconstructed a comprehensive dataset encompassing visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three core dimensions. Experimental results demonstrate three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance under domain shift in typical visual reasoning tasks, outperforming alternative paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines. Reason-RFT introduces a robust training paradigm for visual reasoning; see the project website: Reason-RFT.


3D Equivariant Visuomotor Policy Learning via Spherical Projection

Neural Information Processing Systems

Equivariant models have recently been shown to improve the data efficiency of diffusion policy by a significant margin. However, prior work that explored this direction focused primarily on point cloud inputs generated by multiple cameras fixed in the workspace. This type of point cloud input is not compatible with the now-common setting where the primary input modality is an eye-in-hand RGB camera like a GoPro. This paper closes this gap by incorporating into the diffusion policy model a process that projects features from the 2D RGB camera image onto a sphere. This enables us to reason about symmetries in SO(3) without explicitly reconstructing a point cloud. We perform extensive experiments in both simulation and the real world that demonstrate that our method consistently outperforms strong baselines in terms of both performance and sample efficiency. Our work, Image-to-Sphere Policy (ISP), is the first SO(3)-equivariant policy learning framework for robotic manipulation that works using only monocular RGB inputs.




SHGR: A Generalized Maximal Correlation Coefficient

Neural Information Processing Systems

Traditional correlation measures, such as Pearson's and Spearman's coefficients, are limited in their ability to capture complex relationships, particularly nonlinear and multivariate dependencies. The Hirschfeld-Gebelein-Rényi (HGR) maximal correlation offers a powerful alternative by measuring the highest Pearson correlation achievable through nonlinear transformations of two random variables. However, estimating the HGR coefficient remains challenging due to the complexity of optimizing arbitrary nonlinear functions. We introduce a new coefficient, satisfying Rényi's axioms, based on the extension of HGR with Spearman's rank correlation: the Spearman HGR (SHGR). We propose a neural network-based estimator tailored to estimate (i) the bivariate correlation matrix, (ii) the multivariate correlations between a set of variables and another one, and (iii) the full correlation between two sets of variables. This estimate effectively detects nonlinear dependencies and demonstrates robustness to noise, outliers, and spurious correlations (hallucinations). Additionally, it achieves competitive computational efficiency through designed neural architectures. Comprehensive numerical experiments and feature selection tasks confirm that SHGR outperforms existing state-of-the-art methods.
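For reference, the standard HGR maximal correlation described above can be written as follows. This is the classical definition only; the abstract does not give the exact form of the Spearman-based SHGR variant, so it is not reproduced here. The symbols f, g, and ρ are the usual notation and are not taken from the listing.

```latex
\mathrm{HGR}(X, Y) \;=\; \sup_{f,\, g} \; \rho\bigl(f(X),\, g(Y)\bigr)
\;=\; \sup_{\substack{\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \mathbb{E}[f(X)^2] = \mathbb{E}[g(Y)^2] = 1}} \mathbb{E}\bigl[f(X)\, g(Y)\bigr]
```

Here ρ denotes the Pearson correlation and the supremum runs over measurable functions f and g with finite, nonzero variance.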


List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression

Neural Information Processing Systems

We study a relaxation of the problem of coupling probability distributions: a list of samples is generated from one distribution, and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. [9] for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr [38] and SpecInfer [34] across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens, which is not supported by existing schemes. We also provide a theoretical lower bound on the token-level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
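To make the list-level setup concrete, here is a minimal sketch of one shared-randomness construction that is consistent with the description above. It is an assumption for illustration and may differ from the paper's exact scheme. It uses the exponential-race form of Gumbel-max sampling (argmin of E_i / p_i with E_i ~ Exp(1) equals argmax of log p_i + G_i with G_i = -log E_i). The function name and the example distributions are hypothetical, and all probabilities are assumed strictly positive.

```python
import numpy as np

def list_coupled_samples(p, q, num_drafts, rng):
    """Draw num_drafts samples from p and one sample from q using shared noise."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # Shared randomness: one Exp(1) variable per (draft, symbol) pair.
    noise = rng.exponential(size=(num_drafts, p.size))
    # Each row is an independent exponential race, so every draft follows p.
    drafts = np.argmin(noise / p, axis=1)
    # The per-symbol minimum over drafts is Exp(num_drafts) distributed,
    # so the race over symbols still yields a sample that follows q.
    target = int(np.argmin(noise.min(axis=0) / q))
    return drafts, target

# Hypothetical draft and target distributions over three symbols.
rng = np.random.default_rng(0)
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
trials = 2000
accepts = 0
for _ in range(trials):
    drafts, target = list_coupled_samples(p, q, 4, rng)
    # Accept when any draft in the list matches the target sample.
    accepts += int(target in drafts)
acceptance_rate = accepts / trials
```

With a single draft this reduces to the usual shared-noise Gumbel-max coupling, and when p equals q the target always appears in the list.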


Efficient Large Language Model Inference with Neural Block Linearization

Neural Information Processing Systems

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pretrained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. The implementation is available at: https://github.com/LIONS-EPFL/NBL.
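As an illustration of the linear approximation step, the sketch below fits a Linear Minimum Mean Squared Error estimator from paired input and output activations using the textbook formula W = C_yx C_xx^{-1}, b = E[y] - W E[x]. It is not the NBL implementation: the function name, the small `ridge` term, and the synthetic data are assumptions, and the CCA-based error bound used for layer selection is not shown.

```python
import numpy as np

def fit_lmmse(inputs, outputs, ridge=1e-6):
    """Fit outputs ~ inputs @ weight.T + bias with the LMMSE closed form."""
    x_mean = inputs.mean(axis=0)
    y_mean = outputs.mean(axis=0)
    x_centered = inputs - x_mean
    y_centered = outputs - y_mean
    n = inputs.shape[0]
    cov_xx = x_centered.T @ x_centered / n
    cov_yx = y_centered.T @ x_centered / n
    # Solve C_xx W^T = C_xy instead of forming an explicit inverse.
    # The ridge term (an assumption) keeps the system well conditioned.
    regularized = cov_xx + ridge * np.eye(cov_xx.shape[0])
    weight = np.linalg.solve(regularized, cov_yx.T).T
    bias = y_mean - weight @ x_mean
    return weight, bias

# Synthetic stand-in for a layer's calibration activations (hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 8))
true_w = rng.normal(size=(4, 8))
true_b = rng.normal(size=4)
y = x @ true_w.T + true_b
w, b = fit_lmmse(x, y)
```

On exactly linear data the fit recovers the generating weights up to the ridge perturbation; on real attention outputs the residual error is what a selection criterion would need to bound.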


Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

Neural Information Processing Systems

Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model that provides the ground truth conditional score in the latent embedding space at each token position.


085ea366002345cab8a1bf0f0ad1b210-Paper-Conference.pdf

Neural Information Processing Systems

Recent years have witnessed the emergence of a spectrum of foundation models, covering a broad range of capabilities and costs. Often, we effectively use foundation models as feature generators and train classifiers that use the outputs of these models to make decisions. In this paper, we consider an increasingly relevant setting where we have two classifier stages. The first stage has access to features x and has the option to make a classification decision or defer, while incurring a cost, to a second classifier that has access to features x and z. This is similar to the "learning to defer" setting, with the important difference that we train both classifiers jointly, and the second classifier has access to more information. The natural loss for this setting is an ℓ01c loss, where a penalty is paid for incorrect classification, as in ℓ01, but an additional penalty c is paid for consulting the second classifier. The ℓ01c loss is unwieldy for training. Our primary contribution in this paper is the derivation of a hinge-based surrogate loss ℓchinge that is much more amenable to training but also satisfies the property that ℓchinge-consistency implies ℓ01c-consistency.
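The two-stage loss described above can be sketched in a few lines. This assumes the additive reading of the abstract: a unit penalty for a wrong final prediction plus a cost c whenever the second classifier is consulted. The function and argument names are hypothetical, and the hinge-based surrogate is not shown because the abstract does not define it.

```python
def zero_one_c_loss(label, first_prediction, second_prediction, deferred, c):
    """0-1 loss with an added cost c for deferring to the second classifier."""
    if deferred:
        # The second classifier (which sees x and z) decides, at extra cost c.
        return float(second_prediction != label) + c
    # The first classifier (which sees only x) decides on its own.
    return float(first_prediction != label)

# Hypothetical cases with deferral cost c = 0.25.
loss_no_defer_correct = zero_one_c_loss(1, 1, 0, deferred=False, c=0.25)
loss_defer_correct = zero_one_c_loss(1, 0, 1, deferred=True, c=0.25)
loss_defer_wrong = zero_one_c_loss(1, 1, 0, deferred=True, c=0.25)
```

Under this reading, deferring pays off only when the second classifier's gain in accuracy outweighs the cost c.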