
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

Neural Information Processing Systems

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement-learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
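As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the rewards of several sampled responses to the same prompt against their own group statistics. The function name, the use of the population standard deviation, and the example reward values are assumptions made for illustration; the abstract does not specify Omni-R1's hierarchical rewards or its exact normalization.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    The population standard deviation and `eps` are illustrative assumptions.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Responses above the group mean get positive advantages, below get negative.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled keyframe selections on one task.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed in this sketch; a group whose rewards are all equal yields zero advantages.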


[Figure: panels (a), (b), (c) showing vision representations, language representations, and concept activations for the example captions "a view of the lighthouse and sky" and "person works at his desk in office"; labels: "different concepts", "the same concept"]

Neural Information Processing Systems

However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.
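The cosine-similarity measure mentioned above can be sketched as follows. This shows only the similarity computation between one vision vector and one language vector; the function name and the toy vectors are assumptions, and the sketch does not reproduce VL-SAE's distance-based encoder or its modality-specific decoders, which the abstract does not detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length representation vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical image and text embeddings in a shared 3-dimensional space.
image_vec = [0.5, 1.0, 0.0]
text_vec = [1.0, 2.0, 0.0]
similarity = cosine_similarity(image_vec, text_vec)
```

The value depends only on direction, not magnitude, which is why rescaling either embedding leaves the similarity unchanged.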


AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

Sarthak Kumar Maharana, Saksham Singh Kushwaha, Baoming Zhang, Adrian Rodriguez, Songtao Wei, Yapeng Tian

Neural Information Processing Systems

AVROBUSTBENCH comprises four audio-visual benchmark datasets, AUDIOSET-2C, VGGSOUND-2C, KINETICS-2C, and EPICKITCHENS-2C, each incorporating 75 bimodal audio-visual corruptions that are co-occurring and correlated. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases.


Differential Privacy for Euclidean Jordan Algebra with Applications to Private Symmetric Cone Programming

Neural Information Processing Systems

In this paper, we study differentially private mechanisms for functions whose outputs lie in a Euclidean Jordan algebra. Euclidean Jordan algebras capture many important mathematical structures and form the foundation of linear programming, second-order cone programming, and semidefinite programming. Our main contribution is a generic Gaussian mechanism for such functions, with sensitivity measured in ℓ2, ℓ1, and ℓ∞ norms. Notably, this framework includes the important case where the function outputs are symmetric matrices, and sensitivity is measured in the Frobenius, nuclear, or spectral norm. We further derive private algorithms for solving symmetric cone programs under various settings, using a combination of the multiplicative weights update method and our generic Gaussian mechanism. As an application, we present differentially private algorithms for semidefinite programming, resolving a major open question posed by [Hsu, Roth, Roughgarden, and Ullman, ICALP 2014].
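For orientation, the sketch below applies the classical Gaussian mechanism to a symmetric matrix whose sensitivity is bounded in the Frobenius norm. It is not the paper's generic Jordan-algebra mechanism: the noise scale is the textbook calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, which holds for 0 < epsilon < 1, and all function names and the example matrix are assumptions. Noise is drawn for the upper triangle and mirrored, so the output stays symmetric; the upper-triangle difference between neighbouring outputs has Euclidean norm at most the Frobenius sensitivity.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("this calibration requires 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatize_symmetric(matrix, sensitivity, epsilon, delta, rng=None):
    """Add mirrored Gaussian noise to a symmetric matrix given as a list of rows."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    n = len(matrix)
    noisy = [list(row) for row in matrix]
    for i in range(n):
        for j in range(i, n):
            # One independent draw per upper-triangle entry, copied to (j, i).
            noisy[i][j] = matrix[i][j] + rng.gauss(0.0, sigma)
            noisy[j][i] = noisy[i][j]
    return noisy


# Hypothetical 3x3 symmetric statistic with Frobenius sensitivity 1.
stat = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 4.0]]
private_stat = privatize_symmetric(stat, sensitivity=1.0, epsilon=0.5,
                                   delta=1e-5, rng=random.Random(0))
```

Mirroring the upper triangle is post-processing, so it does not weaken the guarantee of the underlying vector-valued mechanism.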


ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

Neural Information Processing Systems

With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform.


OS-HARM: A Benchmark for Measuring Safety of Computer Use Agents

Neural Information Processing Systems

Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-HARM, a new benchmark for measuring safety of computer use agents. OS-HARM is built on top of the OSWorld environment (Xie et al., 2024) and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior.


PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion

Neural Information Processing Systems

Point cloud completion is essential for robust 3D perception in safety-critical applications such as robotics and augmented reality. However, existing models perform static inference and rely heavily on inductive biases learned during training, limiting their ability to adapt to novel structural patterns and sensor-induced distortions at test time. To address this limitation, we propose PointMAC, a meta-learned framework for robust test-time adaptation in point cloud completion. It enables sample-specific refinement without requiring additional supervision. Our method optimizes the completion model under two self-supervised auxiliary objectives that simulate structural and sensor-level incompleteness.


Understanding Generalization in Physics Informed Models through Affine Variety Dimensions

Neural Information Processing Systems

Physics-informed machine learning is gaining significant traction for enhancing statistical performance and sample efficiency through the integration of physical knowledge. However, current theoretical analyses often presume complete prior knowledge in non-hybrid settings, overlooking the crucial integration of observational data, and are frequently limited to linear systems, unlike the prevalent nonlinear nature of many real-world applications. To address these limitations, we introduce a unified residual form that unifies collocation and variational methods, enabling the incorporation of incomplete and complex physical constraints in hybrid learning settings. Within this formulation, we establish that the generalization performance of physics-informed regression in such hybrid settings is governed by the dimension of the affine variety associated with the physical constraint, rather than by the number of parameters. This enables a unified analysis that is applicable to both linear and nonlinear equations. We also present a method to approximate this dimension and provide experimental validation of our theoretical findings.


Fixed-Point RNNs: Interpolating from Diagonal to Dense

Neural Information Processing Systems

Linear recurrent neural networks (RNNs) and state-space models (SSMs) such as Mamba have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e.


RepGuard: Adaptive Feature Decoupling for Robust Backdoor Defense in Large Language Models

Neural Information Processing Systems

Backdoor attacks pose a significant threat to large language models (LLMs) by embedding malicious triggers that manipulate model behavior. However, existing defenses primarily rely on prior knowledge of backdoor triggers or targets and offer only superficial mitigation strategies, thus struggling to fundamentally address the inherent reliance on unreliable features. To address these limitations, we propose a novel defense strategy, RepGuard, that strengthens LLM resilience by adaptively separating abnormal features from useful semantic representations, rendering the defense agnostic to specific trigger patterns. Specifically, we first introduce a dual-perspective feature localization strategy that integrates local consistency and sample-wise deviation metrics to identify suspicious backdoor patterns. Based on this identification, an adaptive mask generation mechanism is applied to isolate backdoor-targeted shortcut features by decomposing hidden representations into independent spaces, while preserving task-relevant semantics.