Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation Shuo Wang, Y

Neural Information Processing Systems

Vision-Language Navigation (VLN) is a critical task for developing embodied agents that can follow natural language instructions to navigate in complex real-world environments. Recent advances driven by large pretrained models have significantly improved generalization and instruction grounding compared to traditional approaches. However, reasoning strategies in this task remain underexplored. Navigation is action-centric and long-horizon, while Chain-of-Thought (CoT) reasoning has mainly shown success in static tasks such as visual question answering. To address this gap, we conduct the first systematic evaluation of reasoning strategies, including No-Think (direct action prediction), Pre-Think (reasoning before action), and Post-Think (reasoning after action). Surprisingly, our findings reveal a Test-time Reasoning Collapse issue, where reasoning during testing degrades navigation accuracy, highlighting the challenges of integrating reasoning into embodied navigation.


Massive Sound Embedding Benchmark (MSEB)

Neural Information Processing Systems

Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' (a single vector, a sequence of continuous or discrete representations, or another structured form), which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headroom, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth.


VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Neural Information Processing Systems

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we showcase how to enhance the slow-thinking capabilities of vision-language models using reinforcement learning, to advance the state of the art, without relying on costly distillation. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem.
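The vanishing advantages problem mentioned above can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which rewards are standardized within a group of responses to the same prompt; the helper `select_for_replay` is a hypothetical filter written for illustration and is not the authors' SSR implementation, whose details are not given in this abstract.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def select_for_replay(groups, threshold=1e-6):
    """Hypothetical filter: keep only samples whose advantage is non-zero.

    Each group is (prompt, responses, rewards). Groups where every response
    gets the same reward yield all-zero advantages and contribute nothing.
    """
    kept = []
    for prompt, responses, rewards in groups:
        advantages = group_relative_advantages(rewards)
        for response, adv in zip(responses, advantages):
            if abs(adv) > threshold:
                kept.append((prompt, response, float(adv)))
    return kept


groups = [
    # Every response is correct: advantages vanish, no training signal.
    ("q1", ["a", "b", "c", "d"], [1.0, 1.0, 1.0, 1.0]),
    # Mixed outcomes: advantages are non-zero and informative.
    ("q2", ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 1.0]),
]
replay = select_for_replay(groups)
```

In this toy setup only the samples from `q2` survive the filter, which is the kind of signal a replay mechanism could reuse once most groups become uniformly correct or uniformly wrong.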


OmniBench: Towards The Future of Universal Omni-Language Models

Neural Information Processing Systems

Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains underexplored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities.


Meta Guidance: Incorporating Inductive Biases into Deep Time Series Imputers

Neural Information Processing Systems

Missing values, frequently encountered in time series data, can significantly impair the effectiveness of analytical methods. While deep imputation models have emerged as the predominant approach due to their superior performance, explicitly incorporating inductive biases aligned with time-series characteristics offers substantial improvement potential. Taking advantage of non-stationarity and periodicity in time series, two domain-specific inductive biases are designed: (1) Non-Stationary Guidance, which operationalizes the proximity principle to address highly non-stationary series by emphasizing temporal neighbors, and (2) Periodic Guidance, which exploits periodicity patterns through learnable weight allocation across historical periods. Building upon these complementary mechanisms, the overall module, named Meta Guidance, dynamically fuses both guidances through data-adaptive weights learned from the specific input sample. Experiments on nine benchmark datasets demonstrate that integrating Meta Guidance into existing deep imputation architectures achieves an average 27.39% reduction in imputation error compared to state-of-the-art baselines.
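The two guidances and their fusion can be sketched for a single missing time step as follows. This is a toy illustration under stated assumptions, not the paper's module: the exponential proximity kernel, the parameter `tau`, the fixed `period_weights`, and the scalar `gate` are all hypothetical stand-ins for quantities that the abstract describes as learned from data.

```python
import numpy as np


def non_stationary_guidance(x, observed, t, tau=2.0):
    """Proximity-weighted average of observed values; closer steps weigh more."""
    idx = np.flatnonzero(observed)
    weights = np.exp(-np.abs(idx - t) / tau)
    return float(np.sum(weights * x[idx]) / np.sum(weights))


def periodic_guidance(x, observed, t, period, period_weights):
    """Weighted average of observed values at the same phase in earlier periods."""
    values, weights = [], []
    for k, w in enumerate(period_weights, start=1):
        s = t - k * period
        if s >= 0 and observed[s]:
            values.append(x[s])
            weights.append(w)
    if not weights:
        return None  # no usable history at this phase
    weights = np.asarray(weights)
    return float(np.dot(weights, values) / weights.sum())


def meta_guidance(x, observed, t, period, period_weights, gate=0.5, tau=2.0):
    """Fuse both guidances; `gate` stands in for a data-adaptive weight."""
    near = non_stationary_guidance(x, observed, t, tau)
    per = periodic_guidance(x, observed, t, period, period_weights)
    if per is None:
        return near
    return gate * near + (1.0 - gate) * per


# A perfectly periodic series (period 4) with the value at t=9 missing.
x = np.tile(np.array([0.0, 1.0, 2.0, 3.0]), 3)
observed = np.ones_like(x, dtype=bool)
observed[9] = False
x[9] = np.nan

near_est = non_stationary_guidance(x, observed, 9)
per_est = periodic_guidance(x, observed, 9, period=4, period_weights=[0.7, 0.3])
fused = meta_guidance(x, observed, 9, period=4, period_weights=[0.7, 0.3], gate=0.0)
```

On this strictly periodic toy series the periodic estimate recovers the true value, while the proximity estimate is pulled toward neighboring phases; a gate learned per sample would let a model favor whichever guidance suits the input.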


A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Neural Information Processing Systems

Due to the size and complexity of modern large language models (LLMs), it has proven challenging to uncover the underlying mechanisms that models use to solve reasoning problems. For instance, is their reasoning for a specific problem localized to certain parts of the network? Do they break down the reasoning problem into modular components that are then executed as sequential steps as we go deeper in the model? To better understand the reasoning capability of LLMs, we study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. From a mechanistic interpretability point of view, we use causal mediation analysis to uncover the pathways and components of the LLMs' reasoning processes. Then, we offer fine-grained insights into the functions of attention heads in different layers. We not only find a sparse circuit that computes the answer, but also decompose it into sub-circuits that have four distinct and modular uses. Finally, we reveal that three distinct models (Mistral-7B, Gemma-2-9B, and Gemma-2-27B) contain analogous but not identical mechanisms.


Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

Neural Information Processing Systems

A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer, a general-purpose diffusion model without any molecular inductive bias, for latent generation. Extensive experiments on the GEOM-Drugs and QM9 datasets demonstrate that our method establishes new benchmarks in both de novo and conditional 3D molecule generation, achieving leading efficiency and quality. On GEOM-Drugs, it reduces FCD by 72.6% over the previous best result, while achieving over 70% relative average improvements in geometric fidelity. Our code is released at https://github.com/lyc0930/UAE-3D/.


Robust Explanations of Graph Neural Networks via Graph Curvatures

Neural Information Processing Systems

Explaining graph neural networks (GNNs) is a key approach to improving the trustworthiness of GNNs in high-stakes applications, such as finance and healthcare. However, existing methods are vulnerable to perturbations, raising concerns about explanation reliability. Prior methods enhance explanation robustness using model retraining or explanation ensembles, each with certain weaknesses. Retraining yields models that differ from the original target model and can therefore produce misleading explanations, while ensembles can produce contradictory results due to different inputs or models. To improve explanation robustness without the above weaknesses, we take an unexplored route and exploit two edge-geometry properties, curvature and resistance, to enhance explanation robustness. We are the first to prove that these geometric notions can be used to bound explanation robustness. We design a general optimization algorithm to incorporate these geometric properties into a wide spectrum of base GNN explanation methods to enhance the robustness of base explanations. We empirically show that our method outperforms six base explanation methods in robustness across nine datasets spanning node classification, link prediction, and graph classification tasks, improving fidelity in 80% of the cases and achieving up to a 10% relative improvement in robust performance.
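For readers unfamiliar with the two edge properties, the sketch below computes them on a small graph. It assumes a connected, unweighted, undirected graph, uses the standard effective resistance from the Laplacian pseudoinverse, and uses the simplest combinatorial Forman curvature (4 minus the two endpoint degrees). The abstract does not say which curvature notion the paper adopts, so the choice here is an assumption made for illustration.

```python
import numpy as np


def effective_resistance(adj):
    """All-pairs effective resistance R[u, v] = L+[u, u] + L+[v, v] - 2 L+[u, v].

    Assumes `adj` is the symmetric adjacency matrix of a connected graph.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lap_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(lap_pinv)
    return diag[:, None] + diag[None, :] - 2.0 * lap_pinv


def forman_curvature(adj, u, v):
    """Simplest Forman curvature of an unweighted edge: 4 - deg(u) - deg(v)."""
    adj = np.asarray(adj)
    if adj[u, v] == 0:
        raise ValueError("(u, v) is not an edge")
    return 4 - int(adj[u].sum()) - int(adj[v].sum())


# Path graph 0 - 1 - 2: two unit resistors in series.
path = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
resistance = effective_resistance(path)
curvature_01 = forman_curvature(path, 0, 1)
```

On the path graph, adjacent nodes are one unit of resistance apart and the two endpoints are two units apart, matching the series-resistor intuition. Edges with high resistance or strongly negative curvature tend to be structural bottlenecks.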


MMCSBench: A Fine-Grained Benchmark for Large Vision-Language Models in Camouflage Scenes

Neural Information Processing Systems

Current camouflaged object detection methods predominantly follow discriminative segmentation paradigms and heavily rely on predefined categories present in the training data, limiting their generalization to unseen or emerging camouflage objects. This limitation is further compounded by the labor-intensive and time-consuming nature of collecting camouflage imagery. Although Large Vision-Language Models (LVLMs) show potential to alleviate such issues with their powerful generative capabilities, their understanding of camouflage scenes is still insufficient. To bridge this gap, we introduce MMCSBench, the first comprehensive multimodal benchmark designed to evaluate and advance LVLM capabilities in camouflage scenes. MMCSBench comprises 22,537 images and 76,843 corresponding image-text pairs across five fine-grained camouflage tasks. Additionally, we propose a new task, Camouflage Efficacy Assessment (CEA), aimed at quantitatively evaluating the camouflage effectiveness of objects in images and enabling automated collection of camouflage images from large-scale databases. Extensive experiments on 26 LVLMs reveal significant shortcomings in models' ability to perceive and interpret camouflage scenes. These findings highlight the fundamental differences between natural and camouflaged visual inputs, offering insights for future research in advancing LVLM capabilities within this challenging domain.