MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Neural Information Processing Systems

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information, as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images.


AdaTS: Adaptive Time Series Representation Learning through Dynamic Contrasts

Neural Information Processing Systems

Learning robust representations from unlabeled time series is crucial, and contrastive learning offers a promising avenue. However, existing contrastive learning approaches for time series often struggle to define meaningful similarities, tending to overlook inherent physical correlations and diverse, sequence-varying non-stationarity. This limits their representational quality and real-world adaptability. To address these limitations, we introduce AdaTS, a novel adaptive soft contrastive learning strategy. AdaTS offers a computationally efficient solution centered on dynamic instance-wise and temporal assignments that enhance time series representations by: (i) leveraging Time-Frequency Coherence to provide robust, physics-guided similarity measurements; (ii) preserving relative instance similarities through ordinal consistency learning; and (iii) adapting to sequence-specific non-stationarity with dynamic temporal assignments. AdaTS is designed as a pluggable module for standard contrastive frameworks, achieving accuracy improvements of up to 13.7% across diverse time series datasets and three state-of-the-art contrastive frameworks while enhancing robustness under label scarcity.


Let a Neural Network Be Your Invariant

Neural Information Processing Systems

Safety verification ensures that a system avoids undesired behaviour. Liveness complements safety, ensuring that the system also achieves its desired objectives. A complete specification of functional correctness must combine both safety and liveness. Proving with mathematical certainty that a system satisfies a safety property demands presenting an appropriate inductive invariant of the system, whereas proving liveness requires showing a measure of progress witnessed by a ranking function. Neural model checking has recently introduced a data-driven approach to the formal verification of reactive systems, albeit focusing on ranking functions and thus addressing liveness properties only.


Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Neural Information Processing Systems

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process, involving a reasoner that proposes visual actions and a verifier--trained via multi-step Direct Preference Optimization (DPO)--that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs. Code and datasets are publicly released at https://vts-v.github.io/.


Differentiable Structure Learning and Causal Discovery for General Binary Data

Neural Information Processing Systems

Existing methods for differentiable structure learning in discrete data typically assume that the data are generated from specific structural equation models. However, these assumptions may not align with the true data-generating process, which limits the general applicability of such methods. Furthermore, current approaches often ignore the complex dependence structure inherent in discrete data and consider only linear effects. We propose a differentiable structure learning framework that is capable of capturing arbitrary dependencies among discrete variables. We show that although general discrete models are unidentifiable from purely observational data, it is possible to characterize the complete set of compatible parameters and structures. Additionally, we establish identifiability up to Markov equivalence under mild assumptions. We formulate the learning problem as a single differentiable optimization task in the most general form, thereby avoiding the unrealistic simplifications adopted by previous methods. Empirical results demonstrate that our approach effectively captures complex relationships in discrete data.


MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Neural Information Processing Systems

Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre-vs.


Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

Neural Information Processing Systems

Existing imitation learning methods decouple perception and action, which overlooks the causal reciprocity between sensory representation and action execution that humans naturally leverage for adaptive behaviors. To bridge this gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified representation learning framework that explicitly models a dynamic interplay between perception and action through probabilistic latent dynamics. DP-AG encodes latent observations into a Gaussian posterior via variational inference and evolves them using an action-guided SDE, where the Vector-Jacobian Product (VJP) of the diffusion policy's noise predictions serves as a structured stochastic force driving latent updates. To promote bidirectional learning between perception and action, we introduce a cycle-consistent contrastive loss that organizes the gradient flow of the noise predictor into a coherent perception-action loop, enforcing mutually consistent transitions in both latent updates and action refinements. Theoretically, we derive a variational lower bound for the action-guided SDE, and prove that the contrastive objective enhances continuity in both latent and action trajectories. Empirically, DP-AG significantly outperforms state-of-the-art methods across simulation benchmarks and real-world UR5 manipulation tasks. DP-AG thus offers a promising step toward bridging biological adaptability and artificial policy learning.


Advancing Interpretability of CLIP Representations with Concept Surrogate Model

Neural Information Processing Systems

Contrastive Language-Image Pre-training (CLIP) generates versatile multimodal embeddings for diverse applications, yet the specific information captured within these representations is not fully understood. Current explainability techniques often target specific tasks, overlooking the rich, general semantics inherent in the representations. Our objective is to reveal the concepts encoded in CLIP embeddings by learning a surrogate representation, which is expressed as a linear combination of human-understandable concepts evident in the image. Our method, which we term EXPLAIN-R, introduces a novel approach that leverages CLIP's learned instance-instance similarity to train a surrogate model that faithfully mimics CLIP's behavior. From the trained surrogate, we derive concept scores for each input image; these scores quantify the contribution of each concept and act as the explanation for the representation. Quantitative evaluations on multiple datasets demonstrate our method's superior faithfulness over the baseline. Moreover, a user study confirms that our explanations are perceived as more relevant, complete, and useful. Our work provides a novel approach for interpreting CLIP image representations, enhancing the user interpretability of representations and fostering more trustworthy AI systems.


EFFIBENCH-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Neural Information Processing Systems

Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EFFIBENCH-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EFFIBENCH-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EFFIBENCH-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations.


Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Neural Information Processing Systems

Although Multimodal Large Language Models (MLLMs) show promising results on general zero-shot image classification tasks, fine-grained image classification remains challenging. It demands precise attention to subtle visual details to distinguish between visually similar subcategories--details that MLLMs may easily overlook without explicit guidance. To address this, we introduce AutoSEP, an iterative self-supervised prompt learning framework designed to enhance MLLM fine-grained classification capabilities in a fully unsupervised manner. Our core idea is to leverage unlabeled data to learn a description prompt that guides MLLMs in identifying crucial discriminative features within an image and thereby boosts classification accuracy. AutoSEP, an automatic self-enhancing prompt learning framework, iteratively improves the description prompt using unlabeled data, based on an instance-level classification scoring function. AutoSEP requires only black-box access to MLLMs, eliminating the need for any training or fine-tuning. We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP improves on average by 13% over standard zero-shot classification and by 3% over the best-performing baselines.