Goto

Collaborating Authors

 Country


OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Neural Information Processing Systems

Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The "Online" aspect emphasizes the need to process and reason over incrementally acquired observations, while the "Spatio-Temporal" component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OSTBench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows.


Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

Neural Information Processing Systems

Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for uncertainty in class prevalences and domain-specific cost asymmetries. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.




10b7e27c8eb9571fbbd2ae6a9f8c3855-Paper-Conference.pdf

Neural Information Processing Systems

While class of methods generati e v xist e models for aligning - with flo human w matching preferences, models existing - a popular approaches and eff f ecti ail v to e achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The finetuned key idea velocity behind field this and algorithm the pretrained is that one the should optimal be matched difference with between the gradient the field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-toimage matching flow models matching under model, limited Stable computational Diffusion 3, b that udgets our while method achie can ving finetune effecti flo v w e and prior-preserving alignment.


COS3D: Collaborative Open-Vocabulary 3DSegmentation

Neural Information Processing Systems

Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that contributes to effectively integrating complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and designing an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential to various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics.


Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLMCaptioning

Neural Information Processing Systems

Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR)1, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a marginaware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4 speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in stronger unseen models. To further validate this, we adapt ViMaR to steer generation in both LLaVAOneVision-Qwen2-7B and Qwen2.5-VL-3B,


Watch: Protesters clash with police ahead of G7 summit in Geneva

BBC News

Protesters clashed with police forces during a demonstration against the upcoming G7 summit in Geneva. Tear gas and a water cannon were deployed to disperse the large crowd after protesters smashed windows and set a car on fire. What needs to be understood is the message, the basic message regarding all these countries that oppress us through money and power, said one protester who was disappointed to see the protest turn violent. The G7 summit starts on 15 June in ร‰vian-les-Bains and will bring together the leaders of Britain, France, Canada, Germany, Italy, Japan, the United States and the European Union. Pope Leo XIV says Barcelona's iconic Sagrada Famรญlia is a masterpiece of stones, colours and light during his visit to Spain.


U-REPA: Aligning Diffusion U-Nets to ViTs

Neural Information Processing Systems

Representation Alignment (REPA) that aligns Diffusion Transformer (DiT) hiddenstates with ViT visual encoders has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture that shows faster convergence compared to DiTs. However, adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To encounter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: Firstly, we propose via observation that due to skip connection, the middle stage of U-Net is the best alignment option. Secondly, we propose upsampling of U-Net features after passing them through MLPs. Thirdly, we observe difficulty when performing tokenwise similarity alignment, and further introduces a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA could achieve excellent generation quality and greatly accelerates the convergence speed. With CFG guidance interval, U-REPA could reach FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256 256, and needs only half the total epochs to perform better than REPA under sd-vae-ft-ema.


ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model

Neural Information Processing Systems

Generative modeling of non-negative, discrete data, such as symbolic music, remains challenging due to two persistent limitations in existing methods. First, most approaches rely on modeling continuous embeddings, which are not wellsuited for inherently discrete data distributions.