Goto

Collaborating Authors

 consistency


OmniZoom: AUniversal Plug-and-Play Paradigm for Cross-Device Smooth Zoom Interpolation

Neural Information Processing Systems

Dual-camera smartphones suffer from geometric and photometric inconsistencies during zoom transitions, primarily due to disparities in intrinsic/extrinsic parameters and divergent image processing pipelines between the two cameras. Existing interpolation methods struggle to effectively address this issue, constrained by the lack of ground-truth datasets and motion ambiguity in dynamic scenarios. To overcome these challenges, we propose OmniZoom, a universal plug-and-play paradigm for cross-device smooth zoom interpolation. Specifically, we present a novel cross-device virtual data generation method utilizing 3DGaussian Splatting. This method tackles data scarcity by decoupling geometric features via spatial transition modeling and correcting photometric variations with dynamic color adaptation. It is further enhanced by cross-domain consistency learning for device-agnostic semantic alignment.


GauSAM: Contour-Guided 2DGaussian Fields for Multi-Scale Medical Image Segmentation with Segment Anything

Neural Information Processing Systems

Effective multiscale medical image segmentation requires simultaneously preserving smooth spatial continuity and accurately delineating high-frequency boundaries, yet pixel-wise decoders often fail to maintain this balance consistently across varying resolutions. We introduce GauSAM, which seamlessly integrates contour-guided 2DGaussian probability fields into the Segment Anything Model to address these challenges. In our framework, segmentation masks are parameterized as continuous probability fields of learnable 2DGaussian primitives, enforcing spatially smooth and structurally consistent. Contourlet transforms extract rich multidirectional frequency information, notably edges and fine textures, which dynamically guide the spatial distribution of Gaussian primitives to substantially improve boundary fidelity in complex structures.


Martian World Model: Controllable Video Synthesis with Physically Accurate 3DReconstructions

Neural Information Processing Systems

Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic of high-quality simulation. Martian Howe data ver, and this the task significant poses unique domain challenges gap between due to Martian the scarcity and terrestrial composed imagery of two k .


Solving Inverse Problems with FLAIR

Neural Information Processing Systems

Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach hasnot yet led to comparable fidelity. There are several key obstacles: (i) the datalikelihood term is usually intractable; (ii) learned generative models cannot be directly conditioned on the distorted observations, leading to conflicting objectives between data likelihood and prior; and (iii) the reconstructions can deviate from theobserved data.


Pro3D-Editor: AProgressive-Views Perspective for Consistent and Precise 3DEditing

Neural Information Processing Systems

T gions, ext-guided which 3D has editing significant aims potential to precisely for edit various semantically practical applications relevant local ranging 3D refrom 3D games to film production. Existing methods typically follow a viewindiscriminate paradigm: editing 2D views indiscriminately and projecting them back dencies, into resulting 3D space. in Ho inconsistent wever, the multi-vie y overlook w editing.


FaCT Faithful Concept Traces for Explaining Neural Network Decisions

Neural Information Processing Systems

Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as classspecificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C2-score, that can be used to evaluate concept-based methods. Compared to prior work, we show that our concepts are quantitatively more consistent and that users find them to be more interpretable, while retaining competitive ImageNet performance. 1


RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering

Neural Information Processing Systems

In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models.


FHGS: Feature-Homogenized Gaussian Splatting

Neural Information Processing Systems

Scene understanding based on 3DGaussian Splatting (3DGS) has recently achieved notable advances. Although 3DGS related methods have efficient rendering capabilities, they fail to address the inherent contradiction between the anisotropic color representation of gaussian primitives and the isotropic requirements of semantic features, leading to insufficient cross-view feature consistency. To overcome the limitation, we proposes FHGS (Feature-Homogenized Gaussian Splatting), a novel 3D feature distillation framework inspired by physical models, which freezes and distills 2D pre-trained features into 3D representations while preserving the realtime rendering efficiency of 3DGS. Specifically, our FHGS introduces the following innovations: Firstly, a universal feature fusion architecture is proposed, enabling robust embedding of large-scale pre-trained models' semantic features (e.g., SAM, CLIP) into sparse 3D structures. Secondly, a non-differentiable feature fusion mechanism is introduced, which enables semantic features to exhibit viewpoint independent isotropic distributions.



Motion4D: Learning 3D-Consistent Motion and Semantics for 4DScene Understanding

Neural Information Processing Systems

Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments.