Vision
Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection
Serialization-based methods, which serialize 3D voxels and group them into multiple sequences before feeding them to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences inevitably sacrifices voxel spatial proximity. This issue is hard to address by enlarging the group size in existing serialization-based methods, due to the quadratic complexity of Transformers with respect to feature size. Inspired by recent advances in state space models (SSMs), we present a Voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs enables this group-free design, alleviating the loss of spatial proximity among voxels. To further enhance spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field along the 1D serialization curve as well as more complete local regions in 3D space. Moreover, we implicitly apply window partitioning within the group-free framework through positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
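Editor's note: a minimal sketch of the group-free serialization idea, not the paper's code. Voxel Mamba orders voxels along a space-filling curve; a simpler Morton (Z-order) curve is used here for illustration, and all array names and sizes are hypothetical.

```python
# Group-free serialization: order the WHOLE voxel grid as ONE 1D sequence
# via a space-filling curve (Morton/Z-order here), with no window grouping.
import numpy as np

def morton_key(x: np.ndarray, y: np.ndarray, z: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the bits of (x, y, z) into a single Z-order key per voxel."""
    key = np.zeros_like(x, dtype=np.uint64)
    for b in range(bits):
        key |= ((x >> b) & 1).astype(np.uint64) << np.uint64(3 * b)
        key |= ((y >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 1)
        key |= ((z >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 2)
    return key

def serialize_voxels(coords: np.ndarray, feats: np.ndarray):
    """Sort all voxels into one curve-ordered sequence for a linear-time SSM."""
    keys = morton_key(coords[:, 0], coords[:, 1], coords[:, 2])
    order = np.argsort(keys)  # 3D neighbors tend to stay close in the 1D order
    return coords[order], feats[order]

coords = np.random.randint(0, 1024, size=(5, 3))
feats = np.random.randn(5, 16)
print(serialize_voxels(coords, feats)[0])
```

Because the SSM scales linearly in sequence length, the entire scene can be processed as this single sequence, which is exactly what quadratic attention makes impractical.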
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Large Multimodal Models (LMMs) are a hot research topic in computer vision and have demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to language-oriented formats. This adaptation enables the convenient development of such LMMs with minimal modifications; however, it overlooks the inductive biases within diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, which decouples the learning of perception capabilities into task-agnostic and task-specific stages. First, Lumen promotes fine-grained vision-language concept alignment, the fundamental capability for various visual tasks; the output of the task-agnostic stage is thus a shared representation for all vision-centric tasks we address in this paper. Afterward, task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training effort. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model not only achieves or surpasses the performance of existing LMM-based approaches on a range of vision-centric tasks, but also maintains general visual understanding and instruction-following capabilities.
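Editor's note: a minimal sketch of the decoupled two-stage design described above; every name, task, and dimension here is hypothetical, not Lumen's actual API.

```python
# Stage 1 produces one task-agnostic shared representation; stage 2 routes it
# to a lightweight per-task decoder head (plain linear maps for illustration).
import numpy as np

rng = np.random.default_rng(0)
D = 256  # hypothetical width of the shared representation

decoders = {
    "detection":    rng.standard_normal((D, 4)),  # box offsets
    "segmentation": rng.standard_normal((D, 1)),  # mask logit per location
    "grounding":    rng.standard_normal((D, 2)),  # point coordinates
}

def decode(shared_repr: np.ndarray, task: str) -> np.ndarray:
    """Route the same shared representation to the requested task head."""
    return shared_repr @ decoders[task]

shared = rng.standard_normal((100, D))  # stage-1 output for 100 image locations
print(decode(shared, "detection").shape)  # (100, 4)
```

The point of the split is that the expensive alignment is learned once, while each head stays small enough to add with negligible training effort.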
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
Recent studies have demonstrated the exceptional potentials of leveraging human preference datasets to refine text-to-image generative models, making it to generate more human-preferred images. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hinder further exploration. To address these challenges, we first leverage multimodal large language models to create VisionPrefer, a fine-grained preference dataset that captures multiple preference aspects (prompt-following, aesthetic, fidelity, and harmlessness). Then we train a corresponding reward model, VP-Score, over VisionPrefer to guide the tuning of text-to-image generative models. The preference prediction accuracy of VP-Score is validated to be comparable to that of human annotators.
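Editor's note: the paper uses VP-Score to tune the generator; the sketch below shows a simpler, widely used way a learned reward model can steer generation, best-of-n selection, purely for illustration. The scoring function is a hypothetical stand-in, not VP-Score itself.

```python
# Best-of-n selection with a reward model: sample several generations and
# keep the one the reward model prefers.
import numpy as np

def reward(image_feat: np.ndarray, prompt_feat: np.ndarray) -> float:
    # Hypothetical stand-in: a real reward model like VP-Score is a trained network.
    return float(image_feat @ prompt_feat)

def best_of_n(candidates: list, prompt_feat: np.ndarray) -> np.ndarray:
    scores = [reward(c, prompt_feat) for c in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(1)
prompt = rng.standard_normal(64)
images = [rng.standard_normal(64) for _ in range(8)]  # 8 sampled generations
best = best_of_n(images, prompt)
```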
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization in MLLMs, recent advances primarily focus on improving the LLM components while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model that consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE, which leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves an efficient solution to the tug-of-war problem and can perform six different medical tasks: question answering, visual question answering, report generation, referring expression comprehension, referring expression generation, and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector in MLLMs.
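Editor's note: a minimal numpy sketch of a connector mixture-of-experts, assuming soft routing over projection experts; sizes, the number of experts, and the gating scheme are illustrative guesses, not Uni-Med's actual configuration.

```python
# A router softly combines several projection experts that map visual features
# into the LLM embedding space, letting tasks share or specialize projections
# instead of fighting over a single connector.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
V, L, E = 256, 512, 4                         # visual dim, LLM dim, experts
experts = rng.standard_normal((E, V, L)) * 0.01
router_w = rng.standard_normal((V, E)) * 0.01

def cmoe_connector(visual_tokens: np.ndarray) -> np.ndarray:
    """visual_tokens: (T, V) -> LLM-space tokens (T, L)."""
    gates = softmax(visual_tokens @ router_w)                     # (T, E)
    projected = np.einsum("tv,evl->tel", visual_tokens, experts)  # (T, E, L)
    return np.einsum("te,tel->tl", gates, projected)              # (T, L)

tokens = rng.standard_normal((32, V))
print(cmoe_connector(tokens).shape)  # (32, 512)
```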
Segment Any Change
Visual foundation models have achieved remarkable results in zero-shot image classification and segmentation, but zero-shot change detection remains an open problem. In this paper, we propose the segment any change models (AnyChange), a new type of change detection model that supports zero-shot prediction and generalization on unseen change types and data distributions. AnyChange is built on the segment anything model (SAM) via our training-free adaptation method, bitemporal latent matching.
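Editor's note: a simplified sketch of the bitemporal latent matching idea, assuming per-mask SAM-style embeddings that are already spatially aligned across the two timestamps; the threshold and shapes are hypothetical, and the real method's matching is more involved.

```python
# Training-free change proposal: compare latent embeddings of the same region
# at times t1 and t2 with cosine similarity; low similarity flags a change.
import numpy as np

def propose_changes(emb_t1: np.ndarray, emb_t2: np.ndarray, thresh: float = 0.5):
    """emb_t1/emb_t2: per-mask embeddings (N, D), aligned mask-for-mask.
    Returns indices of masks whose bitemporal similarity drops below thresh."""
    a = emb_t1 / np.linalg.norm(emb_t1, axis=-1, keepdims=True)
    b = emb_t2 / np.linalg.norm(emb_t2, axis=-1, keepdims=True)
    sim = (a * b).sum(axis=-1)        # per-mask bitemporal cosine similarity
    return np.where(sim < thresh)[0]  # low similarity -> candidate change

rng = np.random.default_rng(0)
e1, e2 = rng.standard_normal((10, 256)), rng.standard_normal((10, 256))
print(propose_changes(e1, e2))
```

Because the comparison happens in the frozen foundation model's latent space, no change-detection training data is required, which is what makes the zero-shot setting possible.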
To date, Transformer-based frameworks have demonstrated impressive results in single-image super-resolution (SISR). However, in practical lightweight scenarios, the complex interaction between deep image feature extraction and similarity modeling limits the performance of these methods, since they require simultaneous layer-specific optimization of both tasks. In this work, we introduce a novel Unified Projection Sharing (UPS) algorithm to decouple feature extraction from similarity modeling. To achieve this, we establish a unified projection space, defined by a learnable projection matrix, for similarity calculation across all self-attention layers. As a result, deep image feature extraction remains a per-layer optimization, while similarity modeling is carried out by projecting these image features onto the shared projection space. Extensive experiments demonstrate that our proposed UPS achieves state-of-the-art performance relative to leading lightweight SISR methods on various popular benchmarks. Moreover, our unified optimized projection space exhibits encouraging robustness on unseen data (degraded and depth images). Finally, UPS also demonstrates promising results across various image restoration tasks, including real-world and classic SISR, image denoising, and image deblocking.
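Editor's note: a conceptual sketch of the shared-projection idea under assumed shapes; the single matrix P standing in for the unified projection space, and the attention form used here, are simplifications rather than the paper's exact architecture.

```python
# Every layer extracts its own features, but query/key similarity is computed
# in ONE shared projection space defined by a single learnable matrix P.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, P_DIM = 64, 32
P = rng.standard_normal((D, P_DIM)) * 0.1  # one shared projection for ALL layers

def shared_space_attention(x: np.ndarray, w_value: np.ndarray) -> np.ndarray:
    """x: (T, D) layer features; w_value: (D, D) per-layer value transform."""
    q = x @ P                                  # similarity lives in shared space
    attn = softmax(q @ q.T / np.sqrt(P_DIM))   # (T, T) similarity weights
    return attn @ (x @ w_value)                # per-layer feature extraction

x = rng.standard_normal((16, D))
for _ in range(3):                             # three layers reuse the same P
    x = shared_space_attention(x, rng.standard_normal((D, D)) * 0.1)
```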
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color, or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework that can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features.
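Editor's note: a minimal sketch of the decompose-then-map recipe; the contribution tensor, the linear map W, and the scoring rule are hypothetical stand-ins for the paper's learned components.

```python
# By linearity of the residual stream, the final representation equals the sum
# of per-component contributions; a linear map W takes each contribution into
# CLIP space, where it can be scored against a text embedding.
import numpy as np

rng = np.random.default_rng(0)
C, D, D_CLIP = 12, 768, 512
contributions = rng.standard_normal((C, D))  # one per head/MLP component
final_repr = contributions.sum(axis=0)       # decomposition identity
W = rng.standard_normal((D, D_CLIP)) * 0.02  # stand-in ViT -> CLIP linear map

def score_components(text_emb: np.ndarray) -> np.ndarray:
    """Rank components by alignment of their mapped contribution with a text
    feature (e.g. the embedding of the word 'texture')."""
    mapped = contributions @ W               # (C, D_CLIP)
    mapped /= np.linalg.norm(mapped, axis=-1, keepdims=True)
    return mapped @ (text_emb / np.linalg.norm(text_emb))

text = rng.standard_normal(D_CLIP)           # stand-in CLIP text embedding
print(np.argsort(-score_components(text))[:3])  # top-3 components for the text
```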
3D Gaussian Splatting as Markov Chain Monte Carlo
While 3D Gaussian Splatting has recently become popular for neural rendering, current methods rely on carefully engineered cloning and splitting strategies for placing Gaussians, which can lead to poor-quality renderings and a reliance on good initialization. In this work, we rethink the set of 3D Gaussians as a random sample drawn from an underlying probability distribution describing the physical representation of the scene; in other words, as Markov Chain Monte Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian updates can be converted into Stochastic Gradient Langevin Dynamics (SGLD) updates by simply introducing noise. We then rewrite the densification and pruning strategies in 3D Gaussian Splatting as simply a deterministic state transition of MCMC samples, removing these heuristics from the framework. To do so, we revise the 'cloning' of Gaussians into a relocalization scheme that approximately preserves sample probability. To encourage efficient use of Gaussians, we introduce a regularizer that promotes the removal of unused Gaussians. On various standard evaluation scenes, we show that our method provides improved rendering quality, easy control over the number of Gaussians, and robustness to initialization.
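Editor's note: a minimal sketch of the standard SGLD update that underlies the reinterpretation above; the learning rate, the parameterization, and the gradient are made-up placeholders, not the paper's renderer.

```python
# SGLD turns a gradient step into an MCMC-style sampling step by adding
# Gaussian noise: theta <- theta - lr * grad + sqrt(2 * lr) * eps, eps ~ N(0, I).
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(params: np.ndarray, grad: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    """One Stochastic Gradient Langevin Dynamics update."""
    noise = rng.standard_normal(params.shape) * np.sqrt(2.0 * lr)
    return params - lr * grad + noise

gaussians = rng.standard_normal((1000, 3))   # e.g. the 3D means of the Gaussians
grad = rng.standard_normal(gaussians.shape)  # stand-in for the rendering-loss grad
gaussians = sgld_step(gaussians, grad)
```

The only difference from plain gradient descent is the injected noise term, which is what lets the set of Gaussians be read as samples from a scene distribution rather than a single optimized point estimate.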
Multi-hypotheses Conditioned Point Cloud Diffusion for 3D Human Reconstruction from Occluded Images
Parametric models such as SMPL(-X), which are based on statistics across human shapes, can represent whole human body shapes but are limited to minimally-clothed human shapes. Implicit-function-based methods extract features from the parametric models to exploit prior knowledge of human bodies, and can capture geometric details such as clothing and hair. However, they often struggle to handle misaligned parametric models and to inpaint occluded regions given a single RGB image.
FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images
Figure 1: Results of facial parts swapping using the proposed FuseAnyPart at 512 × 512 resolution. The swapped face (central image) is generated by fusing the original face (top-left image) with three facial part reference images (bottom-left, top-right, bottom-right). Notably, FuseAnyPart can seamlessly blend facial parts from multiple reference images with significant differences in appearance, producing high-fidelity and natural-looking swapped faces.