AITopics | visual prompt

Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts

Neural Information Processing SystemsJun-23-2026, 10:12:51 GMT

Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.

artificial intelligence, machine learning, proceedings, (15 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Test-Time Adaptive Object Detection with Foundation Model

Neural Information Processing SystemsJun-19-2026, 11:03:28 GMT

In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based MeanTeacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM's high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data.

artificial intelligence, large language model, natural language, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts

Neural Information Processing SystemsJun-13-2026, 09:45:50 GMT

Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model's over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction.

artificial intelligence, knowledge management, machine learning, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.63)
Information Technology > Knowledge Management > Knowledge Engineering (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.60)
Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

Appendix Implementation Details

Neural Information Processing SystemsApr-25-2026, 05:11:15 GMT

A.1 Network Architectures We adopt Daformer [17] with Swin-B or MiT-B5 backbone as the base semantic segmentation architecture. For the segmentation head, we utilize the same head as Daformer [17]. The stem module contains one fully-convolutional layers with kernel 3 3 and stride of 2, two fully-convolutional layers with kernel 3 3 and stride of 1, two fully-convolutional layers with kernel 3 3 and stride of 2, and another three fully-convolutional layers with kernel 1 1 and stride of 1 to adjust channels of different feature maps. Level embedding module is defined as metrics with shape 3 dims. The prompt Interactor module contains three fully-convolutional layers with kernel 3 3 and stride of 2 to adjust feature dimensions.

artificial intelligence, machine learning, standard single source, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

157c30da6a988e1cbef2095f7b9521db-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 05:11:06 GMT

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Neural Information Processing SystemsMar-21-2026, 09:44:49 GMT

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.

artificial intelligence, name change, proceedings, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

Neural Information Processing SystemsMar-17-2026, 21:02:42 GMT

Existing promptable segmentation methods in the medical imaging field primarily consider either textual or visual prompts to segment relevant objects, yet they often fall short when addressing anomalies in medical images, like tumors, which may vary greatly in shape, size, and appearance. Recognizing the complexity of medical scenarios and the limitations of textual or visual prompts, we propose a novel dual-prompt schema that leverages the complementary strengths of visual and textual prompts for segmenting various organs and tumors. Specifically, we introduce $\textbf{\textit{CAT}}$, an innovative model that $\textbf{C}$oordinates $\textbf{A}$natomical prompts derived from 3D cropped images with $\textbf{T}$extual prompts enriched by medical domain knowledge. The model architecture adopts a general query-based design, where prompt queries facilitate segmentation queries for mask prediction. To synergize two types of prompts within a unified framework, we implement a ShareRefiner, which refines both segmentation and prompt queries while disentangling the two types of prompts. Trained on a consortium of 10 public CT datasets, $\textbf{\textit{CAT}}$ demonstrates superior performance in multiple segmentation tasks. Further validation on a specialized in-house dataset reveals the remarkable capacity of segmenting tumors across multiple cancer stages. This approach confirms that coordinating multimodal prompts is a promising avenue for addressing complex scenarios in the medical domain.

artificial intelligence, name change, proceedings, (8 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.59)

Technology: Information Technology > Artificial Intelligence (0.76)

Add feedback

Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents

Neural Information Processing SystemsFeb-16-2026, 12:58:11 GMT

Honguk Woo is the corresponding author.

domain factor, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Robots (0.96)
(2 more...)

Add feedback

a547d86953a4e36aa8a1390e6f4708e2-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 08:13:48 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(6 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

OMG-LLaV A: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Neural Information Processing SystemsFeb-16-2026, 07:00:00 GMT

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Filters

Collaborating Authors

visual prompt

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts

Test-Time Adaptive Object Detection with Foundation Model

Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts

Appendix Implementation Details

157c30da6a988e1cbef2095f7b9521db-Paper-Conference.pdf

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents

a547d86953a4e36aa8a1390e6f4708e2-Paper-Conference.pdf

OMG-LLaV A: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding