Image Processing


Patronus: Bringing Transparency to Diffusion Models with Prototypes

arXiv.org Artificial Intelligence

Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning the generation process on their prototype activation vector. This design enhances interpretability by exposing the learned prototypes and how they influence the generation process. The model also supports downstream tasks such as image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus can reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at https://github.com/nina-weng/patronus.
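The following is a minimal sketch of the conditioning idea in PyTorch: a ProtoPNet-style encoder produces a prototype activation vector, and a standard DDPM noise-prediction loss passes that vector to the denoiser as a conditioning input. The `cond` argument of the U-Net, the layer sizes, and the shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeEncoder(nn.Module):
    """Maps an image to a prototype activation vector (ProtoPNet-style)."""
    def __init__(self, backbone: nn.Module, n_prototypes: int, proto_dim: int):
        super().__init__()
        self.backbone = backbone                      # CNN producing (B, proto_dim, H, W)
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, proto_dim))

    def forward(self, x):
        feats = self.backbone(x)                      # (B, D, H, W)
        B, D, H, W = feats.shape
        patches = feats.permute(0, 2, 3, 1).reshape(B, H * W, D)
        # Distance of every patch to every prototype; max over patches gives
        # the strongest activation of each prototype in the image.
        dists = torch.cdist(patches, self.prototypes.unsqueeze(0).expand(B, -1, -1))
        sims = -dists
        return sims.max(dim=1).values                 # (B, n_prototypes)

def ddpm_loss(unet, proto_enc, x0, t, alpha_bars):
    """Standard DDPM objective, with the denoiser conditioned on prototype activations."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    p = proto_enc(x0)                                 # prototype activation vector
    pred = unet(x_t, t, cond=p)                       # hypothetical conditional U-Net
    return F.mse_loss(pred, noise)
```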


LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

arXiv.org Artificial Intelligence

Text-guided image editing aims to modify specific regions of an image according to natural language instructions while preserving the overall structure and background fidelity. Existing methods identify the target regions for modification using masks derived from the cross-attention maps of diffusion models. However, because cross-attention mechanisms focus on semantic relevance, they struggle to preserve image integrity; as a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach that uses self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while the surrounding structure is retained. LOCATEdit consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks.
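As a hedged illustration of the graph-based idea, the sketch below smooths a per-token cross-attention map over a patch graph whose affinities come from self-attention, by solving a standard graph-Laplacian regularized linear system. The regularization weight and the exact formulation are assumptions for illustration, not the paper's optimization.

```python
import torch

def laplacian_smooth_cross_attention(cross_attn, self_attn, lam=10.0):
    """
    cross_attn: (N,) relevance of each image patch to the edited word
    self_attn:  (N, N) patch-to-patch affinities from the diffusion U-Net
    Returns a spatially coherent mask by solving (I + lam * L) m* = m.
    """
    W = 0.5 * (self_attn + self_attn.transpose(0, 1))   # symmetric, nonnegative affinities
    D = torch.diag(W.sum(dim=1))
    L = D - W                                            # combinatorial graph Laplacian
    N = cross_attn.shape[0]
    A = torch.eye(N, device=cross_attn.device, dtype=cross_attn.dtype) + lam * L
    m_smooth = torch.linalg.solve(A, cross_attn)
    return m_smooth.clamp(0, 1)
```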


Enhancing DeepLabV3+ to Fuse Aerial and Satellite Images for Semantic Segmentation

arXiv.org Artificial Intelligence

Aerial and satellite imagery are inherently complementary remote sensing sources, offering high-resolution detail alongside expansive spatial coverage. However, using these sources for land cover segmentation introduces several challenges, prompting the development of a variety of segmentation methods. Among these approaches, the DeepLabV3+ architecture is considered a promising approach for single-source image segmentation. Despite its reliable segmentation results, there is still a need to increase its robustness and improve its performance. This is particularly crucial for multimodal image segmentation, where the fusion of diverse types of information is essential. An interesting approach involves enhancing this architectural framework by integrating novel components and modifying certain internal processes. In this paper, we enhance the DeepLabV3+ architecture by introducing a new block of transposed convolutional layers that upsamples a second input and fuses it with the high-level features. This block is designed to amplify and integrate information from satellite images, thereby enriching the segmentation process through fusion with aerial images. For experiments, we used the LandCover.ai (Land Cover from Aerial Imagery) dataset for aerial images, alongside the corresponding dataset sourced from Sentinel-2 data. Through the fusion of both sources, we achieved a mean Intersection over Union (mIoU) of 84.91% without data augmentation.
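A minimal sketch of such a fusion block, assuming PyTorch and illustrative channel sizes: satellite features are upsampled with transposed convolutions and concatenated with the high-level aerial features before a fusion convolution. The module name and number of stages are hypothetical.

```python
import torch
import torch.nn as nn

class TransposedUpsampleFusion(nn.Module):
    """Upsample satellite features and fuse them with DeepLabV3+ high-level aerial features."""
    def __init__(self, sat_channels=64, aerial_channels=256, out_channels=256):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(sat_channels, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(128 + aerial_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )

    def forward(self, sat_feat, aerial_high_level_feat):
        up = self.upsample(sat_feat)
        # Match the spatial size of the aerial high-level features before concatenation.
        up = nn.functional.interpolate(up, size=aerial_high_level_feat.shape[-2:],
                                       mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([up, aerial_high_level_feat], dim=1))
```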


On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection

Neural Information Processing Systems

Successful detection of Out-of-Distribution (OoD) data is becoming increasingly important to ensure the safe deployment of neural networks. One of the main challenges in OoD detection is that neural networks output overconfident predictions on OoD data, making it difficult to determine the OoD-ness of data solely from their predictions. Outlier exposure addresses this issue by introducing an additional loss that encourages low-confidence predictions on OoD data during training. While outlier exposure has shown promising potential in improving OoD detection performance, all previous studies on outlier exposure have been limited to visual outliers.
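For reference, a hedged sketch of the standard outlier exposure objective mentioned above: the usual cross-entropy on in-distribution samples plus a term pushing predictions on outliers toward the uniform distribution (low confidence). The weighting `lam` is an illustrative hyperparameter.

```python
import torch.nn.functional as F

def outlier_exposure_loss(logits_id, labels_id, logits_ood, lam=0.5):
    """Cross-entropy on in-distribution data plus a uniform-target term on outliers."""
    ce_id = F.cross_entropy(logits_id, labels_id)
    log_probs_ood = F.log_softmax(logits_ood, dim=1)
    # KL(uniform || p) up to a constant: the mean negative log-probability over classes.
    uniform_term = -log_probs_ood.mean(dim=1).mean()
    return ce_id + lam * uniform_term
```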



CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

Neural Information Processing Systems

Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES and aim to Comprehensively evAluate the tRustworthinESs of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions, including trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code at https://cares-ai.github.io/. WARNING: This paper contains model outputs that may be considered offensive.
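As an illustration of how such a fairness check might be scored on closed-ended items, the sketch below computes per-group accuracy and the gap between the best- and worst-served demographic groups; the record schema is hypothetical, not CARES' actual format.

```python
from collections import defaultdict

def fairness_gap(records):
    """records: iterable of dicts with 'group', 'pred', and 'answer' keys (illustrative schema)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["pred"] == r["answer"])
    acc = {g: correct[g] / total[g] for g in total}
    # Gap between the best- and worst-served demographic groups.
    return acc, max(acc.values()) - min(acc.values())
```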


IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

Neural Information Processing Systems

Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which has inspired the development of numerous context modeling paradigms, e.g., multi-scale-driven and similarity-driven context schemes. Despite their impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to their reliance on large amounts of predetermined priors. To alleviate these issues, we propose a novel Intervention-Driven Relation Network (IDRNet), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure models the relations of these semantic-level representations by perceiving the network outputs, and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations augment the original pixel-level representations for the final predictions. Extensive experiments validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes.
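A hedged sketch of the grouping step described above, assuming PyTorch: pixel-level features are average-pooled per pseudo-label class to form one semantic-level representation per class present in the batch. Tensor shapes and the pooling choice are illustrative assumptions, not IDRNet's exact implementation.

```python
import torch

def group_to_semantic_representations(pixel_feats, pseudo_labels, num_classes):
    """
    pixel_feats:   (B, C, H, W) features from the segmentation backbone
    pseudo_labels: (B, H, W) argmax of a preliminary/auxiliary prediction
    Returns the stacked semantic-level representations and the classes present.
    """
    B, C, H, W = pixel_feats.shape
    feats = pixel_feats.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    labels = pseudo_labels.reshape(-1)                        # (B*H*W,)
    sem_reprs, present = [], []
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            sem_reprs.append(feats[mask].mean(dim=0))         # masked average pooling
            present.append(k)
    return torch.stack(sem_reprs), present                    # (K_present, C), list of class ids
```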




Towards Robust Blind Face Restoration with Codebook Lookup Transformer

Neural Information Processing Systems

Blind face restoration is a highly ill-posed problem that often requires auxiliary guidance to 1) improve the mapping from degraded inputs to desired outputs, or 2) complement high-quality details lost in the inputs. In this paper, we demonstrate that a learned discrete codebook prior in a small proxy space largely reduces the uncertainty and ambiguity of the restoration mapping by casting blind face restoration as a code prediction task, while providing rich visual atoms for generating high-quality faces. Under this paradigm, we propose a Transformer-based prediction network, named CodeFormer, to model the global composition and context of low-quality faces for code prediction, enabling the discovery of natural faces that closely approximate the target faces even when the inputs are severely degraded. To enhance the adaptiveness to different degradations, we also propose a controllable feature transformation module that allows a flexible trade-off between fidelity and quality. Thanks to the expressive codebook prior and global modeling, CodeFormer outperforms the state of the art in both quality and fidelity, showing superior robustness to degradation. Extensive experimental results on synthetic and real-world datasets verify the effectiveness of our method.
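As a rough illustration of a controllable feature transformation of this kind, the sketch below lets encoder (fidelity) features predict an affine modulation of decoder (quality) features, scaled by a user-set weight w; layer shapes and the exact modulation are assumptions, not CodeFormer's exact module.

```python
import torch.nn as nn

class ControllableFeatureTransform(nn.Module):
    """Blend degraded-input information into decoder features with a tunable strength w."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, decoder_feat, encoder_feat, w=0.5):
        scale = self.to_scale(encoder_feat)
        shift = self.to_shift(encoder_feat)
        # w = 0: rely purely on the codebook/decoder path (quality);
        # w = 1: inject the most degraded-input information (fidelity).
        return decoder_feat + w * (scale * decoder_feat + shift)
```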