ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection

Ma, Ke, Long, Jun, Fei, Hongxiao, Hua, Liujie, Dai, Zhen, Luo, Yueyi

Oct-13-2025–arXiv.org Artificial Intelligence

ABSTRACT

Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.

Index Terms: anomaly detection, multimodal feature fusion, vision-language model, transfer learning, PEFT

1. INTRODUCTION

Zero-Shot Anomaly Detection (ZSAD) adapts Vision-Language Models (VLMs) [1, 2] like CLIP [3] to circumvent the extensive training data required by traditional methods [4, 5].
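
The two components named in the abstract, Conv-LoRA and the Dynamic Fusion Gateway, can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the listing gives no architectural details, so the depthwise 3x3 convolution in the low-rank bottleneck, the mean-pooled visual context, the sigmoid gate, and every name (`conv_lora_delta`, `dynamic_fusion_gate`, `w_down`, `w_up`, `kernel`, `w_gate`) are hypothetical choices made for the example.

```python
import numpy as np


def conv_lora_delta(tokens, grid_hw, w_down, w_up, kernel):
    """Low-rank update with a depthwise 3x3 convolution in the bottleneck.

    ASSUMPTION: this layout (project down, convolve over the patch grid,
    project up) is one plausible reading of "Conv-LoRA", not the paper's spec.

    tokens: (N, D) patch tokens with N == H * W
    w_down: (D, r) down-projection, w_up: (r, D) up-projection
    kernel: (r, 3, 3) one 3x3 filter per low-rank channel
    """
    h, w = grid_hw
    r = w_down.shape[1]
    z = (tokens @ w_down).reshape(h, w, r)          # low-rank features on the grid
    padded = np.pad(z, ((1, 1), (1, 1), (0, 0)))    # zero-pad the spatial borders
    out = np.zeros_like(z)
    for dy in range(3):
        for dx in range(3):
            # Each shifted window is weighted per channel (depthwise filtering).
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[:, dy, dx]
    return out.reshape(h * w, r) @ w_up             # (N, D) update for the frozen layer


def dynamic_fusion_gate(text_emb, patch_tokens, w_gate):
    """Modulate text embeddings with a gated summary of the visual tokens.

    ASSUMPTION: mean pooling and a sigmoid gate are illustrative stand-ins
    for "leveraging visual context to adaptively modulate text prompts".

    text_emb: (T, D), patch_tokens: (N, D), w_gate: (D, D)
    """
    context = patch_tokens.mean(axis=0)                  # (D,) global visual context
    gate = 1.0 / (1.0 + np.exp(-(context @ w_gate)))     # (D,) values in (0, 1)
    return text_emb + gate * context                     # gated residual injection
```

With the usual LoRA initialisation, where the up-projection starts at zero, `conv_lora_delta` returns zeros, so the adapted model initially behaves like the frozen backbone. A kernel that is 1 at its centre and 0 elsewhere reduces the update to a plain LoRA product, which makes clear that in this sketch the only source of the local inductive bias is the off-centre kernel weights.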

Country:
- Asia > China (0.14)

Genre:
- Research Report (0.82)

Industry:
- Health & Medicine > Diagnostic Medicine (0.47)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Anomaly Detection (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning (1.00)
