Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning

Zhang, Yi, Cheng, Chun-Wun, He, Junyi, Yu, Ke, Tang, Yushun, Schönlieb, Carola-Bibiane, He, Zhihai, Aviles-Rivero, Angelica I.

arXiv.org Artificial Intelligence 

Abstract--Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. T o address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T -DHA). We characterize vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincar e ball model, achieving significantly improved representation and discrimination power . Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T -DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks. ARGE Vision-Language Models (VLMs), such as CLIP [1] and ALIGN [2], are trained on extensive image-text datasets using contrastive learning. These models excel in creating a unified vision-language embedding space by aligning visual and textual modalities, enabling their successful application across a wide range of downstream visual tasks, such as few-shot image recognition [3]-[5].

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found