Empowering Visible-Infrared Person Re-Identification with Large Foundation Models
Bin Yang
Neural Information Processing Systems
Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal retrieval task due to significant modality differences, which primarily result from the absence of color information in the infrared modality. The development of large foundation models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) motivates us to explore a feasible solution for empowering VI-ReID with off-the-shelf large foundation models. To this end, we propose a novel Text-enhanced VI-ReID framework driven by Large Foundation Models (TVI-LFM). The core idea is to enrich the representation of the infrared modality with textual descriptions automatically generated by VLMs. Specifically, we incorporate a pre-trained VLM to extract textual features from captions generated by a VLM and augmented by an LLM, and we incrementally fine-tune the text encoder to minimize the domain gap between the generated texts and the original visual modalities. Meanwhile, to enhance the infrared modality with the extracted textual representations, we leverage the modality alignment capabilities of VLMs together with VLM-generated feature-level filters.
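The text-enhanced fusion idea can be pictured with a minimal sketch. The gating module, tensor dimensions, and encoder placeholders below are illustrative assumptions about what a feature-level filter over a CLIP-style encoder pair could look like, not the paper's actual TVI-LFM implementation.

```python
# Minimal sketch of fusing VLM text features into the infrared branch.
# The module names, dimensions, and gating filter are assumptions for
# illustration; they do not reproduce the TVI-LFM architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEnhancedIRBranch(nn.Module):
    def __init__(self, image_encoder, text_encoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder  # VLM image tower (e.g., kept frozen)
        self.text_encoder = text_encoder    # VLM text tower, incrementally fine-tuned
        # Feature-level filter: gates, per dimension, how much textual
        # evidence (e.g., clothing color described in the caption) is
        # injected into the colorless infrared feature.
        self.filter = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, ir_images, caption_tokens):
        ir_feat = F.normalize(self.image_encoder(ir_images), dim=-1)
        txt_feat = F.normalize(self.text_encoder(caption_tokens), dim=-1)
        gate = self.filter(torch.cat([ir_feat, txt_feat], dim=-1))
        # Enrich the infrared representation with gated textual semantics.
        return F.normalize(ir_feat + gate * txt_feat, dim=-1)
```

Conditioning the sigmoid gate on both modalities lets the branch decide how much of the generated description to trust for each feature dimension, which matters because VLM captions of infrared images can be noisy.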