Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Cao, Min, Zhou, Xinyu, Jiang, Ding, Du, Bo, Ye, Mang, Zhang, Min
–arXiv.org Artificial Intelligence
Abstract--Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing the challenge of modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, and a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets.

The task is similar to the person re-identification task (Re-ID) [2], [3], [4], which involves identifying person images across cameras based on an image query. In contrast to the structured image query in Re-ID, the text query in TIPR takes the form of free, flexible natural language, making it more accessible and offering substantial application potential in public-safety domains. A key challenge in TIPR is the inherent modality gap between vision and language, driving research toward robust cross-modal alignment.
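The bidirectional masked prediction described in the abstract builds on BERT-style token masking. The sketch below shows only that masking step in plain Python; the mask token id, masking ratio, and function name are illustrative assumptions, not the paper's actual configuration, and the cross-modal decoder that reconstructs the masked tokens from the paired image is omitted.

```python
import random

MASK_ID = 103      # assumed [MASK] token id (illustrative)
MASK_RATE = 0.15   # assumed masking ratio, as in BERT-style MLM

def mask_tokens(token_ids, mask_rate=MASK_RATE, seed=None):
    """Randomly replace a fraction of tokens with [MASK].

    Returns (masked_ids, targets): targets holds the original id at
    masked positions and -1 (ignore) elsewhere, so a reconstruction
    loss supervises only the masked tokens.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tid in token_ids:
        if rng.random() < mask_rate:
            masked.append(MASK_ID)
            targets.append(tid)   # supervise this position
        else:
            masked.append(tid)
            targets.append(-1)    # ignored by the loss
    return masked, targets
```

A symmetric routine would mask image patch embeddings instead of token ids, giving the "bidirectional" image-and-text prediction the abstract refers to.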
Global alignment methods align text-image representations at the coarse-grained level via cross-modal matching loss functions (Figure 1(a)), while local alignment methods establish fine-grained associations between textual entities and image body parts (Figure 1(b)). Despite notable progress in this task, two critical issues remain to be addressed.
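A common instance of such a cross-modal matching loss is a symmetric InfoNCE-style contrastive objective over global embeddings. The NumPy sketch below is a minimal illustration under that assumption; the function name, temperature value, and embedding shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def global_matching_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style image-text matching loss.

    img_emb, txt_emb: (N, D) arrays of paired global embeddings,
    where row i of each array forms a matching image-text pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix

    def cross_entropy(l):
        # Numerically stable log-softmax; targets are the diagonal pairs.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The loss pulls each matched image-text pair together and pushes apart the in-batch mismatches, which is exactly the coarse-grained alignment that overlooks fine-grained part-level differences.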
Oct-21-2025