Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
arXiv.org Artificial Intelligence
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
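The core idea described above, drawing product metadata into the image before it reaches the vision encoder, can be sketched in a few lines. This is a minimal illustration using Pillow, not the authors' implementation: the helper name `render_text_on_image`, the white banner placed below the photo, the default font, and the wrapping width are all assumptions made here for the example, and the product title is a made-up placeholder.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont


def render_text_on_image(image, text, banner_height=40, chars_per_line=40):
    """Return a copy of `image` with `text` drawn on a white banner below it.

    Hypothetical helper: placement, font, and sizes are illustrative choices,
    not settings taken from the paper.
    """
    font = ImageFont.load_default()  # assumption: any legible font would do
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    line_height = 12  # rough line spacing for the default font
    banner_height = max(banner_height, line_height * len(lines) + 8)

    # New canvas: original image on top, white strip for the text underneath.
    canvas = Image.new("RGB", (image.width, image.height + banner_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        y = image.height + 4 + i * line_height
        draw.text((4, y), line, fill="black", font=font)
    return canvas


# Stand-in for a real product photo and title (both placeholders).
product = Image.new("RGB", (224, 224), "gray")
augmented = render_text_on_image(product, "Example sneaker title, size 10")
```

In a retrieval pipeline, the augmented image would then be embedded by the vision model (for example, a CLIP image encoder) in place of the original photo, so that the rendered text contributes to the image embedding.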
Nov-10-2025
- Country:
  - Asia
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - South Korea > Seoul
      - Seoul (0.06)
    - Taiwan > Taiwan Province
      - Taipei (0.04)
  - Europe > Switzerland (0.04)
  - North America > United States
    - District of Columbia > Washington (0.04)
    - New York > New York County
      - New York City (0.05)
    - Texas > Travis County
      - Austin (0.04)
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
- Genre:
  - Research Report > New Finding (0.86)
- Industry:
  - Information Technology > Services > e-Commerce Services (1.00)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.68)
    - Natural Language
      - Large Language Model (0.89)
      - Text Processing (0.93)
    - Representation & Reasoning (1.00)
    - Vision (1.00)