MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

Sen Xing, Muyan Zhong, Zeqiang Lai, Liangchen Li, Jiawen Liu, Yaohui Wang, Jifeng Dai, Wenhai Wang

Dec-2-2024, arXiv.org Artificial Intelligence

In this work, we explore a cost-effective framework for multilingual image generation. We find that, compared with tuning models on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly improves data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters that is trained alongside a frozen text encoder and a frozen image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency: using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance: it achieves comparable generation capabilities in over 110 languages, with CLIP similarity scores nearly matching those in English (38.61 for English vs. 37.61 for other languages); and (3) Broad applicability: it integrates seamlessly with compatible community tools such as LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
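To make the two quantitative claims in the abstract concrete, the short sketch below does the bookkeeping in plain Python: how small the trainable share is when only the adapter is updated, and how large the reported CLIP score gap is in relative terms. Only the 20M adapter bound and the two CLIP scores come from the abstract. The component names, the `Component` class, and the parameter counts for the frozen text encoder and diffusion model are hypothetical placeholders chosen for illustration, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One part of the pipeline and whether it receives gradient updates."""
    name: str
    num_params: int
    trainable: bool


def trainable_params(components):
    """Sum the parameters of the components that are not frozen."""
    return sum(c.num_params for c in components if c.trainable)


def relative_gap(reference, other):
    """Relative shortfall of `other` with respect to `reference`."""
    return (reference - other) / reference


# ASSUMPTION: the encoder and diffusion model sizes are illustrative
# placeholders, not values reported in the paper. The adapter entry uses
# the abstract's upper bound ("fewer than 20M parameters").
components = [
    Component("multilingual_text_encoder", 500_000_000, trainable=False),
    Component("language_adapter", 20_000_000, trainable=True),
    Component("diffusion_model", 1_000_000_000, trainable=False),
]

trainable = trainable_params(components)
total = sum(c.num_params for c in components)
trainable_share = trainable / total

# CLIP similarity scores quoted in the abstract.
clip_english = 38.61
clip_other = 37.61
clip_gap = clip_english - clip_other
clip_rel_gap = relative_gap(clip_english, clip_other)

print(f"trainable parameters: {trainable:,} of {total:,} ({trainable_share:.2%})")
print(f"CLIP gap: {clip_gap:.2f} points ({clip_rel_gap:.2%} relative to English)")
```

Under these placeholder sizes the adapter is a small fraction of the full pipeline, and the reported 1.00-point CLIP difference works out to roughly 2.6% of the English score.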

adapter, text encoder, translation

Country:
- Europe > France
  - Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia > China
  - Shanghai > Shanghai (0.04)
  - Hong Kong (0.04)
  - Beijing > Beijing (0.04)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.97)
