Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights
Wen, Xin, Zhao, Bingchen, Chen, Yilun, Pang, Jiangmiao, Qi, Xiaojuan
arXiv.org Artificial Intelligence
Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks.
Jun-14-2024
- Country:
- Africa > Rwanda
- Asia
- China
- India (0.04)
- Middle East > Israel
- Tel Aviv District > Tel Aviv (0.04)
- South Korea > Seoul
- Seoul (0.04)
- Europe
- Austria > Vienna (0.14)
- Germany > Bavaria
- Upper Bavaria > Munich (0.04)
- Poland (0.04)
- Spain
- Andalusia > Granada Province
- Granada (0.04)
- Valencian Community > Valencia Province
- Valencia (0.04)
- Switzerland > Zürich
- Zürich (0.14)
- North America
- Canada
- British Columbia > Metro Vancouver Regional District
- Vancouver (0.14)
- Ontario > Toronto (0.14)
- United States
- California
- Los Angeles County > Long Beach (0.04)
- San Francisco County > San Francisco (0.14)
- North Carolina > Durham County
- Durham (0.04)
- Washington > King County
- Seattle (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Louisiana > Orleans Parish
- New Orleans (0.05)
- Rhode Island > Providence County
- Providence (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Maryland > Baltimore (0.04)
- Ohio > Franklin County
- Columbus (0.04)
- Nevada > Clark County
- Las Vegas (0.04)
- Oceania > Australia
- New South Wales > Sydney (0.04)
- Victoria > Melbourne (0.04)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Education (0.67)
- Technology: