Embedding And Clustering Your Data Can Improve Contrastive Pretraining
arXiv.org Artificial Intelligence
Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
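The stratification idea described above (embed the training pairs, cluster them with k-means, and draw each minibatch from a single cluster) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `kmeans` and `cluster_batches`, the toy two-dimensional embeddings, and the choices of `k`, `batch_size`, and dropping incomplete batches are all assumptions made for the example. In practice the embeddings would come from a pretrained text embedding model, and clustering would be run separately within each data source.

```python
import random
from collections import defaultdict


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def cluster_batches(pairs, embeddings, k, batch_size, seed=0):
    """Group (query, passage) pairs into minibatches drawn from one cluster each."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pair, cluster_id in zip(pairs, kmeans(embeddings, k, seed=seed)):
        by_cluster[cluster_id].append(pair)
    batches = []
    for members in by_cluster.values():
        rng.shuffle(members)
        # Assumption: drop the incomplete tail so every batch has the same size.
        for start in range(0, len(members) - batch_size + 1, batch_size):
            batches.append(members[start:start + batch_size])
    rng.shuffle(batches)  # mix clusters across training steps
    return batches


# Toy data: two well-separated groups standing in for two semantic clusters.
pairs = [(f"q{i}", f"p{i}") for i in range(8)]
embeddings = [[0.01 * i, 0.0] for i in range(4)] + [
    [10.0 + 0.01 * i, 10.0] for i in range(4)
]
batches = cluster_batches(pairs, embeddings, k=2, batch_size=2)
```

Because every batch comes from one cluster, the in-batch negatives for each query are semantically close to it, which is the link the abstract draws to topic-aware sampling and nearest-neighbor hard-negative mining.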
Jul-26-2024
- Genre:
- Research Report (1.00)