TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter
Zhang, Xinyang; Malkov, Yury; Florez, Omar; Park, Serim; McWilliams, Brian; Han, Jiawei; El-Kishky, Ahmed
arXiv.org Artificial Intelligence
Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and their pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter and trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvements over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
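The abstract states that training combines text-based self-supervision with a social objective, but it does not give the form of either loss. The sketch below is therefore only an illustration under stated assumptions: masked-token cross-entropy stands in for the text objective, and an in-batch contrastive loss over pairs of tweets assumed to share engagement stands in for the social objective. The function names, the `temperature` and `social_weight` parameters, and the weighted-sum combination are all hypothetical and are not taken from the paper.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def mlm_loss(logits, labels, ignore_index=-100):
    # Assumed text objective: mean cross-entropy over masked positions only.
    # logits: (num_tokens, vocab_size); labels: (num_tokens,), where
    # ignore_index marks positions that were not masked.
    mask = labels != ignore_index
    logp = log_softmax(logits[mask], axis=-1)
    return -logp[np.arange(int(mask.sum())), labels[mask]].mean()


def social_contrastive_loss(anchor, positive, temperature=0.1):
    # Assumed social objective: row i of `anchor` and row i of `positive`
    # are embeddings of two tweets treated as socially similar; every
    # other row in the batch serves as a negative.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    sims = (a @ p.T) / temperature          # (batch, batch) cosine similarities
    logp = log_softmax(sims, axis=1)
    return -np.diag(logp).mean()            # matching pairs lie on the diagonal


def joint_loss(mlm, social, social_weight=0.1):
    # Hypothetical combination: a weighted sum of the two objectives.
    return mlm + social_weight * social
```

With this formulation, a batch in which each tweet embedding is closest to its paired tweet yields a lower social loss than the same batch with the pairs shuffled, which is the behaviour a socially informed objective is meant to encourage.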
Aug-26-2023
- Country:
  - North America
    - United States
      - District of Columbia > Washington (0.04)
      - Texas > Harris County
        - Houston (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Louisiana > Orleans Parish
        - New Orleans (0.04)
      - Illinois > Champaign County
        - Urbana (0.04)
      - California > San Francisco County
        - San Francisco (0.14)
    - Canada > British Columbia
  - Europe
    - United Kingdom > England
      - Greater London > London (0.04)
    - Italy
      - Tuscany > Florence (0.04)
      - Calabria > Catanzaro Province
        - Catanzaro (0.04)
    - Ireland > Leinster
      - County Dublin > Dublin (0.04)
  - Asia > South Korea
  - Africa > Ethiopia
    - Addis Ababa > Addis Ababa (0.04)
- Genre:
  - Research Report (0.50)
- Industry:
  - Information Technology > Services (0.68)
- Technology: