Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies

Wu, Zhengxuan, Tamkin, Alex, Papadimitriou, Isabel

Jan-23-2024–arXiv.org Artificial Intelligence

When we transfer a pretrained language model to a new language, many axes of variation change at once. To disentangle the impact of factors such as syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of cross-lingual variation at a time, and then measure the resulting drops in a pretrained model's downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into the factors of cross-lingual transfer that researchers should focus on most when designing language transfer scenarios.
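
To make the idea of altering one axis at a time concrete, the following is a minimal sketch of three toy transformations: a word-order change standing in for a syntactic-style shift, a fixed permutation of token ids standing in for vocabulary misalignment, and a fresh random embedding table standing in for embedding matrix re-initialization. The abstract does not specify the exact transformations, so the function names, the choice of word reversal, the Gaussian initialization, and its standard deviation are all illustrative assumptions, not the authors' implementation.

```python
import random


def reverse_word_order(words):
    """Hypothetical syntactic-style shift: reverse the words of a sentence."""
    return words[::-1]


def permute_vocabulary(token_ids, vocab_size, seed=0):
    """Hypothetical vocabulary misalignment: send every token id through
    one fixed random permutation of the vocabulary."""
    rng = random.Random(seed)
    mapping = list(range(vocab_size))
    rng.shuffle(mapping)  # mapping[old_id] -> new_id, a bijection
    return [mapping[t] for t in token_ids], mapping


def reinitialize_embeddings(vocab_size, dim, std=0.02, seed=0):
    """Hypothetical embedding re-initialization: draw a new table from a
    zero-mean Gaussian (std is an assumed value, not from the source)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


if __name__ == "__main__":
    sentence = ["the", "movie", "was", "great"]
    print(reverse_word_order(sentence))

    ids = [3, 1, 4, 1]
    new_ids, mapping = permute_vocabulary(ids, vocab_size=8, seed=1)
    print(new_ids)

    table = reinitialize_embeddings(vocab_size=8, dim=4)
    print(len(table), len(table[0]))
```

Each function changes a single property of the input and leaves the others intact. The permutation, for instance, preserves which positions hold the same token while breaking the link between token ids and any pretrained embedding rows.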

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - California > Santa Clara County
      - Palo Alto (0.04)
- Europe
  - Ireland (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)

Genre:
- Research Report > Experimental Study (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Machine Translation (0.69)
  - Machine Learning > Transfer Learning (0.42)