
Collaborating Authors

 Steinert-Threlkeld, Shane


Minimization of Boolean Complexity in In-Context Concept Learning

arXiv.org Artificial Intelligence

What factors contribute to the relative success and corresponding difficulties of in-context learning for Large Language Models (LLMs)? Drawing on insights from the literature on human concept learning, we test LLMs on carefully designed concept learning tasks and show that task performance correlates strongly with the Boolean complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity, much as human learning does.
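As a rough illustration of the complexity measure involved, the sketch below computes the Boolean complexity of a concept as the number of literals in a minimal DNF description (in the spirit of Feldman, 2000), using sympy's Quine-McCluskey minimizer; the paper's exact measure and task construction may differ.

```python
# Sketch: Boolean complexity as the literal count of a minimal DNF covering
# the positive examples. Illustrative only; not necessarily the paper's
# exact operationalization.
from sympy import Not, Symbol, symbols
from sympy.logic import SOPform

def count_literals(expr):
    """Count literal occurrences in a Boolean expression."""
    if isinstance(expr, (Symbol, Not)):
        return 1
    return sum(count_literals(arg) for arg in expr.args)

def boolean_complexity(positive_examples, n_features):
    """positive_examples: iterable of 0/1 tuples of length n_features."""
    feats = symbols(f"x0:{n_features}")
    minimal = SOPform(feats, [list(ex) for ex in positive_examples])
    return count_literals(minimal)

# XOR over two features minimizes to (x0 & ~x1) | (~x0 & x1): complexity 4.
print(boolean_complexity([(0, 1), (1, 0)], 2))
# A concept that depends on a single feature out of three: complexity 1.
print(boolean_complexity([(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)], 3))
```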


Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence

arXiv.org Artificial Intelligence

This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
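To make the filtering idea concrete, here is a minimal sketch of the corpus-filtering step: drop every sentence containing a target construction before LM training, then probe generalization on held-out examples of that construction. The regex-based filter and the reflexive-pronoun example are illustrative stand-ins, not the paper's actual linguistically informed filters.

```python
# Sketch of the Filtered Corpus Training idea: remove every sentence that
# contains a target construction, train an LM on the remainder, and then
# evaluate the LM on minimal pairs involving that construction.
import re

# Illustrative target: sentences containing the reflexive "themselves".
TARGET = re.compile(r"\bthemselves\b", re.IGNORECASE)

def filter_corpus(sentences):
    """Yield only sentences that do NOT contain the target construction."""
    for sent in sentences:
        if not TARGET.search(sent):
            yield sent

corpus = [
    "The senators praised themselves.",
    "The senator praised herself.",
    "The author that the critics admired left.",
]
print(list(filter_corpus(corpus)))  # keeps the 2nd and 3rd sentences
```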


Targeted Multilingual Adaptation for Low-resource Language Families

arXiv.org Artificial Intelligence

The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.


The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation

arXiv.org Artificial Intelligence

Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show, however, that even a crude semantic signal (similar lexical fields across languages) does improve the alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from analytically guaranteed. Instead, it is further evidence that the world's languages share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.
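For readers unfamiliar with the training procedure, the following sketch shows one round of on-the-fly back-translation: each monolingual batch is translated with the current model, and the synthetic pair is used to train the reverse direction. The model interface (translate / train_step) and the toy model are assumptions made for illustration, not an existing library API.

```python
# Sketch of one round of on-the-fly back-translation. `model` is assumed to
# expose translate(batch, src, tgt) and train_step(src_batch, tgt_batch,
# src, tgt); these names are illustrative.

def backtranslation_round(model, mono_l1, mono_l2, l1="L1", l2="L2"):
    # Translate L1 monolingual data into L2, train the L2 -> L1 direction.
    synthetic_l2 = model.translate(mono_l1, src=l1, tgt=l2)
    model.train_step(synthetic_l2, mono_l1, src=l2, tgt=l1)
    # Symmetrically for the other language.
    synthetic_l1 = model.translate(mono_l2, src=l2, tgt=l1)
    model.train_step(synthetic_l1, mono_l2, src=l1, tgt=l2)

class IdentityToyModel:
    """Stand-in model so the sketch runs end to end."""
    def translate(self, batch, src, tgt):
        return [f"<{tgt}> {s}" for s in batch]
    def train_step(self, src_batch, tgt_batch, src, tgt):
        print(f"training {src}->{tgt} on {len(src_batch)} synthetic pairs")

backtranslation_round(IdentityToyModel(), ["aab", "ab"], ["ba", "bba"])
```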


Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

arXiv.org Artificial Intelligence

For languages other than English and a handful of other very high-resource languages, multilingual language models form the backbone of most current NLP systems. These models address the relative data scarcity in most non-English languages by pooling text data across many languages to train a single model that (in theory) covers all training languages (Devlin, 2019; Conneau and Lample, 2019; Conneau et al., 2020; Liu et al., 2020; Scao et al., 2023, i.a.). ... Additionally, the information-theoretic tokenization modules for cross-lingual pre-trained models are usually under-optimized for any given language, and especially low-resource languages (Ács, 2019; Conneau and Lample, 2019, i.a.). For this reason, we propose several simple techniques to replace the large cross-lingual vocabulary of a pre-trained model with a compact, language-specific one during model specialization. Training a new SentencePiece or BPE tokenizer poses no ...
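One of the simpler (re-)initialization schemes in this space can be sketched as follows: given a compact target-language vocabulary (in practice obtained by training a new SentencePiece or BPE tokenizer), copy the embedding rows of tokens shared with the old multilingual vocabulary and back off to the mean old embedding for the rest. The paper compares several such schemes; the token lists and dimensions below are toy placeholders rather than its actual setup.

```python
# Sketch of one simple embedding re-initialization scheme for a replaced,
# compact vocabulary. Toy data throughout; in practice old_embeddings and
# old_vocab would come from the pre-trained model (e.g. XLM-R).
import torch

def reinit_embeddings(old_embeddings, old_vocab, new_tokens):
    """old_embeddings: (V_old, d) tensor; old_vocab: token -> row index."""
    mean_vec = old_embeddings.mean(dim=0)            # fallback for new tokens
    rows = [old_embeddings[old_vocab[t]] if t in old_vocab else mean_vec
            for t in new_tokens]
    return torch.stack(rows)

old_vocab = {"▁kala": 0, "▁on": 1, "▁järv": 2, "▁ja": 3}
old_embeddings = torch.randn(len(old_vocab), 8)      # random, illustration only

new_tokens = ["▁kala", "▁järv", "▁järves", "▁on"]    # compact new vocabulary
new_embeddings = reinit_embeddings(old_embeddings, old_vocab, new_tokens)
print(new_embeddings.shape)                          # torch.Size([4, 8])
```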


Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages

arXiv.org Artificial Intelligence

Although Transformers perform well on NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications in modeling natural language, which is hypothesized to be mildly context-sensitive. We test Transformers' ability to learn mildly context-sensitive languages of varying complexity, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.
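For context, the kinds of formal languages used in such probes can be generated in a few lines; the sketch below produces a^n b^n c^n strings and a copy language ww, split into in-distribution and longer extrapolation lengths. The paper's exact language suite and sampling scheme may differ.

```python
# Sketch: generators for two classic test languages used in probing studies,
# a^n b^n c^n and the copy language ww. Illustrative data only.
import random

def anbncn(n):
    return "a" * n + "b" * n + "c" * n

def copy_language(length, alphabet="ab"):
    w = "".join(random.choice(alphabet) for _ in range(length))
    return w + w

train = [anbncn(n) for n in range(1, 51)]             # in-distribution lengths
extrapolation = [anbncn(n) for n in range(51, 101)]   # longer, unseen lengths
print(train[:3], extrapolation[0][:12], copy_language(4))
```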


Learning to translate by learning to communicate

arXiv.org Artificial Intelligence

We formulate and test a technique to use Emergent Communication (EC) with a pre-trained multilingual model to improve on modern Unsupervised NMT systems, especially for low-resource languages. It has been argued that the current dominant paradigm in NLP of pre-training on text-only corpora will not yield robust natural language understanding systems, and the need for grounded, goal-oriented, and interactive language learning has been highlighted. In our approach, we embed a multilingual model (mBART, Liu et al., 2020) into an EC image-reference game, in which the model is incentivized to use multilingual generations to accomplish a vision-grounded task. The hypothesis is that this will align multiple languages to a shared task space. We present two variants of EC Fine-Tuning (Steinert-Threlkeld et al., 2022), one of which outperforms a backtranslation-only baseline in all four languages investigated, including the low-resource language Nepali.
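The referential-game objective can be illustrated with a stripped-down sketch: a sender encodes the target image into a message, a receiver matches the message against the candidate images, and both are trained so that the target scores highest. In the paper the sender and receiver are built around mBART and the messages are discrete multilingual text; the tiny linear encoders and continuous messages below are simplifications for illustration only.

```python
# Minimal referential-game sketch in torch. Toy random "images" and linear
# encoders stand in for real image features and an mBART-based agent.
import torch
import torch.nn as nn

img_dim, msg_dim, n_candidates, batch = 128, 64, 4, 32
sender = nn.Linear(img_dim, msg_dim)     # stand-in for a text generator
receiver = nn.Linear(img_dim, msg_dim)   # stand-in for a text encoder
opt = torch.optim.Adam([*sender.parameters(), *receiver.parameters()], lr=1e-3)

for step in range(100):
    candidates = torch.randn(batch, n_candidates, img_dim)   # toy "images"
    target_idx = torch.randint(n_candidates, (batch,))
    target = candidates[torch.arange(batch), target_idx]

    message = sender(target)                                  # (batch, msg_dim)
    scores = torch.einsum("bm,bcm->bc", message, receiver(candidates))
    loss = nn.functional.cross_entropy(scores, target_idx)

    opt.zero_grad()
    loss.backward()
    opt.step()

print("final game loss:", loss.item())
```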


The Weighted Möbius Score: A Unified Framework for Feature Attribution

arXiv.org Artificial Intelligence

Feature attribution aims to explain the reasoning behind a black-box model's prediction by identifying the impact of each feature on the prediction. Recent work has extended feature attribution to interactions between multiple features. However, the lack of a unified framework has led to a proliferation of methods that are often not directly comparable. This paper introduces a parameterized attribution framework -- the Weighted Möbius Score -- and (i) shows that many different attribution methods for both individual features and feature interactions are special cases and (ii) identifies some new methods. By studying the vector space of attribution methods, our framework utilizes standard linear algebra tools and provides interpretations in various fields, including cooperative game theory and causal mediation analysis. We empirically demonstrate the framework's versatility and effectiveness by applying these attribution methods to feature interactions in sentiment analysis and chain-of-thought prompting.
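At the heart of the framework is the Möbius transform of a set function over feature subsets, m(S) = sum over T ⊆ S of (-1)^{|S| - |T|} v(T); different weightings of these coefficients yield different attribution methods. The sketch below computes the transform by brute force for a toy value function with a single pairwise interaction (exponential in the number of features, so only feasible for small n); the paper's specific weighting schemes are not reproduced here.

```python
# Sketch: brute-force Möbius transform of a set function v over subsets,
# m(S) = sum_{T ⊆ S} (-1)^{|S| - |T|} v(T). Toy value function only.
from itertools import chain, combinations

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def mobius_transform(v, features):
    """v: dict mapping frozenset of features -> value of that subset."""
    m = {}
    for S in map(frozenset, subsets(features)):
        m[S] = sum((-1) ** (len(S) - len(T)) * v[frozenset(T)]
                   for T in subsets(S))
    return m

# Toy value function: |S| plus an interaction bonus when 0 and 1 co-occur.
features = [0, 1, 2]
v = {frozenset(S): len(S) + (2.0 if {0, 1} <= set(S) else 0.0)
     for S in subsets(features)}
m = mobius_transform(v, features)
print(m[frozenset({0, 1})])   # 2.0: the pairwise interaction term
print(m[frozenset({0})])      # 1.0: the main effect of feature 0
```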


Paying Attention to Function Words

arXiv.org Artificial Intelligence

All natural languages exhibit a distinction between content words (like nouns and adjectives) and function words (like determiners, auxiliaries, prepositions). Yet surprisingly little has been said about the emergence of this universal architectural feature of natural languages. Why have human languages evolved to exhibit this division of labor between content and function words? How could such a distinction have emerged in the first place? This paper takes steps towards answering these questions by showing how the distinction can emerge through reinforcement learning in agents playing a signaling game across contexts which contain multiple objects that possess multiple perceptually salient gradable properties.
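A minimal version of the learning dynamics can be sketched with urn-style (Roth-Erev) reinforcement in a Lewis signaling game; the paper's setting is richer (multi-object contexts with gradable properties, from which function-word-like signals emerge), so the two-state toy below only illustrates the reinforcement rule.

```python
# Sketch: urn-style (Roth-Erev) reinforcement in a 2-state/2-signal/2-action
# Lewis signaling game. Toy setting, not the paper's multi-object contexts.
import random

n_states, n_signals, n_acts = 2, 2, 2
sender_urns = [[1.0] * n_signals for _ in range(n_states)]
receiver_urns = [[1.0] * n_acts for _ in range(n_signals)]

def draw(weights):
    return random.choices(range(len(weights)), weights=weights)[0]

for _ in range(5000):
    state = random.randrange(n_states)
    signal = draw(sender_urns[state])
    act = draw(receiver_urns[signal])
    if act == state:                       # success: reinforce both choices
        sender_urns[state][signal] += 1.0
        receiver_urns[signal][act] += 1.0

print("sender urns:", sender_urns)         # one signal dominates per state
```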


Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers

arXiv.org Artificial Intelligence

We study the role of linguistic context in predicting quantifiers (`few', `all'). We collect crowdsourced data from human participants and test various models in a local (single-sentence) and a global (multi-sentence) context condition. Models significantly outperform humans in the former setting and are only slightly better in the latter. While human performance improves with more linguistic context (especially on proportional quantifiers), model performance suffers. Models are very effective in exploiting lexical and morpho-syntactic patterns; humans are better at genuinely understanding the meaning of the (global) context.
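As a rough modern analogue of the task, one can probe a masked language model for quantifier predictions under a local (single-sentence) versus a global (multi-sentence) context; the sketch below does this with a generic pretrained model and made-up example text, and is not the paper's human-plus-model setup or its full quantifier inventory.

```python
# Sketch: cloze-style quantifier prediction in a local vs. global context
# condition, using a generic masked LM. Example sentences and the quantifier
# list are illustrative only.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")
local = "In the picture, <mask> of the children are wearing hats."
global_ctx = ("It is a cold winter day and the playground is full. "
              "In the picture, <mask> of the children are wearing hats.")

quantifiers = ["none", "few", "some", "most", "all"]
for name, text in [("local", local), ("global", global_ctx)]:
    preds = fill(text, targets=quantifiers)
    print(name, [(p["token_str"].strip(), round(p["score"], 3)) for p in preds])
```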