AITopics | Arnett, Catherine

Collaborating Authors

Arnett, Catherine

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

Arnett, Catherine, Chang, Tyler A., Michaelov, James A., Bergen, Benjamin K.

arXiv.org Artificial IntelligenceMar-5-2025

While crosslingual transfer is crucial to contemporary language models' multilingual capabilities, how it occurs is not well understood. In this paper, we ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.03962

Country:

Europe (0.93)
Asia (0.68)
North America > United States > Massachusetts (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Why do language models perform worse for morphologically complex languages?

Arnett, Catherine, Bergen, Benjamin K.

arXiv.org Artificial IntelligenceNov-21-2024

Language models perform differently across languages. It has been previously suggested that morphological typology may explain some of this variability (Cotterell et al., 2018). We replicate previous analyses and find additional new evidence for a performance gap between agglutinative and fusional languages, where fusional languages, such as English, tend to have better language modeling performance than morphologically more complex languages like Turkish. We then propose and test three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. To test the morphological alignment hypothesis, we present MorphScore, a tokenizer evaluation metric, and supporting datasets for 22 languages. We find some evidence that tokenization quality explains the performance gap, but none for the role of morphological alignment. Instead we find that the performance gap is most reduced when training datasets are of equivalent size across language types, but only when scaled according to the so-called "byte-premium" -- the different encoding efficiencies of different languages and orthographies. These results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology. Differences in performance can be attributed to disparities in dataset size. These results bear on ongoing efforts to improve performance for low-performing and under-resourced languages.

computational linguistic, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2411.14198

Country:

North America > United States (0.93)
Asia > Middle East (0.68)
Europe > Middle East > Malta (0.28)

Genre:

Research Report > Experimental Study (0.69)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Arnett, Catherine, Jones, Eliot, Yamshchikov, Ivan P., Langlais, Pierre-Carl

arXiv.org Artificial IntelligenceNov-18-2024

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight models creators. At the same time, there researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts which have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently at a larger scale. Finally, we describe the balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.22587

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.46)
North America > United States > New York (0.28)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Consumer Products & Services (0.67)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

Michaelov, James A., Arnett, Catherine, Bergen, Benjamin K.

arXiv.org Artificial IntelligenceApr-29-2024

Transformers have supplanted Recurrent Neural Networks as the dominant architecture for both natural language processing tasks and, despite criticisms of cognitive implausibility, for modelling the effect of predictability on online human language comprehension. However, two recently developed recurrent neural network architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match - and in some cases, exceed - performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.

language comprehension, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2404.19178

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California > San Diego County (0.14)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.67)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

Arnett, Catherine, Rivière, Pamela D., Chang, Tyler A., Trott, Sean

arXiv.org Artificial IntelligenceMar-20-2024

The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.

machine learning, natural language, tokenization, (16 more...)

arXiv.org Artificial Intelligence

2403.13754

Country:

Europe > Germany (0.14)
Europe > Belgium (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages

Arnett, Catherine, Chang, Tyler A., Bergen, Benjamin K.

arXiv.org Artificial IntelligenceMar-1-2024

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2403.00686

Country:

Europe > France (0.14)
North America > Canada (0.14)
Europe > Denmark (0.14)
Europe > Belgium (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

Add feedback

When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages

Chang, Tyler A., Arnett, Catherine, Tu, Zhuowen, Bergen, Benjamin K.

arXiv.org Artificial IntelligenceNov-15-2023

Multilingual language models are widely used to extend NLP systems to lowresource languages. However, concrete evidence for the effects of multilinguality on language modeling performance in individual languages remains scarce. Here, we pre-train over 10,000 monolingual and multilingual language models for over 250 languages, including multiple language families that are understudied in NLP. We assess how language modeling performance in each language varies as a function of (1) monolingual dataset size, (2) added multilingual dataset size, (3) linguistic similarity of the added languages, and (4) model size (up to 45M parameters). We find that in moderation, adding multilingual data improves low-resource language modeling performance, similar to increasing low-resource dataset sizes by up to 33%. Improvements depend on the syntactic similarity of the added multilingual data, with marginal additional effects of vocabulary overlap. However, high-resource languages consistently perform worse in multilingual pre-training scenarios. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can significantly improve performance. Multilingual language models have been a fixture of natural language processing (NLP) research nearly since the introduction of Transformer language models (Devlin et al., 2019; Conneau et al., 2020a). However, while multilingual language models produce strong results across many languages, multilingual pre-training work almost exclusively focuses on pre-training a small number of models with some fixed distribution over languages (e.g. Thus, it is largely unknown how different pre-training language distributions, such as different quantities of multilingual data or different selections of languages, affect multilingual language model performance. Fujinuma et al. (2022) vary the set of pre-training languages, but they consider only 14 variations of 14 languages, and they focus on cross-lingual transfer after English fine-tuning. Figure 1: Left: Map of the 252 languages used in our study.

computational linguistic, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2311.09205

Country:

Europe (0.93)
Asia (0.93)
North America > United States > California (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)

Add feedback

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Michaelov, James A., Arnett, Catherine, Chang, Tyler A., Bergen, Benjamin K.

arXiv.org Artificial IntelligenceNov-15-2023

Abstract grammatical knowledge - of parts of speech and grammatical patterns - is key to the capacity for linguistic generalization in humans. But how abstract is grammatical knowledge in large language models? In the human literature, compelling evidence for grammatical abstraction comes from structural priming. A sentence that shares the same grammatical structure as a preceding sentence is processed and produced more readily. Because confounds exist when using stimuli in a single language, evidence of abstraction is even more compelling from crosslingual structural priming, where use of a syntactic structure in one language primes an analogous structure in another language. We measure crosslingual structural priming in large language models, comparing model behavior to human experimental results from eight crosslingual experiments covering six languages, and four monolingual structural priming experiments in three non-English languages. We find evidence for abstract monolingual and crosslingual grammatical representations in the models that function similarly to those found in humans. These results demonstrate that grammatical representations in multilingual language models are not only similar across languages, but they can causally influence text produced in different languages.

active passive active passive, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2311.09194

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

Arnett, Catherine, Chang, Tyler A., Michaelov, James A., Bergen, Benjamin K.

arXiv.org Artificial IntelligenceOct-11-2023

Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.

artificial intelligence, bilingual language model, crosslingual structural priming, (1 more...)

arXiv.org Artificial Intelligence

2310.07929

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.80)
Information Technology > Artificial Intelligence > Machine Learning (0.60)

Add feedback