AITopics | Fudolig, Mikaela Irene

Fudolig, Mikaela Irene


Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Zimmerman, Julia Witte, Hudon, Denis, Cramer, Kathryn, Ruiz, Alejandro J., Beauregard, Calla, Fehr, Ashley, Fudolig, Mikaela Irene, Demarest, Bradford, Bird, Yoshi Meke, Trujillo, Milo Z., Danforth, Christopher M., Dodds, Peter Sheridan

arXiv.org Artificial Intelligence, Dec-24-2024

Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating sub-optimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being meaningfully insulated from the main system intelligence.
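The abstract above mentions exploring tokenizations from a BPE tokenizer. As a rough illustration of what byte-pair encoding does, here is a minimal sketch on a toy word list. It is not the authors' tokenizer or any library's implementation; the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the toy corpus are assumptions made for this example, and it works on characters rather than bytes.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Greedy objective: merge the most frequent pair, with no linguistic knowledge.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "lowest"]
    merges = learn_bpe_merges(corpus, num_merges=2)
    print(merges)
    print(tokenize("lowest", merges))
```

The merge rule depends only on pair frequency, which is the sense in which such tokenizers are linguistically agnostic: any human-meaningful units that appear among the tokens arise from corpus statistics alone.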

information, large language model, machine learning

arXiv: 2412.10924

Country:

Asia (1.00)
North America > United States > Vermont > Chittenden County (0.28)

Genre: Research Report (1.00)

Industry:

Education (0.67)
Energy > Oil & Gas (0.45)
Law (0.45)
Health & Medicine > Therapeutic Area (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

A decomposition of book structure through ousiometric fluctuations in cumulative word-time

Fudolig, Mikaela Irene, Alshaabi, Thayer, Cramer, Kathryn, Danforth, Christopher M., Dodds, Peter Sheridan

arXiv.org Artificial Intelligence, May-11-2023

While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.
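To make the idea of a score time series in cumulative word-time more concrete, the sketch below indexes a sliding-window mean score by the number of words read so far, and builds a shuffled-text baseline for comparison. It is only an illustration under stated assumptions: the lexicon and its values are invented, unknown words are scored 0.0, and a simple moving average stands in for the empirical mode decomposition described in the abstract. The names `score_series`, `shuffled_series`, and `spread` are hypothetical.

```python
import random


def score_series(words, lexicon, window):
    """Mean lexicon score over a sliding window of `window` words.

    Entry i is the mean over words i .. i + window - 1, so the series is
    indexed by word count rather than by fraction of the text completed.
    """
    # Assumption: words missing from the lexicon contribute a score of 0.0.
    scores = [lexicon.get(w, 0.0) for w in words]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    series = [total / window]
    for i in range(window, len(scores)):
        # Slide the window one word forward.
        total += scores[i] - scores[i - window]
        series.append(total / window)
    return series


def shuffled_series(words, lexicon, window, seed=0):
    """Same series computed on a shuffled copy of the text (baseline)."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return score_series(shuffled, lexicon, window)


def spread(series):
    """Range of the series, a crude measure of how much it fluctuates."""
    return max(series) - min(series) if series else 0.0


if __name__ == "__main__":
    # Invented toy lexicon and text, for illustration only.
    lexicon = {"calm": -1.0, "storm": 1.0}
    words = ["calm"] * 4 + ["storm"] * 4
    original = score_series(words, lexicon, window=4)
    baseline = shuffled_series(words, lexicon, window=4)
    print(original, spread(original))
    print(baseline, spread(baseline))
```

Structure that depends on word order shows up as a difference between the original and shuffled series; shuffling preserves the word counts but destroys the ordering.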

artificial intelligence, natural language, time series

doi: 10.1057/s41599-023-01680-4

arXiv: 2208.09496

Country:

Asia (0.67)
North America > United States > Vermont (0.28)

Genre: Research Report > New Finding (0.86)

Industry: Government (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
