
Multilingual Pretraining for Pixel Language Models

Kesen, Ilker, Lotz, Jonas F., Ziegler, Ingo, Rust, Phillip, Elliott, Desmond

arXiv.org Artificial Intelligence

Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
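As a concrete illustration of "operating directly on images of rendered text", the following Python sketch renders a string to a grayscale strip and slices it into square patches that play the role of tokens. The patch size, canvas width, and default font are illustrative assumptions, not the PIXEL-M4 configuration.

```python
# A minimal sketch of a pixel language model's input pipeline: no vocabulary,
# just rendered text cut into fixed-size patches.
from PIL import Image, ImageDraw
import numpy as np

def render_to_patches(text: str, patch: int = 16, width: int = 528) -> np.ndarray:
    """Render `text` on a white strip and cut it into (patch x patch) squares."""
    img = Image.new("L", (width, patch), color=255)   # grayscale canvas
    ImageDraw.Draw(img).text((0, 0), text, fill=0)    # default font here; real
                                                      # models load a broad Unicode font
    arr = np.asarray(img, dtype=np.float32) / 255.0
    # Each square patch plays the role of a "token" for the encoder.
    return arr.reshape(patch, width // patch, patch).transpose(1, 0, 2)

patches = render_to_patches("pixels instead of a vocabulary")
print(patches.shape)   # (33, 16, 16)
```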


How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization

Nguyen, Quan, Nguyen-Tang, Thanh

arXiv.org Artificial Intelligence

We study the approximation capabilities, convergence speeds, and on-convergence behaviors of transformers trained on in-context recall tasks -- tasks which require recognizing the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results focus only on the in-context reasoning behavior of transformers after \emph{one} gradient descent step. It remains unclear what the on-convergence behavior of transformers trained by gradient descent is, and how fast they converge. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with linear, ReLU, or softmax attention is provably Bayes-optimal for an in-context recall task. When such transformers are trained with gradient descent, we show via a finite-sample analysis that the expected loss converges at a linear rate to the Bayes risk. Moreover, we show that the trained transformers exhibit out-of-distribution (OOD) generalization, i.e., they generalize to samples outside the population distribution. Our theoretical findings are further supported by extensive empirical validation, showing that \emph{without} proper parameterization, models with larger expressive power surprisingly \emph{fail} to generalize OOD after being trained by gradient descent.
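The recall task is easy to instantiate. In the toy sketch below, the context lists (key, value) token pairs, the query repeats one of the keys, and a single softmax-attention readout recovers the paired value; the random embeddings and inverse-temperature sharpening stand in for trained parameters and are assumptions, not the paper's construction.

```python
# A toy in-context recall task solved by one softmax-attention readout.
import numpy as np

rng = np.random.default_rng(0)
V, pairs, d = 16, 8, 64
E = rng.standard_normal((V, d)) / np.sqrt(d)      # random unit-scale token embeddings

def sample_task():
    keys = rng.choice(V, size=pairs, replace=False)   # distinct in-context keys
    vals = rng.integers(V, size=pairs)                # associated values
    q = rng.integers(pairs)                           # which key is queried
    return keys, vals, keys[q], vals[q]

def recall_with_attention(keys, vals, query, beta=8.0):
    # Softmax attention from the query embedding to the key embeddings; the
    # inverse temperature beta stands in for trained attention weights that
    # sharpen the match on the repeated key.
    scores = beta * (E[keys] @ E[query])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    votes = np.zeros(V)
    for wi, v in zip(w, vals):
        votes[v] += wi                     # weighted vote over associated values
    return int(votes.argmax())

keys, vals, q, target = sample_task()
print(recall_with_attention(keys, vals, q) == target)   # True with high probability
```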


Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis

Pogăcean, Paul-Andrei, Avram, Sanda-Maria

arXiv.org Artificial Intelligence

The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determination that leverages monogram and bigram frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80\% accuracy on texts shorter than 150 characters and reaches 100\% accuracy for longer texts. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
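A minimal sketch of the rank-based method: bigram frequency rankings are built from reference text per language, and a query text is assigned to the language whose rank profile is closest in Minkowski distance (p = 1 here). The reference snippets, profile size, and out-of-profile penalty are illustrative assumptions, not the paper's exact setup.

```python
# Frequency-rank language identification with a Minkowski distance.
from collections import Counter

def bigram_ranks(text: str, top: int = 30) -> dict[str, int]:
    text = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {bg: r for r, (bg, _) in enumerate(counts.most_common(top))}

def minkowski_rank_distance(a: dict[str, int], b: dict[str, int], p: float = 1.0) -> float:
    # Bigrams missing from a profile receive the maximum penalty rank,
    # as in classic rank-based language ID.
    max_rank = max(len(a), len(b))
    return sum(abs(a.get(k, max_rank) - b.get(k, max_rank)) ** p
               for k in set(a) | set(b)) ** (1 / p)

profiles = {lang: bigram_ranks(ref) for lang, ref in {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "ro": "vulpea maronie sare repede peste cainele lenes si apoi fuge departe",
}.items()}  # tiny placeholder reference texts (diacritics stripped)

def detect(text: str) -> str:
    q = bigram_ranks(text)
    return min(profiles, key=lambda lang: minkowski_rank_distance(q, profiles[lang]))

print(detect("where is the dog going"))   # en
```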


Language Models over Canonical Byte-Pair Encodings

Vieira, Tim, Liu, Tianyu, Pasti, Clemente, Emara, Yahya, DuSell, Brian, LeBrun, Benjamin, Giulianelli, Mario, Gastaldi, Juan Luis, O'Donnell, Timothy J., Cotterell, Ryan

arXiv.org Artificial Intelligence

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string -- these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
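A tokenization is canonical exactly when re-encoding its decoded string reproduces it, which is the check that "canonicality by conditioning" can use to reject noncanonical continuations at inference time. The sketch below illustrates this with the Hugging Face GPT-2 tokenizer as a stand-in for any deterministic BPE tokenizer.

```python
# Canonicality check: a token string is canonical iff the deterministic
# tokenizer maps its decoded text back to exactly the same token sequence.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def is_canonical(token_ids: list[int]) -> bool:
    return tok.encode(tok.decode(token_ids)) == token_ids

canonical = tok.encode("hello world")
print(is_canonical(canonical))        # True

# Splitting " world" into " wor" + "ld" decodes to the same string, but it
# is a tokenization the deterministic BPE encoder would never emit.
noncanonical = canonical[:-1] + tok.encode(" wor") + tok.encode("ld")
print(is_canonical(noncanonical))     # False
```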


Parameterized Synthetic Text Generation with SimpleStories

Finke, Lennart, Sreedhara, Chandan, Dooms, Thomas, Allen, Mat, Zhang, Emerald, Rodriguez, Juan Diego, Nabeshima, Noa, Marshall, Thomas, Braun, Dan

arXiv.org Artificial Intelligence

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier for the smallest language model, by parameter count, that outputs grammatical natural language.
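A hedged sketch of what "parameterizing prompts" can look like in practice: feature axes are sampled independently and slotted into a template, giving controlled diversity at scale. The feature lists and template here are invented for illustration and are not the SimpleStories parameters.

```python
# Parameterized prompt generation: independent feature axes -> one template.
import random

FEATURES = {  # invented axes; the real dataset defines its own parameter set
    "theme":    ["friendship", "curiosity", "honesty"],
    "tense":    ["past", "present"],
    "audience": ["a 5-year-old", "a 10-year-old"],
    "twist":    ["a talking animal", "a hidden door", "a lost letter"],
}

TEMPLATE = ("Write a short story in simple language for {audience}, "
            "in the {tense} tense, about {theme}, featuring {twist}.")

def sample_prompt(rng: random.Random) -> str:
    # Independent sampling per axis keeps each feature's marginal uniform,
    # so story characteristics stay controllable at scale.
    return TEMPLATE.format(**{k: rng.choice(v) for k, v in FEATURES.items()})

rng = random.Random(0)
for _ in range(3):
    print(sample_prompt(rng))
```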


Exploration of COVID-19 Discourse on Twitter: American Politician Edition

Kim, Cindy, Puchall, Daniela, Liang, Jiangyi, Kim, Jiwon

arXiv.org Artificial Intelligence

The advent of the COVID-19 pandemic has undoubtedly affected the political scene worldwide, and the introduction of new terminology and public opinions regarding the virus has further polarized partisan stances. Using a collection of tweets gathered from leading American political figures online (Republican and Democratic), we explored the partisan differences in approach, response, and attitude toward handling the international crisis. We used bag-of-words, bigram, and TF-IDF models to identify and analyze keywords, topics, and overall sentiments from each party. Results suggest that Democrats are more concerned with the casualties of the pandemic and offer more medical precautions and recommendations to the public, whereas Republicans are more invested in political responsibilities such as keeping the public updated through the media and carefully watching the progress of the virus. We propose a systematic approach to predicting and distinguishing a tweet's political stance (left- or right-leaning) based on its COVID-19-related terms, using different classification algorithms on different language models.
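The pipeline the abstract describes maps naturally onto a standard TF-IDF-plus-linear-classifier setup; the sketch below shows one plausible instantiation with scikit-learn, using invented placeholder tweets and labels rather than the study's data.

```python
# TF-IDF over unigrams+bigrams feeding a linear stance classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [  # invented placeholders, not the collected politician tweets
    "Wear a mask and follow CDC guidance to protect our communities.",
    "Hospitals need funding now; this pandemic demands a medical response.",
    "We are keeping the public updated as we monitor the virus closely.",
    "Reopening safely while we watch the progress of the outbreak.",
]
labels = ["left", "left", "right", "right"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),  # unigrams + bigrams
    LogisticRegression(),
)
model.fit(tweets, labels)
print(model.predict(["More testing and medical supplies for our hospitals"]))
```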


Is analogy enough to draw novel adjective-noun inferences?

Ross, Hayley, Davidson, Kathryn, Kim, Najoung

arXiv.org Artificial Intelligence

Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without the need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.
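One plausible reading of the analogical baseline: represent each known adjective-noun phrase by a lexical vector, retrieve the nearest known phrase to a novel combination, and copy its inference label. The toy vectors and labels below are invented; the paper works with real lexical similarities and the Ross et al. (2025) dataset.

```python
# Analogy-by-similarity: copy the inference label of the nearest known phrase.
import numpy as np

vecs = {  # toy 3-d "embeddings" standing in for real lexical vectors
    "fake": np.array([1.0, 0.0, 0.2]), "counterfeit": np.array([0.9, 0.1, 0.3]),
    "red":  np.array([0.0, 1.0, 0.1]),
    "gun":  np.array([0.2, 0.3, 1.0]), "car": np.array([0.1, 0.4, 0.9]),
}
# Known inference labels: does "ADJ NOUN" entail "is a NOUN"?
known = {("fake", "gun"): False, ("red", "car"): True}

def phrase_vec(adj: str, noun: str) -> np.ndarray:
    v = vecs[adj] + vecs[noun]
    return v / np.linalg.norm(v)

def analogical_label(adj: str, noun: str) -> bool:
    q = phrase_vec(adj, noun)
    best = max(known, key=lambda p: phrase_vec(*p) @ q)  # nearest known phrase
    return known[best]

# Copies the label from the most similar known phrase ("fake gun" here).
print(analogical_label("counterfeit", "car"))
```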


Measuring Political Preferences in AI Systems: An Integrative Approach

Rozado, David

arXiv.org Artificial Intelligence

Political biases in Large Language Model (LLM)-based artificial intelligence (AI) systems, such as OpenAI's ChatGPT or Google's Gemini, have been previously reported. While several prior studies have attempted to quantify these biases using political orientation tests, such approaches are limited by the tests' potential calibration biases and by constrained response formats that do not reflect real-world human-AI interactions. This study employs a multi-method approach to assess political bias in leading AI systems, integrating four complementary methodologies: (1) linguistic comparison of AI-generated text with the language used by Republican and Democratic U.S. Congress members, (2) analysis of political viewpoints embedded in AI-generated policy recommendations, (3) sentiment analysis of AI-generated text toward politically affiliated public figures, and (4) standardized political orientation testing. Results indicate a consistent left-leaning bias across most contemporary AI systems, with varying degrees of intensity. However, this bias is not an inherent feature of LLMs; prior research demonstrates that fine-tuning with politically skewed data can realign these models across the ideological spectrum. The presence of systematic political bias in AI systems poses risks, including reduced viewpoint diversity, increased societal polarization, and the potential for public mistrust in AI technologies. To mitigate these risks, AI systems should be designed to prioritize factual accuracy while maintaining neutrality on most lawful normative issues. Furthermore, independent monitoring platforms are necessary to ensure transparency, accountability, and responsible AI development. Recent advancements in AI technology, exemplified by Large Language Models (LLMs) like ChatGPT, represent one of the most significant technological breakthroughs in recent decades. The ability of AI systems to understand and generate human-like natural language has unlocked new possibilities for automation, human-computer interaction, content generation, and information retrieval. However, these impressive capabilities have also raised concerns about the potential biases that such systems might harbor [1], [2], [3], [4]. Preliminary evidence has suggested that AI systems exhibit political biases in the textual content they generate [2], [5], [6].
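Of the four methodologies, (3) is the most mechanical, and a small sketch may help make it concrete: score the sentiment of AI-generated text about politically affiliated figures and compare the averages by affiliation. VADER is an illustrative off-the-shelf scorer here, and the texts are invented placeholders rather than the study's data.

```python
# Sentiment comparison by political affiliation, per methodology (3).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

ai_outputs = {  # (figure, affiliation) -> AI-generated blurb (placeholders)
    ("Figure A", "Democratic"): "A thoughtful leader praised for unifying people.",
    ("Figure B", "Republican"): "A divisive figure criticized for harsh remarks.",
}

by_party: dict[str, list[float]] = {}
for (figure, party), text in ai_outputs.items():
    score = sia.polarity_scores(text)["compound"]   # -1 (negative) .. +1 (positive)
    by_party.setdefault(party, []).append(score)

# A systematic gap between the two averages would indicate asymmetric
# sentiment, one of the four signals the study aggregates.
for party, scores in by_party.items():
    print(party, sum(scores) / len(scores))
```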


Hebbian learning the local structure of language

Eugenio, P. Myles

arXiv.org Artificial Intelligence

Learning in the brain is local and unsupervised (Hebbian). We derive the foundations of an effective human language model inspired by these microscopic constraints. It has two parts: (1) a hierarchy of neurons which learns to tokenize words from text (whichiswhatyoudowhenyoureadthis); and (2) additional neurons which bind the learned semantics-free patterns of the tokenizer into a semantically meaningful token (an embedding). The model permits continuous parallel learning without forgetting, and it is a powerful tokenizer that performs a renormalization-group transformation. This allows it to exploit redundancy, such that it generates tokens which are always decomposable into a basis set (e.g. an alphabet), and can mix features learned from multiple languages. We find that the structure of this model allows it to learn a natural language morphology WITHOUT data. The language data generated by this model predicts the correct distribution of word-forming patterns observed in real languages, and further demonstrates why, microscopically, human speech is broken up into words. This model provides the basis for understanding the microscopic origins of language and human creativity.
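For readers unfamiliar with the learning rule the first sentence invokes, here is a minimal sketch: a single linear neuron reading a two-character window, updated with Oja's normalized variant of the Hebbian rule so the weights stay bounded and drift toward the most frequent local pattern. The character-window setup is an illustrative assumption, not the paper's hierarchical tokenizer.

```python
# Local, unsupervised (Hebbian) learning of a frequent character bigram.
import numpy as np

rng = np.random.default_rng(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
text = "the cat sat on the mat the cat sat " * 50

def one_hot(ch: str) -> np.ndarray:
    v = np.zeros(len(alphabet))
    v[alphabet.index(ch)] = 1.0
    return v

# One linear neuron over a 2-character window; Oja's rule is the Hebbian
# update (grow weights when input and output co-fire) plus a decay term
# that keeps the weight vector bounded.
w = rng.standard_normal(2 * len(alphabet)) * 0.01
eta = 0.01
for i in range(len(text) - 1):
    x = np.concatenate([one_hot(text[i]), one_hot(text[i + 1])])
    y = w @ x
    w += eta * y * (x - y * w)   # Oja's rule

# Inspect which characters the neuron binds most strongly in each slot.
top = np.argsort(w)[-2:]
print([alphabet[i % len(alphabet)] for i in top])
```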


Tokenized SAEs: Disentangling SAE Reconstructions

Dooms, Thomas, Wilhelm, Daniel

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in the training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. We achieve this by introducing a per-token bias, which provides a stronger baseline so that features are free to capture the more interesting residual structure. As a result, the SAE learns significantly more interesting features and achieves improved reconstruction in sparse regimes.
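A minimal sketch of the per-token bias idea, assuming a generic ReLU/L1 SAE recipe rather than the exact RES-JB configuration: reconstruction is split into a learned per-token baseline plus the usual sparse feature term, so features model the residual rather than simple input statistics.

```python
# SAE with a per-token bias that disentangles token reconstruction
# from feature reconstruction.
import torch
import torch.nn as nn

class TokenizedSAE(nn.Module):
    def __init__(self, d_model: int, d_features: int, vocab_size: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model, bias=False)
        self.token_bias = nn.Embedding(vocab_size, d_model)  # per-token baseline

    def forward(self, acts: torch.Tensor, token_ids: torch.Tensor):
        base = self.token_bias(token_ids)        # captures simple input statistics
        f = torch.relu(self.enc(acts - base))    # features model the residual
        recon = self.dec(f) + base
        return recon, f

sae = TokenizedSAE(d_model=512, d_features=4096, vocab_size=50257)
acts = torch.randn(8, 512)                       # residual-stream activations
ids = torch.randint(0, 50257, (8,))
recon, f = sae(acts, ids)
loss = (recon - acts).pow(2).mean() + 1e-3 * f.abs().mean()  # MSE + L1 sparsity
print(loss.item())
```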