AITopics

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Neural Information Processing SystemsJun-17-2026, 21:51:35 GMT

Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.

artificial intelligence, proceedings, subnetwork, (7 more...)

Genre: Research Report > New Finding (0.60)

Technology: Information Technology > Artificial Intelligence (1.00)

Neural Information Processing SystemsFeb-15-2026, 14:48:29 GMT

A Supplementary Analysis

To evaluate TSLD's efficiency, we detail training speeds and GPU memory consumption for various Our analysis of confidence disparity in token predictions, detailed in Section 4.2, extends beyond a In fact, this observed trend is consistently present across various GLM models. These errors are visualized using a heatmap plot (Fig. A2 top), For the OPT -6.7B model, quantization error is measured for the 5th and 15th layers. LLaMA-7B model, quantization errors are depicted for input sequence lengths of 128 and 512. From left to right: OPT -6.7B, LLaMA-7B, and LLaMA-2-7B. However, as we delve deeper into the layers of OPT -6.7B or introduce longer input sequences to LLaMA-7B, this phenomenon becomes less pronounced.

large language model, machine learning, natural language, (20 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Artificial IntelligenceDec-5-2025

Enhancing next token prediction based pre-training for jet foundation models

Birk, Joschka, Hallin, Anna, Kasieczka, Gregor, Madzharova, Nikol, Pang, Ian, Shih, David

Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next token prediction, building on the initial work of OmniJet-$α$. Instead of tokenizing particles and subsequently only using the token-ID as the model input for both the generative and the classification task, we adopt a hybrid setup, which allows us to use continuous feature vectors as model input while only using token-IDs in the next token prediction target. Secondly, we explore a combined pre-training strategy that combines masked particle modeling and generative learning objectives. Taken together, these changes greatly improve the performance in downstream classification tasks without any loss in generative performance.

artificial intelligence, machine learning, token prediction, (18 more...)

2512.04149

Country: Europe > Germany (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial IntelligenceDec-4-2025

PretrainZero: Reinforcement Active Pretraining

Xing, Xingrun, Fan, Zhiyuan, Lou, Jie, Li, Guoqi, Zhang, Jiajun, Zhang, Debing

Mimicking human behavior to actively learning from general experience and achieve artificial general intelligence has always been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, i.e., software and math, but still rely heavily on verifiable rewards in specific domains, placing a significant bottleneck to extend the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative contents from pretraining corpus, and reason to predict these contents by RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3 to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base for 8.43, 5.96 and 10.60 on MMLU-Pro, SuperGPQA and math average benchmarks. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2512.03442

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
(2 more...)

Chang, Tyler A., Bergen, Benjamin K.

Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

arXiv.org Artificial IntelligenceDec-4-2025

In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.

large language model, machine learning, subnetwork, (20 more...)

2504.15471

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Gulzar, Kashaf, Wagner, Dominik, Bayerl, Sebastian P., Hönig, Florian, Bocklet, Tobias, Riedhammer, Korbinian

On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

arXiv.org Artificial IntelligenceDec-3-2025

Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.

artificial intelligence, fine-tuning, speech recognition, (15 more...)

2512.02027

Country: Europe > Germany (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Jon, Josef, Bojar, Ondřej

Finetuning LLMs for EvaCun 2025 token prediction shared task

arXiv.org Artificial IntelligenceOct-20-2025

In this paper, we present our submission for the token prediction task of EvaCun 2025. Our sys-tems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only pos-sess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.

large language model, natural language, prediction, (18 more...)

2510.15561

Country: Asia > Thailand (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)

arXiv.org Artificial IntelligenceOct-10-2025

Memory Retrieval and Consolidation in Large Language Models through Function Tokens

Zhang, Shaohua, Lin, Yuan, Li, Hang

The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

function token, large language model, machine learning, (20 more...)

2510.08203

Country:

Asia (0.94)
North America > United States (0.46)
Europe > United Kingdom (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-9-2025, 00:08:07 GMT

A Supplementary Analysis

To evaluate TSLD's efficiency, we detail training speeds and GPU memory consumption for various Our analysis of confidence disparity in token predictions, detailed in Section 4.2, extends beyond a In fact, this observed trend is consistently present across various GLM models. These errors are visualized using a heatmap plot (Fig. A2 top), For the OPT -6.7B model, quantization error is measured for the 5th and 15th layers. LLaMA-7B model, quantization errors are depicted for input sequence lengths of 128 and 512. From left to right: OPT -6.7B, LLaMA-7B, and LLaMA-2-7B. However, as we delve deeper into the layers of OPT -6.7B or introduce longer input sequences to LLaMA-7B, this phenomenon becomes less pronounced.

large language model, machine learning, natural language, (20 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)