AITopics | tokenizer


tokenizer


Textually Pretrained Speech Language Models

Neural Information Processing Systems | Apr-29-2026, 17:50:20 GMT

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available.

arxiv preprint arxiv, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Europe (0.67)
North America > United States > Minnesota (0.28)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.68)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)


import bisect
import re

Neural Information Processing Systems | Apr-24-2026, 11:28:30 GMT

In order to convert the dataset to NER format we suggest tokenizing Tweet text and utilizing the character offsets to identify mention tokens. E.g., the tweet "just setting up my twttr" with mention offsets 19 and 24 and DBpedia category Organization can be converted to the NER BIO format as follows: tokens, starts, ends = tokenize_with_offsets("just setting up my twttr") and then assigning O labels to all tokens outside the phrase start and end offsets and B-ORG and I-ORG labels to all tokens within the phrase offsets. This approach works as long as the offsets returned by the tokenizer correspond to the offsets of the phrase in the original text, i.e., tokenization is non-destructive. See example code in listing 1. A system span must match a gold span exactly to be counted as correct.
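The offset-based conversion described above can be sketched roughly as follows. This is a minimal illustration and not the paper's listing 1: the whitespace-based tokenize_with_offsets and the helper spans_to_bio are assumptions made for this example, and each span is assumed to be a (start, end, label) triple of character offsets with an exclusive end.

```python
import bisect
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer that keeps character offsets (non-destructive)."""
    tokens, starts, ends = [], [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        starts.append(match.start())
        ends.append(match.end())
    return tokens, starts, ends


def spans_to_bio(text, spans):
    """Assign BIO labels to tokens given (start, end, label) character spans."""
    tokens, starts, ends = tokenize_with_offsets(text)
    labels = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        # First token starting at or after the span start.
        first = bisect.bisect_left(starts, span_start)
        # One past the last token ending at or before the span end.
        last = bisect.bisect_right(ends, span_end)
        # Only tokens fully contained in the span are labeled.
        for i in range(first, last):
            labels[i] = ("B-" if i == first else "I-") + label
    return tokens, labels


tokens, labels = spans_to_bio("just setting up my twttr", [(19, 24, "ORG")])
print(list(zip(tokens, labels)))
```

For the example tweet, the only token inside offsets 19 to 24 is "twttr", so it receives B-ORG and every other token receives O. A tokenizer that normalizes or drops characters would shift the offsets and break this lookup, which is why the text requires tokenization to be non-destructive.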

machine learning, natural language, tweet, (21 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.99)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.54)


Enhancing Large Language Models through Adaptive Tokenizers

Neural Information Processing Systems | Mar-22-2026, 13:21:52 GMT

Tokenizers serve as crucial interfaces between models and linguistic data, substantially influencing the efficacy and precision of large language models (LLMs). Traditional tokenization methods often rely on static frequency-based statistics and are not inherently synchronized with LLM architectures, which may limit model performance. In this study, we propose a simple but effective method to learn tokenizers specifically engineered for seamless integration with LLMs. Initiating with a broad initial vocabulary, we refine our tokenizer by monitoring changes in the model's perplexity during training, allowing for the selection of a tokenizer that is closely aligned with the model's evolving dynamics. Through iterative refinement, we develop an optimized tokenizer. Our empirical evaluations demonstrate that this adaptive approach significantly enhances accuracy compared to conventional methods, maintaining comparable vocabulary sizes and affirming its potential to improve LLM functionality.

large language model, natural language, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Zero-Shot Tokenizer Transfer

Neural Information Processing Systems | Mar-20-2026, 15:01:43 GMT

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer.
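To make the embedding problem concrete, here is a small sketch of one common heuristic baseline: a token from the new vocabulary is segmented with the old vocabulary, and its embedding is set to the mean of the old pieces' embeddings. This is not the method proposed in the paper, which the excerpt above does not describe. The function names, the greedy longest-match segmentation, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation of `text` into pieces found in `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {text!r} at position {i}")
    return pieces


def init_new_embeddings(new_tokens, old_embeddings):
    """Heuristic initialization: copy shared tokens, average old pieces otherwise."""
    new_embeddings = {}
    for token in new_tokens:
        if token in old_embeddings:
            # Token exists in both vocabularies: reuse its embedding.
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = greedy_tokenize(token, old_embeddings)
        dim = len(old_embeddings[pieces[0]])
        new_embeddings[token] = [
            sum(old_embeddings[p][d] for p in pieces) / len(pieces)
            for d in range(dim)
        ]
    return new_embeddings


# Toy old vocabulary with 2-dimensional embeddings (hypothetical values).
old = {"token": [1.0, 0.0], "izer": [0.0, 1.0], "t": [2.0, 2.0]}
print(init_new_embeddings(["tokenizer", "t"], old))
```

Here "tokenizer" is absent from the old vocabulary, so it is split into "token" and "izer" and receives the average of their vectors, while "t" is copied unchanged. Heuristics of this kind give a starting point without any training, but they do not guarantee that model quality is preserved after the swap.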

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.58)


Image Understanding Makes for A Good Tokenizer for Image Generation

Neural Information Processing Systems | Mar-19-2026, 14:01:28 GMT

Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to project images into token sequences. Currently, **pixel reconstruction** (e.g., VQGAN) dominates the training objective for image tokenizers. In contrast, our approach adopts the **feature reconstruction** objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders.
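As a rough illustration of how a vector-quantized tokenizer turns image features into tokens, the sketch below maps each feature vector to the index of its nearest codebook entry and then measures a reconstruction error against a chosen target. It is a simplification made for this example, not the paper's implementation: the function names are hypothetical, the "decoder" is just a codebook lookup, and the inputs are toy two-dimensional vectors instead of encoder outputs.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def reconstruction_loss(tokens, codebook, targets):
    """Mean squared error between looked-up codebook vectors and the targets.

    With pixel patches as targets this mimics a pixel reconstruction
    objective; with features from a pretrained understanding encoder as
    targets it mimics a feature reconstruction objective.
    """
    total, count = 0.0, 0
    for token, target in zip(tokens, targets):
        total += squared_distance(codebook[token], target)
        count += len(target)
    return total / count


codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.6]]
tokens = quantize(features, codebook)
print(tokens, reconstruction_loss(tokens, codebook, features))
```

The token sequence is the list of codebook indices. The two objectives contrasted in the abstract differ in what the reconstruction is compared against, which in this sketch corresponds to the targets argument.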

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (0.66)
Information Technology > Artificial Intelligence > Machine Learning (0.43)


Data Mixture Inference Attack: BPE Tokenizers Reveal Training Data Compositions

Neural Information Processing Systems | Mar-18-2026, 04:54:36 GMT

The pretraining data of today's strongest language models remains opaque, even when their parameters are open-sourced. In particular, little is known about the proportions of different domains, languages, or code represented in the data. While a long line of membership inference attacks aim to identify training examples on an instance level, they do not extend easily to statistics about the corpus. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of the pretraining data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered vocabulary learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first token is the most common byte pair, the second is the most common pair after merging the first token, and so on.
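The key insight about merge order can be seen in a toy BPE trainer. The sketch below is an assumption-laden simplification for illustration only: it works on characters instead of bytes, skips pretokenization, breaks ties deterministically by comparing the pairs themselves, and shows only the forward direction (data mixture to merge order), not the inference attack.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn an ordered list of (pair, count) merges from a list of words."""
    # Each distinct word is a tuple of symbols weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append((best, pair_counts[best]))
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


prose_heavy = ["the"] * 5 + ["then"] * 2 + ["def"] * 1
code_heavy = ["def"] * 5 + ["the"] * 1
print(train_bpe(prose_heavy, 3))
print(train_bpe(code_heavy, 1))
```

On the prose-heavy toy corpus the first merges build up "th", "the", and "then", while on the code-heavy corpus the first merge comes from "def". The ordered merge list therefore carries a trace of what the tokenizer was trained on.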

artificial intelligence, machine learning, proceedings, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.58)


ce9e92e3de2372a4b93353eb7f3dc0bd-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Feb-19-2026, 12:00:35 GMT

crowdsourced data, dataset, pipeline, (14 more...)

Neural Information Processing Systems

Country:

Africa > Niger (0.07)
Europe > Germany > Saxony > Leipzig (0.04)
Asia > Vietnam (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Communications > Social Media > Crowdsourcing (0.31)


ce9e92e3de2372a4b93353eb7f3dc0bd-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Feb-19-2026, 12:00:31 GMT

computational linguistic, corpus, dataset, (11 more...)

Neural Information Processing Systems

Country:

Europe > Slovenia (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(29 more...)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
(4 more...)


Image Understanding Makes for A Good Tokenizer for Image Generation

Luting Wang, Yang Zhao

Neural Information Processing Systems | Feb-19-2026, 09:11:34 GMT

Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to map images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks.

artificial intelligence, machine learning, tokenizer, (16 more...)

Neural Information Processing Systems

Country: Asia > China > Zhejiang Province (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology: