tokenization method
- Asia > Middle East > Israel (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent.
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > United States > New York (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
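The abstract does not spell out the mechanism MxDNA uses, but the core idea of making segmentation trainable by gradient descent can be sketched in a few lines. The toy PyTorch module below is purely illustrative, not MxDNA's actual architecture: it scores each gap between adjacent bases and relaxes the discrete boundary decision with Gumbel noise, so a downstream loss can shape where tokens begin and end.

```python
import torch
import torch.nn as nn

class LearnedBoundaries(nn.Module):
    """Toy illustration of trainable segmentation; NOT the actual
    MxDNA architecture. Scores each gap between adjacent bases and
    relaxes the discrete boundary choice so gradients can flow."""
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Embedding(4, dim)       # A, C, G, T
        self.scorer = nn.Linear(2 * dim, 1)     # one score per gap

    def forward(self, seq, tau=1.0):
        x = self.embed(seq)                                  # (B, L, D)
        gaps = torch.cat([x[:, :-1], x[:, 1:]], dim=-1)      # (B, L-1, 2D)
        logits = self.scorer(gaps).squeeze(-1)               # (B, L-1)
        # Gumbel-sigmoid: a soft, differentiable "is this a boundary?"
        u = torch.rand_like(logits).clamp_min(1e-9)
        g = -torch.log(-torch.log(u))
        return torch.sigmoid((logits + g) / tau)

model = LearnedBoundaries()
seq = torch.randint(0, 4, (2, 16))       # two toy DNA sequences
p_boundary = model(seq)                  # (2, 15) soft boundary probs
p_boundary.mean().backward()             # dummy loss: gradients reach the
                                         # scorer, so the model can learn
                                         # where tokens should end
```

The Gumbel-sigmoid relaxation is one standard way to keep a discrete choice differentiable; a straight-through estimator would be another.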
Tokens with Meaning: A Hybrid Tokenization Approach for NLP
Bayram, M. Ali; Fincan, Ali Arda; Gümüş, Ahmet Semih; Karakaş, Sercan; Diri, Banu; Yıldırım, Savaş; Çelik, Demircan
Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
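A minimal sketch of the hybrid idea described above, assuming toy dictionaries (the names AFFIX_CANON, ROOTS, and ROOT_CANON and their contents are invented here, not the paper's actual resources): phonological variants such as -ler/-lar share one token ID, altered roots such as kitab- normalize to kitap, and anything left over drops to a BPE fallback.

```python
# Hypothetical sketch: shared IDs for vowel-harmony affix variants,
# root normalization, and a statistical fallback for OOV material.
AFFIX_CANON = {"lar": "<PLURAL>", "ler": "<PLURAL>",   # vowel-harmony pair
               "da": "<LOC>", "de": "<LOC>"}
ROOTS = {"kitap", "kitab", "ev"}                       # kitabı ~ kitap
ROOT_CANON = {"kitab": "kitap"}                        # altered root form

def tokenize(word, bpe_fallback):
    word = word.lower()   # an UPPERCASE marker would be emitted here
    for root in sorted(ROOTS, key=len, reverse=True):
        if word.startswith(root):
            tokens = [ROOT_CANON.get(root, root)]
            rest = word[len(root):]
            while rest:
                for affix in sorted(AFFIX_CANON, key=len, reverse=True):
                    if rest.startswith(affix):
                        tokens.append(AFFIX_CANON[affix])
                        rest = rest[len(affix):]
                        break
                else:
                    tokens += bpe_fallback(rest)   # OOV tail -> subwords
                    rest = ""
            return tokens
    return bpe_fallback(word)                      # unknown root -> BPE

print(tokenize("kitaplar", lambda s: [s]))  # ['kitap', '<PLURAL>']
print(tokenize("evlerde",  lambda s: [s]))  # ['ev', '<PLURAL>', '<LOC>']
```

Note how both -lar and -ler map to the same <PLURAL> identifier, which is the redundancy reduction the abstract describes.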
Incorporating Domain Knowledge into Materials Tokenization
Oh, Yerim; Park, Jun-Hyung; Kim, Junho; Kim, SungHo; Lee, SangKeun
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Built on MatDetector, trained on our materials knowledge base, and a re-ranking method that prioritizes material concepts during token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of 4% and 2% in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Germany > Berlin (0.04)
- Energy (0.93)
- Health & Medicine (0.67)
- Materials > Chemicals > Commodity Chemicals (0.46)
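A rough sketch of the re-ranking idea, with made-up scores standing in for MatDetector: instead of always merging the most frequent pair as vanilla BPE does, each candidate merge's frequency is boosted when the merged string is a recognized material concept, so concepts like LiFePO4 survive as whole tokens. The lexicon, weights, and function names below are illustrative assumptions.

```python
from collections import Counter

def material_score(token):
    """Stand-in for a trained detector like MatDetector; the lexicon
    and scores here are invented for illustration."""
    KNOWN = {"LiFePO4", "SiO2", "TiO2"}
    return 1.0 if token in KNOWN else 0.0

def best_merge(pair_counts, alpha=1.0):
    """Pick the next BPE merge, boosting pairs whose merged string is
    a recognized material concept (the re-ranking idea, simplified)."""
    def rank(pair):
        freq = pair_counts[pair]
        return freq * (1.0 + alpha * material_score(pair[0] + pair[1]))
    return max(pair_counts, key=rank)

pairs = Counter({("LiFePO", "4"): 40, ("the", "_"): 60})
# Vanilla BPE would merge ("the", "_") first; the domain boost keeps
# the material concept intact instead.
print(best_merge(pairs))    # -> ('LiFePO', '4')
```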
Tokenization of Gaze Data
Rolff, Tim; Karimian, Jurik; Hypki, Niklas; Schmidt, Susanne; Lappe, Markus; Steinicke, Frank
A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data, owing to its continuous, signal-like nature. However, a corresponding tokenization strategy would allow the vision capabilities of pre-trained MLLMs to be used for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs. We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy, measuring its generative and predictive performance. Overall, we found that a quantile tokenizer outperforms all others in predicting gaze positions, and k-means is best when predicting gaze velocities.
- Europe > Germany > North Rhine-Westphalia > Münster Region > Münster (0.14)
- Europe > Germany > Hamburg (0.04)
- Oceania > New Zealand > South Island > Canterbury Region > Christchurch (0.04)
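The two winning tokenizers are standard techniques and easy to sketch. The snippet below, with assumed bin and cluster counts, shows a per-axis quantile tokenizer (equal-occupancy bins per dimension, flattened to one vocabulary) and a basic k-means tokenizer (centroids as the vocabulary, nearest centroid as the token).

```python
import numpy as np

rng = np.random.default_rng(0)
gaze = rng.random((1000, 2))                  # toy (x, y) gaze positions

def quantile_tokenize(data, n_bins=16):
    """Per-axis quantile bin edges give equal-occupancy bins; each
    sample maps to a discrete (x_bin, y_bin) pair, flattened to one id."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]          # interior quantiles
    edges = np.quantile(data, qs, axis=0)             # (n_bins-1, dims)
    ids = np.stack([np.searchsorted(edges[:, d], data[:, d])
                    for d in range(data.shape[1])], axis=1)
    return ids[:, 0] * n_bins + ids[:, 1]

def kmeans_tokenize(data, k=64, iters=20):
    """Cluster centroids act as the vocabulary; each sample's token is
    its nearest centroid (plain Lloyd's iterations)."""
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), 1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean(0)
    return labels

print(quantile_tokenize(gaze)[:8])   # quantile token ids
print(kmeans_tokenize(gaze)[:8])     # k-means token ids
```

The same two functions apply unchanged to velocities (frame-to-frame position differences), which is where the abstract reports k-means winning.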
Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations
Dombrowski, Alistair; Engelhardt, Beatrix; Fairbrother, Dimitri; Evidail, Henry
Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The introduction of contextual morphogenesis establishes a self-organizing mechanism that restructures token boundaries based on learned contextual dependencies, allowing embeddings to evolve progressively across iterative processing steps. Empirical evaluations demonstrate that dynamically adjusted tokenization contributes to reductions in perplexity while maintaining representational stability, particularly in linguistically complex domains where static segmentation fails to capture nuanced dependencies. Computational trade-offs associated with self-organizing token structures indicate that additional processing overhead remains within feasible limits, provided that optimization strategies account for segmentation update efficiency. Comparative assessments across different linguistic corpora suggest that adaptive tokenization preserves interpretability while improving alignment with contextual cues, reinforcing the potential of morphogenetic segmentation mechanisms to refine predictive accuracy. Stability analyses confirm that evolving token structures maintain consistent segmentation behaviors across varied text distributions, ensuring that representational adaptations remain linguistically coherent. The effectiveness of contextual morphogenesis in refining structural stability and predictive performance highlights its viability as an alternative to traditional tokenization methods. Further analysis of computational efficiency considerations suggests that hybrid strategies integrating both static and dynamic segmentation techniques may offer a balanced approach to optimizing representational flexibility while maintaining inference efficiency.
- Research Report > Promising Solution (0.40)
- Overview > Innovation (0.40)
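The abstract does not specify the morphogenesis mechanism at an implementable level, but one caricature of "self-organizing token boundaries driven by contextual dependencies" is a merge pass that dissolves the boundary between adjacent tokens whose contextual embeddings nearly agree. The threshold and the whole procedure below are assumptions for illustration only, not the paper's method.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def merge_pass(tokens, embeddings, threshold=0.9):
    """One illustrative restructuring pass: dissolve the boundary
    between adjacent tokens whose contextual embeddings nearly agree.
    The paper's actual mechanism is not specified to this level."""
    out_tokens, out_embs = [tokens[0]], [embeddings[0]]
    for tok, emb in zip(tokens[1:], embeddings[1:]):
        if cosine(out_embs[-1], emb) > threshold:
            out_tokens[-1] += tok                    # merge into neighbor
            out_embs[-1] = (out_embs[-1] + emb) / 2  # blend embeddings
        else:
            out_tokens.append(tok)
            out_embs.append(emb)
    return out_tokens

rng = np.random.default_rng(1)
embs = [rng.standard_normal(8) for _ in range(4)]
embs[1] = embs[0] + 0.01 * rng.standard_normal(8)  # near-duplicate context
# The first two pieces should fuse; unrelated pieces stay separate.
print(merge_pass(["to", "ken", "iza", "tion"], embs))
```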