Cognetta, Marco
Tokenization as Finite-State Transduction
Cognetta, Marco, Okazaki, Naoaki
Tokenization is the first step in modern neural language model pipelines, in which an input text is converted into a sequence of subword tokens. We introduce from first principles a finite-state transduction framework that can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to a context-free grammar and the fact that it does not tokenize strings from left to right. One application of this framework is guided generation, where the outputs of a language model are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows the model's outputs to be simultaneously constrained to match a specified pattern and to adhere to the underlying tokenizer's canonical tokenization.
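To make the contrast in the abstract concrete, here is a minimal sketch (not the paper's transducer construction) of the two tokenizers it mentions: BPE applies learned merges in priority order anywhere in the word, so it does not proceed left to right, while MaxMatch (WordPiece) greedily takes the longest vocabulary prefix from left to right. The merge table and vocabulary below are toy examples, and the single-character fallback stands in for WordPiece's usual continuation-prefix and unknown-token handling.

```python
def bpe_encode(word, merges):
    """merges: list of (left, right) pairs in priority order (earlier = higher)."""
    tokens = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Pick the highest-priority merge applicable anywhere in the sequence --
        # note this is not a left-to-right scan over the input.
        best = min(
            (p for p in zip(tokens, tokens[1:]) if p in ranks),
            key=ranks.get,
            default=None,
        )
        if best is None:
            return tokens
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged


def maxmatch_encode(word, vocab):
    """Greedy longest-match over a subword vocabulary, left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest remaining prefix first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # simplified fallback: emit the character
            i += 1
    return tokens


print(bpe_encode("lower", [("l", "o"), ("lo", "w"), ("e", "r")]))   # ['low', 'er']
print(maxmatch_encode("lower", {"low", "er", "lo", "w"}))           # ['low', 'er']
```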
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Cognetta, Marco, Hiraoka, Tatsuya, Okazaki, Naoaki, Sennrich, Rico, Pinter, Yuval
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both to reduce model size and to improve model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance and can even incur heavy degradation.
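The following is a minimal sketch, with hypothetical frequency counts and merge history, of the trimming step described above: a subword whose corpus frequency falls below the threshold is recursively replaced by the pair of subwords it was merged from, so only sufficiently frequent subwords remain in the working vocabulary. It is an illustration of the idea, not the exact procedure of any particular library.

```python
def trim(token, freq, parents, threshold):
    """freq: corpus frequency per subword; parents: merge pair that formed each subword."""
    if freq.get(token, 0) >= threshold or token not in parents:
        return [token]                      # frequent enough, or an atomic symbol
    left, right = parents[token]
    return trim(left, freq, parents, threshold) + trim(right, freq, parents, threshold)


# Toy example: 'ther' was built from 'th' + 'er' but is rare in the corpus.
parents = {"ther": ("th", "er"), "th": ("t", "h"), "er": ("e", "r")}
freq = {"ther": 3, "th": 500, "er": 800}
print(trim("ther", freq, parents, threshold=50))    # ['th', 'er']
```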
Two Counterexamples to Tokenization and the Noiseless Channel
Cognetta, Marco, Zouhar, Vilém, Moon, Sangwhan, Okazaki, Naoaki
In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen. The Rényi efficiency is thus treated as a predictor of downstream performance (e.g., predicting BLEU for a machine translation task), without the expensive step of training multiple models with different tokenizers. Although the metric is useful, its predictive power is not perfect, and the authors note that there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture. We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing downstream model performance. These counterexamples expose cases where Rényi efficiency fails as an intrinsic tokenization metric and thus give insight into building more accurate predictors.
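As a point of reference for the metric discussed above, here is a minimal sketch assuming the usual definition: the Rényi entropy of the tokenizer's unigram token distribution, normalized by the maximum possible entropy log|V|. The token counts and the choice of alpha below are illustrative, not values from the paper.

```python
import math
from collections import Counter


def renyi_efficiency(token_counts, alpha, vocab_size=None):
    """Rényi entropy of the unigram distribution, normalized by log of the vocabulary size."""
    total = sum(token_counts.values())
    probs = [c / total for c in token_counts.values()]
    if vocab_size is None:
        vocab_size = len(token_counts)
    if alpha == 1.0:                                    # Shannon entropy in the limit
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


counts = Counter({"the": 1200, "re": 300, "token": 150, "iz": 90, "ation": 60})
print(renyi_efficiency(counts, alpha=2.5))
```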