AITopics | Pinter, Yuval

Collaborating Authors

Pinter, Yuval

Splintering Nonconcatenative Languages for Better Tokenization

Gazit, Bar, Shmidman, Shaltiel, Shmidman, Avi, Pinter, Yuval

arXiv.org Artificial Intelligence, Mar-18-2025

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit both with intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay, and on downstream tasks using BERT-architecture models trained for Hebrew.

splinter, reduction, tokenizer, (15 more...)

arXiv.org Artificial Intelligence

2503.14433

Country:

Europe (0.93)
Asia > Middle East > Israel (0.14)
Asia > Middle East > UAE (0.14)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Token-Level Privacy in Large Language Models

Harel, Re'em, Gilboa, Niv, Pinter, Yuval

arXiv.org Artificial Intelligence, Mar-5-2025

The use of language models as remote services requires transmitting private information to external providers, raising significant privacy concerns. This process not only risks exposing sensitive data to untrusted service providers but also leaves it vulnerable to interception by eavesdroppers. Existing privacy-preserving methods for natural language processing (NLP) interactions primarily rely on semantic similarity, overlooking the role of contextual information. In this work, we introduce $d_\chi$-stencil, a novel token-level privacy-preserving mechanism that integrates contextual and semantic information while ensuring strong privacy guarantees under the $d_\chi$ differential privacy framework, achieving $2\epsilon$-$d_\chi$-privacy. By incorporating both semantic and contextual nuances, $d_\chi$-stencil achieves a robust balance between privacy and utility. We evaluate $d_\chi$-stencil using state-of-the-art language models and diverse datasets, achieving a comparable or even better trade-off between utility and privacy than existing methods. This work highlights the potential of $d_\chi$-stencil to set a new standard for privacy-preserving NLP in modern, high-risk applications.
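
For orientation, the metric differential privacy ($d_\chi$-privacy) notion named in the abstract is commonly stated as below. This is the standard textbook form, not a formula taken from the paper; the symbols $M$ (the randomized mechanism), $\mathcal{X}$ (inputs), $\mathcal{Y}$ (outputs) and $d$ (a distance on inputs) are notation chosen here for illustration.

```latex
\[
  \Pr[M(x) \in S] \;\le\; e^{\epsilon\, d(x, x')} \,\Pr[M(x') \in S]
  \qquad \text{for all } x, x' \in \mathcal{X},\ S \subseteq \mathcal{Y}.
\]
```

Under this reading, the $2\epsilon$ guarantee mentioned in the abstract would correspond to the same bound with $\epsilon$ replaced by $2\epsilon$.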

large language model, machine learning, mechanism, (22 more...)

arXiv.org Artificial Intelligence

2503.03652

Country:

Asia > Middle East (0.93)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

How Much is Enough? The Diminishing Returns of Tokenization Training Data

Reddy, Varshini, Schmidt, Craig W., Pinter, Yuval, Tanner, Chris

arXiv.org Artificial Intelligence, Feb-27-2025

Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates the impact of tokenizer training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as the data size increases, highlighting a practical limit on how much further scaling the training data can improve tokenization quality. We analyze this phenomenon and attribute the saturation effect to the constraints imposed by the pre-tokenization stage of tokenization. These results offer valuable insights for optimizing the tokenization process and highlight potential avenues for future research in tokenization algorithms.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.20273

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.68)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Information Types in Product Reviews

Shapira, Ori, Pinter, Yuval

arXiv.org Artificial Intelligence, Feb-20-2025

Information in text is communicated in a way that supports a goal for its reader. Product reviews, for example, contain opinions, tips, product descriptions, and many other types of information that provide both direct insights and unexpected signals for downstream applications. We devise a typology of 24 communicative goals in sentences from the product review domain, and employ a zero-shot multi-label classifier that facilitates large-scale analyses of review data. In our experiments, we find that the combination of classes in the typology forecasts helpfulness and sentiment of reviews, while supplying explanations for these decisions. In addition, our typology enables analysis of review intent, effectiveness and rhetorical structure. Characterizing the types of information in reviews unlocks many opportunities for more effective consumption of this genre.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.14335

Country:

Europe (1.00)
Asia > Middle East (0.46)
Oceania > Australia (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services > e-Commerce Services (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Don't Touch My Diacritics

Gorman, Kyle, Pinter, Yuval

arXiv.org Artificial Intelligence, Oct-31-2024

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
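
The two failure modes named in the abstract, inconsistent encoding and outright removal of diacritics, can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not code from the paper: the example words and the helper names `normalize_nfc` and `strip_diacritics` are assumptions made here for illustration.

```python
import unicodedata

# The same visible word encoded two ways (illustrative example, not from the paper).
composed = "caf\u00e9"      # 'é' as a single precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

def normalize_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)

def strip_diacritics(text: str) -> str:
    """Lossy step: decompose, then drop combining marks.
    Shown only to make the information loss visible."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))

# Without normalization, the two encodings compare as different strings.
print(composed == decomposed)                                # False
# After consistent normalization they are identical.
print(normalize_nfc(composed) == normalize_nfc(decomposed))  # True
# Stripping diacritics collapses distinct words onto one string.
print(strip_diacritics("r\u00e9sum\u00e9"))                  # resume
```

A model or tokenizer that sees both encodings without normalization may treat them as unrelated strings, while stripping marks merges words that differ only in their diacritics.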

artificial intelligence, computational linguistic, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.24140

Country:

Europe (0.94)
Asia > Middle East > Israel (0.29)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

OMPar: Automatic Parallelization with AI-Driven Source-to-Source Compilation

Kadosh, Tal, Hasabnis, Niranjan, Soundararajan, Prema, Vo, Vy A., Capota, Mihai, Ahmed, Nesreen, Pinter, Yuval, Oren, Gal

arXiv.org Artificial Intelligence, Sep-23-2024

Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate the parallelization of C/C++ code using OpenMP pragmas. OMPar integrates Large Language Models (LLMs) through two key components: OMPify, which assesses loop parallelization potential, and MonoCoder-OMP, a new fine-tuned model which generates precise OpenMP pragmas. The evaluation of OMPar follows the same rigorous process applied to traditional tools like source-to-source AutoPar and ICPC compilers: (1) ensuring the generated code compiles and runs correctly in serial form, (2) assessing performance with the gradual addition of threads and corresponding physical cores, and (3) verifying and validating the correctness of the code's output. Benchmarks from HeCBench and ParEval are used to evaluate accuracy and performance. Experimental results demonstrate that OMPar significantly outperforms traditional methods, achieving higher accuracy in identifying parallelizable loops and generating efficient pragmas. Beyond accuracy, OMPar offers advantages such as the ability to work on partial or incomplete codebases and the capacity to continuously learn from new code patterns, enhancing its parallelization capabilities over time. These results underscore the potential of LLMs in revolutionizing automatic parallelization techniques, paving the way for more efficient and scalable parallel computing systems.

large language model, machine learning, parallelization, (19 more...)

arXiv.org Artificial Intelligence

2409.14771

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Protecting Privacy in Classifiers by Token Manipulation

Harel, Re'em, Elboher, Yair, Pinter, Yuval

arXiv.org Artificial Intelligence, Jul-3-2024

Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and the original text can be reconstructed by a sophisticated attacker. In comparison, the contextualized manipulation provides an improvement in performance.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2407.01334

Country:

Asia > Middle East > UAE (0.14)
Asia > Middle East > Israel (0.14)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Law > Civil Rights & Constitutional Law (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

Uzan, Omri, Schmidt, Craig W., Tanner, Chris, Pinter, Yuval

arXiv.org Artificial Intelligence, May-31-2024

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
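
To make "greedy inference" concrete, the sketch below segments a word by repeatedly taking the longest vocabulary entry that matches at the current position (longest-prefix matching, one common greedy strategy). It is an illustration under stated assumptions, not the paper's implementation: the function name `greedy_tokenize`, the toy vocabulary, and the `[UNK]` fallback are all hypothetical.

```python
def greedy_tokenize(word, vocab, unk="[UNK]"):
    """Segment `word` left to right, always taking the longest prefix found in `vocab`.
    Characters that match no vocabulary entry are emitted as `unk`."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it is a vocabulary entry or empty.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No entry starts here: emit the unknown token and skip one character.
            tokens.append(unk)
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens

# Toy vocabulary, invented for this example.
vocab = {"un", "token", "iz", "able", "t", "o", "k", "e", "n"}

print(greedy_tokenize("untokenizable", vocab))  # ['un', 'token', 'iz', 'able']
print(greedy_tokenize("un?", vocab))            # ['un', '[UNK]']
```

The point of the sketch is that inference depends only on the final vocabulary, not on how that vocabulary was built, which is why the same greedy procedure can be paired with vocabularies from different training algorithms.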

artificial intelligence, computational linguistic, natural language, (16 more...)

arXiv.org Artificial Intelligence

2403.01289

Country:

Europe (1.00)
Asia > Middle East > Israel (0.14)
North America > United States > Massachusetts (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Batsuren, Khuyagbaatar, Vylomova, Ekaterina, Dankers, Verna, Delgerbaatar, Tsetsuukhei, Uzan, Omri, Pinter, Yuval, Bella, Gábor

arXiv.org Artificial Intelligence, Apr-20-2024

The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison remain an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.

machine learning, natural language, tokenization, (16 more...)

arXiv.org Artificial Intelligence

2404.13292

Country:

North America > United States (0.46)
Europe (0.46)
Asia (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Cognetta, Marco, Hiraoka, Tatsuya, Okazaki, Naoaki, Sennrich, Rico, Pinter, Yuval

arXiv.org Artificial Intelligence, Mar-30-2024

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and for improving model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.
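
The trimming step described in the abstract, replacing rare subwords with their component subwords, can be sketched as a recursive decomposition over the BPE merge table. This is a minimal illustration under assumptions, not the procedure of any particular library: the names `trim_token` and `trim_sequence`, the `merges` and `freq` dictionaries, and the threshold value are hypothetical.

```python
def trim_token(token, merges, freq, threshold):
    """Return `token` unchanged if it is frequent enough or atomic;
    otherwise split it into the two parts it was merged from and recurse."""
    if freq.get(token, 0) >= threshold or token not in merges:
        return [token]
    left, right = merges[token]
    return (trim_token(left, merges, freq, threshold)
            + trim_token(right, merges, freq, threshold))

def trim_sequence(tokens, merges, freq, threshold):
    """Apply threshold trimming to every token of an already tokenized sequence."""
    out = []
    for token in tokens:
        out.extend(trim_token(token, merges, freq, threshold))
    return out

# Toy merge table and corpus frequencies, invented for this example.
merges = {"lo": ("l", "o"), "low": ("lo", "w"), "er": ("e", "r"), "lower": ("low", "er")}
freq = {"lower": 3, "low": 50, "er": 200, "lo": 2}

print(trim_sequence(["lower"], merges, freq, threshold=10))  # ['low', 'er']
print(trim_sequence(["lo"], merges, freq, threshold=10))     # ['l', 'o']
```

Single characters never appear as keys in `merges`, so the recursion always terminates at atomic symbols even when every merged token falls below the threshold.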

artificial intelligence, machine translation, natural language, (17 more...)

arXiv.org Artificial Intelligence

2404.00397

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
