AITopics | hebrew

Collaborating Authors

hebrew

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Dahan, Noam, Kidron, Omer, Stanovsky, Gabriel

arXiv.org Artificial IntelligenceNov-19-2025

High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2511.14598

Country:

Europe (1.00)
Asia > Middle East > Israel (0.93)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry:

Media > News (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)

Add feedback

NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

Shmidman, Shaltiel, Shmidman, Avi, Koppel, Moshe

arXiv.org Artificial IntelligenceOct-24-2025

Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.20386

Country:

Europe (0.69)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

GemDetox at TextDetox CLEF 2025: Enhancing a Massively Multilingual Model for Text Detoxification on Low-resource Languages

Dang, Trung Duc Anh, D'Elia, Ferdinando Pio

arXiv.org Artificial IntelligenceOct-3-2025

As social-media platforms emerge and evolve faster than the regulations meant to oversee them, automated detoxification might serve as a timely tool for moderators to enforce safe discourse at scale. We here describe our submission to the PAN 2025 Multilingual Text Detoxification Challenge, which rewrites toxic single-sentence inputs into neutral paraphrases across 15 typologically diverse languages. Building on a 12B-parameter Gemma-3 multilingual transformer, we apply parameter-efficient LoRA SFT fine-tuning and prompting techniques like few-shot and Chain-of-Thought. Our multilingual training corpus combines 3,600 human-authored parallel pairs, 21,600 machine-translated synthetic pairs, and model-generated pairs filtered by Jaccard thresholds. At inference, inputs are enriched with three LaBSE-retrieved neighbors and explicit toxic-span annotations. Evaluated via Style Transfer Accuracy, LaBSE-based semantic preservation, and xCOMET fluency, our system ranks first on high-resource and low-resource languages. Ablations show +0.081 joint score increase from few-shot examples and +0.088 from basic CoT prompting. ANOVA analysis identifies language resource status as the strongest predictor of performance ($η^2$ = 0.667, p < 0.01).

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.0125

Country:

Europe (1.00)
North America > Mexico (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
(2 more...)

Add feedback

HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark

Cohen, Amir DN, Merhav, Hilla, Goldberg, Yoav, Tsarfaty, Reut

arXiv.org Artificial IntelligenceAug-5-2025

Current benchmarks for Hebrew Natural Language Processing (NLP) focus mainly on morpho-syntactic tasks, neglecting the semantic dimension of language understanding. To bridge this gap, we set out to deliver a Hebrew Machine Reading Comprehension (MRC) dataset, where MRC is to be realized as extractive Question Answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor: the indeterminacy and non-transparency of span boundaries in morphologically complex forms lead to annotation inconsistencies, disagreements, and flaws in standard evaluation metrics. To remedy this, we devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics that are suitable for the morphologically rich nature of the language. Our resulting benchmark, HeQ (Hebrew QA), features 30,147 diverse question-answer pairs derived from both Hebrew Wikipedia articles and Israeli tech news. Our empirical investigation reveals that standard evaluation metrics such as F1 scores and Exact Match (EM) are not appropriate for Hebrew (and other MRLs), and we propose a relevant enhancement. In addition, our experiments show low correlation between models' performance on morpho-syntactic tasks and on MRC, which suggests that models designed for the former might underperform on semantics-heavy tasks. The development and exploration of HeQ illustrate some of the challenges MRLs pose in natural language understanding (NLU), fostering progression towards more and better NLU models for Hebrew and other MRLs.

large language model, machine learning, question answering, (22 more...)

arXiv.org Artificial Intelligence

2508.01812

Country:

Asia > Middle East > Israel (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry: Education > Assessment & Standards > Student Performance (0.61)

Technology:

Information Technology > Communications > Social Media > Crowdsourcing (0.48)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
(3 more...)

Add feedback

Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark

Smiley, David M.

arXiv.org Artificial IntelligenceJul-2-2025

Identifying parallel passages in biblical Hebrew (BH) is central to biblical scholarship for understanding intertextual relationships. Traditional methods rely on manual comparison, a labor-intensive process prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings distinguishing parallel from non-parallel passages. Using cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5 excels in parallel detection, while AlephBERT demonstrates stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.24117

Country: Asia > Middle East > Israel (0.15)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

Splintering Nonconcatenative Languages for Better Tokenization

Gazit, Bar, Shmidman, Shaltiel, Shmidman, Avi, Pinter, Yuval

arXiv.org Artificial IntelligenceMar-18-2025

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay; as well as on downstream tasks using BERT-architecture models trained for Hebrew.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.14433

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

Don't Touch My Diacritics

Gorman, Kyle, Pinter, Yuval

arXiv.org Artificial IntelligenceOct-31-2024

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.

computational linguistic, normalization form, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2410.2414

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Czechia > Prague (0.04)
(12 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

MenakBERT -- Hebrew Diacriticizer

Cohen, Ido, Gidron, Jacob, Pinto, Idan

arXiv.org Artificial IntelligenceOct-3-2024

Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still present a gap in performance. We use a recently developed char-based PLM to narrowly bridge this gap. Presenting MenakBERT, a character level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We continue to show how finetuning a model for diacritizing transfers to a task such as part of speech tagging.

diacritical mark, menakbert, nakdimon, (17 more...)

arXiv.org Artificial Intelligence

2410.02417

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

A Language-agnostic Model of Child Language Acquisition

Mahon, Louis, Abend, Omri, Berger, Uri, Demuth, Katherine, Johnson, Mark, Steedman, Mark

arXiv.org Artificial IntelligenceAug-22-2024

This work reimplements a recent semantic bootstrapping child-language acquisition model, which was originally designed for English, and trains it to learn a new language: Hebrew. The model learns from pairs of utterances and logical forms as meaning representations, and acquires both syntax and word meanings simultaneously. The results show that the model mostly transfers to Hebrew, but that a number of factors, including the richer morphology in Hebrew, makes the learning slower and less robust. This suggests that a clear direction for future work is to enable the model to leverage the similarities between different word forms.

hebrew, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2408.12254

Country:

Europe > Netherlands > South Holland > Dordrecht (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback

A Language Modeling Approach to Diacritic-Free Hebrew TTS

Roth, Amit, Turetzky, Arnon, Adi, Yossi

arXiv.org Artificial IntelligenceJul-16-2024

We tackle the task of text-to-speech (TTS) in Hebrew. Traditional Hebrew contains Diacritics, which dictate the way individuals should pronounce given words, however, modern Hebrew rarely uses them. The lack of diacritics in modern Hebrew results in readers expected to conclude the correct pronunciation and understand which phonemes to use based on the context. This imposes a fundamental challenge on TTS systems to accurately map between text-to-speech. In this work, we propose to adopt a language modeling Diacritics-Free approach, for the task of Hebrew TTS. The model operates on discrete speech representations and is conditioned on a word-piece tokenizer. We optimize the proposed method using in-the-wild weakly supervised data and compare it to several diacritic-based TTS systems. Results suggest the proposed method is superior to the evaluated baselines considering both content preservation and naturalness of the generated speech. Samples can be found under the following link: pages.cs.huji.ac.il/ Figure 1: A high-level overview of the the proposed method.

dataset, representation, tokenizer, (14 more...)

arXiv.org Artificial Intelligence

2407.12206

Country: Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.70)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.61)

Add feedback