Opitz, Juri
Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models
Li, Hongji, Michail, Andrianos, Gubelmann, Reto, Clematide, Simon, Opitz, Juri
We propose the Sentence Smith framework that enables controlled and specified manipulation of text meaning. It consists of three main steps: (1) parsing a sentence into a semantic graph, (2) applying human-designed semantic manipulation rules, and (3) generating text from the manipulated graph. A final filtering step (4) ensures the validity of the applied transformation. To demonstrate the utility of Sentence Smith in an application study, we use it to generate hard negative pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can gain deeper insights into the specific strengths and weaknesses of widely used text embedding models, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that the generations produced by Sentence Smith are highly accurate.
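A minimal sketch of the kind of check such controlled hard negatives enable (illustrative only, not the paper's benchmark; the model name and sentences are placeholders): a meaning-preserving paraphrase should be scored as more similar to the anchor than a minimally manipulated hard negative, such as a negation.

```python
# Illustrative only: probe an embedding model with a paraphrase vs. a hard negative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model under test

anchor = "The chef prepared a vegetarian meal for the guests."
paraphrase = "The cook made a meat-free dish for the visitors."
hard_negative = "The chef did not prepare a vegetarian meal for the guests."  # negation shift

emb = model.encode([anchor, paraphrase, hard_negative], convert_to_tensor=True)
sim_para = util.cos_sim(emb[0], emb[1]).item()
sim_neg = util.cos_sim(emb[0], emb[2]).item()

print(f"paraphrase: {sim_para:.2f}, hard negative: {sim_neg:.2f}")
print("pair passed" if sim_para > sim_neg else "pair failed")
```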
Interpretable Text Embeddings and Text Similarity Explanation: A Primer
Opitz, Juri, Möller, Lucas, Michail, Andrianos, Clematide, Simon
Text embeddings and text embedding models are a backbone of many AI and NLP systems, particularly those involving search. However, interpretability challenges persist, especially in explaining obtained similarity scores, which is crucial for applications requiring transparency. In this paper, we give a structured overview of interpretability methods specializing in explaining those similarity scores, an emerging research area. We study the methods' individual ideas and techniques, evaluating their potential for improving interpretability of text embeddings and explaining predicted similarities.
Adapting Multilingual Embedding Models to Historical Luxembourgish
Michail, Andrianos, Raclé, Corina Julia, Opitz, Juri, Clematide, Simon
The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings. We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish, a low-resource language. We collect historical Luxembourgish news articles spanning various time periods and use GPT-4o to segment and translate them into closely related languages, creating 20,000 parallel training sentences per language pair. We further create a historical bitext mining evaluation set and find that these models struggle to perform cross-lingual search on historical Luxembourgish. To address this, we propose a simple adaptation method using in-domain training data, achieving up to 98% accuracy in cross-lingual evaluations. We release our adapted models and historical Luxembourgish-German/French bitexts to support further research.
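A minimal sketch of the top-1 bitext-mining evaluation described above (the model name is a generic multilingual baseline and the sentence pairs are contemporary stand-ins, not the historical bitexts released with the paper): encode both sides, then check whether each source sentence's nearest cross-lingual neighbor is its aligned translation.

```python
# Toy top-1 bitext mining accuracy with a multilingual sentence embedding model.
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

lb = ["D'Wieder ass haut schéin.", "Ech ginn an d'Stad.", "De Kaffi ass gutt."]
de = ["Das Wetter ist heute schön.", "Ich gehe in die Stadt.", "Der Kaffee ist gut."]

emb_lb = model.encode(lb, convert_to_tensor=True, normalize_embeddings=True)
emb_de = model.encode(de, convert_to_tensor=True, normalize_embeddings=True)

sims = util.cos_sim(emb_lb, emb_de)        # (n, n) cross-lingual similarity matrix
top1 = sims.argmax(dim=1)                  # best German match per Luxembourgish sentence
accuracy = (top1 == torch.arange(len(lb))).float().mean().item()
print(f"top-1 bitext mining accuracy: {accuracy:.2f}")
```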
A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice
Opitz, Juri
Classification systems are evaluated in countless papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly specify what they would expect from such a 'macro' metric. This is problematic, since picking a metric can affect research findings, and thus any clarity in the process should be maximized. Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics. The analysis helps us understand the metrics' underlying properties and how they align with expectations as expressed in papers. Then we reflect on the practical situation in the field and survey evaluation practice in recent shared tasks. We find that metric selection is often not supported with convincing arguments, an issue that can make a system ranking seem arbitrary. Our work provides an overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
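A toy illustration (not from the paper) of why the choice matters: on a skewed label distribution, a classifier that is biased toward the prevalent class looks strong under accuracy and micro F1 but weak under macro F1.

```python
# Toy example: class prevalence and classifier bias drive the metrics apart.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10   # class 1 is rare (low prevalence)
y_pred = [0] * 100             # a classifier biased toward the majority class

print(accuracy_score(y_true, y_pred))                              # 0.90
print(f1_score(y_true, y_pred, average="micro"))                   # 0.90 (equals accuracy here)
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47, penalizes ignoring the rare class
```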
Schroedinger's Threshold: When the AUC doesn't predict Accuracy
Opitz, Juri
The Area Under the Curve (AUC) measure seems apt for evaluating and comparing diverse models, possibly without calibration. An important application of the AUC is the evaluation and benchmarking of models that predict the faithfulness of generated text. However, we show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the accuracy actually observed in application, yielding significant changes in benchmark rankings. To paint a more realistic picture of downstream model performance (and prepare a model for actual application), we explore different calibration modes, testing calibration data and methods.
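A toy illustration (not the paper's data) of the effect: scores that rank examples perfectly but are poorly scaled give a perfect AUC, yet a naive 0.5 threshold yields poor accuracy until the threshold or the scores are calibrated.

```python
# Toy example: perfect ranking (AUC = 1.0) but poor accuracy at the default 0.5 threshold.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.51, 0.52, 0.53, 0.54, 0.56, 0.57, 0.58, 0.59])  # well ranked, badly scaled

print(roc_auc_score(y_true, scores))           # 1.0: the ranking is perfect
print(accuracy_score(y_true, scores > 0.5))    # 0.5: every example is predicted positive
print(accuracy_score(y_true, scores > 0.55))   # 1.0 with a threshold tuned on held-out data
```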
Natural Language Processing RELIES on Linguistics
Opitz, Juri, Wein, Shira, Schneider, Nathan
Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES, which encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.
On the Role of Summary Content Units in Text Summarization Evaluation
Nawrath, Marcel, Nowak, Agnieszka, Ratz, Tristan, Walenta, Danilo C., Opitz, Juri, Ribeiro, Leonardo F. R., Sedoc, João, Deutsch, Daniel, Mille, Simon, Liu, Yixin, Zhang, Lining, Gehrmann, Sebastian, Mahamood, Saad, Clinciu, Miruna, Chandu, Khyathi, Hou, Yufang
At the heart of the Pyramid evaluation method for text summarization lie human-written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim of fully automating the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions do SCUs (or their approximations) offer the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.
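A minimal sketch of the NLI-based SCU checking mentioned above (the model, sentences, and aggregation are illustrative, not the paper's exact protocol): treat the candidate summary as the premise, each SCU as a hypothesis, and count how many SCUs are entailed.

```python
# Illustrative NLI-based SCU scoring: share of SCUs entailed by the candidate summary.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # any NLI model works; this one is a common choice
tok = AutoTokenizer.from_pretrained(model_name)
nli = AutoModelForSequenceClassification.from_pretrained(model_name)
ent_id = {v.lower(): k for k, v in nli.config.id2label.items()}["entailment"]

candidate = "The city council approved the new park budget on Monday."
scus = ["The council approved a budget.",
        "The budget is for a park.",
        "The decision was made on Friday."]

entailed = 0
for scu in scus:
    inputs = tok(candidate, scu, return_tensors="pt")  # premise, hypothesis
    with torch.no_grad():
        probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    entailed += int(probs.argmax() == ent_id)

print(f"entailed SCUs: {entailed}/{len(scus)}")
```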
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
Leiter, Christoph, Opitz, Juri, Deutsch, Daniel, Gao, Yang, Dror, Rotem, Eger, Steffen
With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
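As a rough sketch of the "prompting and score extraction" setup (the prompt template and regex are illustrative and not taken from any participant system), a metric built this way formats source and output into a prompt, sends it to one of the allowed LLMs, and parses a numeric score from the response:

```python
# Toy prompt construction and score extraction; the LLM call itself is simulated.
import re

def build_prompt(source: str, translation: str) -> str:
    return (f"Rate the translation quality from 0 to 100.\n"
            f"Source: {source}\nTranslation: {translation}\nScore:")

def extract_score(llm_output: str):
    match = re.search(r"\d+(?:\.\d+)?", llm_output)
    return float(match.group()) if match else None

# In the shared task the prompt is sent to one of the allowed open LLMs;
# here we just parse a simulated response.
print(extract_score("Score: 87. The translation is fluent but drops a clause."))  # 87.0
```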
AMR4NLI: Interpretable and robust NLI measures from semantic graphs
Opitz, Juri, Wein, Shira, Steen, Julius, Frank, Anette, Schneider, Nathan
The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.
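A minimal sketch in the spirit of such substructure measures (a generic coverage score over contextualized token embeddings, not the paper's actual metric): check how well each hypothesis token is matched by some premise token, and average the best-match similarities.

```python
# Illustrative coverage score: average best-match similarity of hypothesis tokens in the premise.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]        # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

premise = "A man is playing a guitar on the street."
hypothesis = "A man plays music."

p, h = token_embeddings(premise), token_embeddings(hypothesis)
coverage = (h @ p.T).max(dim=1).values.mean().item()       # how well is the hypothesis "contained"?
print(f"hypothesis coverage in premise: {coverage:.2f}")
```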
Gzip versus bag-of-words for text classification
Opitz, Juri
KNN is a simple classifier that uses distance measurements between data points: for a given test point, we calculate its distance to every point in some labeled training set and check the labels of the K closest points (i.e., the K-Nearest Neighbors), predicting the label that we observe most frequently. Hence, it is straightforward to build a general text classifier from KNN, if we can equip it with a sensible distance measure between documents. Interestingly, recent findings [4] suggest that we can exploit compression to assess the distance between two documents, by comparing their individual compression lengths to the length of their compressed concatenation (we call this measurement gzip). With this approach, [4] show strong text classification performance across different datasets, sometimes achieving higher accuracy than trained neural classifiers such as BERT [2], especially in scenarios where only little training data is available. Against this background, it is not surprising that gzip has quickly attracted lots of attention.
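A small sketch of this setup, assuming the normalized compression distance used in [4] and a simple majority vote over the K nearest neighbors (the documents and labels below are toy placeholders):

```python
# gzip-based KNN text classification: compression length as a proxy for document distance.
import gzip
from collections import Counter

def clen(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # normalized compression distance between two documents
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_predict(test_doc, train_docs, train_labels, k=3):
    neighbors = sorted(zip((ncd(test_doc, d) for d in train_docs), train_labels))
    top_k_labels = [label for _, label in neighbors[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

train_docs = ["the match ended two to one", "stocks fell sharply today",
              "the striker scored a late goal", "the central bank raised rates"]
train_labels = ["sports", "finance", "sports", "finance"]

print(knn_predict("a last minute goal decided the game", train_docs, train_labels, k=3))
```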