AITopics | Dror, Rotem

Dror, Rotem

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs

Calderon, Nitay, Reichart, Roi, Dror, Rotem

arXiv.org Artificial Intelligence, Feb-5-2025

The "LLM-as-a-judge" paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure -- the Alternative Annotator Test (alt-test) -- that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.

annotator, large language model, machine learning, (20 more...)

2501.1097

Country:

North America > Mexico > Mexico City (0.14)
Europe > Austria > Vienna (0.14)
Asia > Middle East > UAE (0.14)
North America > United States > Hawaii (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

State of What Art? A Call for Multi-Prompt LLM Evaluation

Mizrahi, Moran, Kaplan, Guy, Malkin, Dan, Dror, Rotem, Shahaf, Dafna, Stanovsky, Gabriel

arXiv.org Artificial Intelligence, Jan-30-2024

Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.
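
The multi-prompt idea in this abstract can be illustrated with a minimal sketch, assuming per-prompt accuracy scores have already been computed. The model names, prompt names, scores, and both functions below are hypothetical, and the plain mean over prompts is just one simple aggregate, not necessarily a metric the paper proposes.

```python
from statistics import mean

# Hypothetical accuracy of two models under three paraphrased
# instruction templates for the same task (illustrative numbers only).
scores = {
    "model_a": {"p1": 0.80, "p2": 0.60, "p3": 0.70},
    "model_b": {"p1": 0.70, "p2": 0.75, "p3": 0.74},
}


def rank_by_prompt(scores, prompt):
    """Rank models, best first, using a single prompt template."""
    return sorted(scores, key=lambda m: scores[m][prompt], reverse=True)


def rank_by_mean(scores):
    """Rank models, best first, by mean score over all prompt templates."""
    return sorted(scores, key=lambda m: mean(scores[m].values()), reverse=True)


if __name__ == "__main__":
    for prompt in ("p1", "p2", "p3"):
        print(prompt, rank_by_prompt(scores, prompt))
    print("mean over prompts", rank_by_mean(scores))
```

In this toy data the single-prompt ranking flips between p1 and p2, while the ranking by mean over prompts does not depend on which single template happened to be chosen.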

large language model, machine learning, natural language, (18 more...)

2401.00595

Country: Asia > Middle East > Israel (0.14)

Genre: Research Report > New Finding (0.92)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

DMLR: Data-centric Machine Learning Research -- Past, Present and Future

Oala, Luis, Maskey, Manil, Bat-Leah, Lilith, Parrish, Alicia, Gürel, Nezihe Merve, Kuo, Tzu-Sheng, Liu, Yang, Dror, Rotem, Brajovic, Danilo, Yao, Xiaozhe, Bartolo, Max, Rojas, William A Gaviria, Hileman, Ryan, Aliment, Rainier, Mahoney, Michael W., Risdal, Meg, Lease, Matthew, Samek, Wojciech, Dutta, Debojyoti, Northcutt, Curtis G, Coleman, Cody, Hancock, Braden, Koch, Bernard, Tadesse, Girmaw Abebe, Karlaš, Bojan, Alaa, Ahmed, Dieng, Adji Bousso, Noy, Natasha, Reddi, Vijay Janapa, Zou, James, Paritosh, Praveen, van der Schaar, Mihaela, Bollacker, Kurt, Aroyo, Lora, Zhang, Ce, Vanschoren, Joaquin, Guyon, Isabelle, Mattson, Peter

arXiv.org Artificial Intelligence, Nov-21-2023

Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.

artificial intelligence, machine learning, university, (16 more...)

2311.13028

Country:

North America > United States > California (1.00)
Asia (1.00)
Europe > Netherlands (0.68)

Genre: Research Report (0.64)

Industry:

Health & Medicine (1.00)
Information Technology > Security & Privacy (0.68)
Education > Curriculum > Subject-Specific Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

Leiter, Christoph, Opitz, Juri, Deutsch, Daniel, Gao, Yang, Dror, Rotem, Eger, Steffen

arXiv.org Artificial Intelligence, Oct-30-2023

With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.

large language model, machine learning, natural language, (20 more...)

2310.19792

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East (0.67)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Zero-Shot On-the-Fly Event Schema Induction

Dror, Rotem, Wang, Haoyu, Roth, Dan

arXiv.org Artificial Intelligence, Mar-27-2023

What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information, and analyzing it. We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate in a series of experiments that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts while being more general and flexible without the need for a predefined ontology.

artificial intelligence, machine learning, natural language, (17 more...)

2210.06254

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.93)

Genre:

Research Report (0.82)
Workflow (0.69)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.93)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Human-in-the-Loop Schema Induction

Zhang, Tianyi, Tham, Isaac, Hou, Zhaoyi, Ren, Jiaxuan, Zhou, Liyang, Xu, Hainiu, Zhang, Li, Martin, Lara J., Dror, Rotem, Li, Sha, Ji, Heng, Palmer, Martha, Brown, Susan, Suchocki, Reece, Callison-Burch, Chris

arXiv.org Artificial Intelligence, Feb-25-2023

Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual editing of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces the effort of human curation thanks to our interactive interface.

machine learning, natural language, node, (18 more...)

doi: 10.18653/v1/2023.acl-demo.1

2302.13048

Country: North America > United States (1.00)

Genre: Workflow (0.68)

Industry:

Health & Medicine (0.47)
Government (0.31)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)
