Dror, Rotem
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
Calderon, Nitay, Reichart, Roi, Dror, Rotem
The "LLM-as-a-judge" paradigm employs Large Language Models (LLMs) as annotators and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure -- the Alternative Annotator Test (alt-test) -- that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming open-source LLMs, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
State of What Art? A Call for Multi-Prompt LLM Evaluation
Mizrahi, Moran, Kaplan, Guy, Malkin, Dan, Dror, Rotem, Shahaf, Dafna, Stanovsky, Gabriel
Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Oala, Luis, Maskey, Manil, Bat-Leah, Lilith, Parrish, Alicia, Gürel, Nezihe Merve, Kuo, Tzu-Sheng, Liu, Yang, Dror, Rotem, Brajovic, Danilo, Yao, Xiaozhe, Bartolo, Max, Rojas, William A Gaviria, Hileman, Ryan, Aliment, Rainier, Mahoney, Michael W., Risdal, Meg, Lease, Matthew, Samek, Wojciech, Dutta, Debojyoti, Northcutt, Curtis G, Coleman, Cody, Hancock, Braden, Koch, Bernard, Tadesse, Girmaw Abebe, Karlaš, Bojan, Alaa, Ahmed, Dieng, Adji Bousso, Noy, Natasha, Reddi, Vijay Janapa, Zou, James, Paritosh, Praveen, van der Schaar, Mihaela, Bollacker, Kurt, Aroyo, Lora, Zhang, Ce, Vanschoren, Joaquin, Guyon, Isabelle, Mattson, Peter
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
Leiter, Christoph, Opitz, Juri, Deutsch, Daniel, Gao, Yang, Dror, Rotem, Eger, Steffen
With an increasing number of parameters and pre-training data, generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation. Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset. Notably, despite the task's restrictions, the best-performing systems achieve results on par with or even surpassing recent reference-free metrics developed using larger models, including GEMBA and Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human evaluation of the plausibility of explanations given by the LLMs.
Zero-Shot On-the-Fly Event Schema Induction
Dror, Rotem, Wang, Haoyu, Roth, Dan
What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information, and analyzing it. We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate in a series of experiments that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts while being more general and flexible without the need for a predefined ontology.
Human-in-the-Loop Schema Induction
Zhang, Tianyi, Tham, Isaac, Hou, Zhaoyi, Ren, Jiaxuan, Zhou, Liyang, Xu, Hainiu, Zhang, Li, Martin, Lara J., Dror, Rotem, Li, Sha, Ji, Heng, Palmer, Martha, Brown, Susan, Suchocki, Reece, Callison-Burch, Chris
Schema induction builds a graph representation explaining how events unfold in a scenario. Existing approaches have been based on information retrieval (IR) and information extraction (IE), often with limited human curation. We demonstrate a human-in-the-loop schema induction system powered by GPT-3. We first describe the different modules of our system, including prompting to generate schematic elements, manual editing of those elements, and conversion of those into a schema graph. By qualitatively comparing our system to previous ones, we show that our system not only transfers to new domains more easily than previous approaches, but also reduces the effort of human curation thanks to our interactive interface.