AITopics | Eden, Lilach

Collaborating Authors

Eden, Lilach

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Survey on Evaluation of LLM-based Agents

Yehudai, Asaf, Eden, Lilach, Li, Alan, Uziel, Guy, Zhao, Yilun, Bar-Haim, Roy, Cohan, Arman, Shmueli-Scheuer, Michal

arXiv.org Artificial IntelligenceMar-20-2025

The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.16416

Country:

Europe > Italy (0.28)
Asia > Middle East > UAE (0.14)

Genre: Overview (1.00)

Industry:

Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

JuStRank: Benchmarking LLM Judges for System Ranking

Gera, Ariel, Boni, Odellia, Perlitz, Yotam, Bar-Haim, Roy, Eden, Lilach, Yehudai, Asaf

arXiv.org Artificial IntelligenceDec-12-2024

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.09569

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies

Cattan, Arie, Hope, Tom, Downey, Doug, Bar-Haim, Roy, Eden, Lilach, Kantor, Yoav, Dagan, Ido

arXiv.org Artificial IntelligenceNov-19-2023

Various NLP tasks require a complex hierarchical structure over nodes, where each node is a cluster of items. Examples include generating entailment graphs, hierarchical cross-document coreference resolution, annotating event and subevent relations, etc. To enable efficient annotation of such hierarchical structures, we release CHAMP, an open source tool allowing to incrementally construct both clusters and hierarchy simultaneously over any type of texts. This incremental approach significantly reduces annotation time compared to the common pairwise annotation approach and also guarantees maintaining transitivity at the cluster and hierarchy levels. Furthermore, CHAMP includes a consolidation mode, where an adjudicator can easily compare multiple cluster hierarchy annotations and resolve disagreements.

artificial intelligence, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2311.11301

Country:

Europe (1.00)
North America > United States > Texas (0.14)
North America > United States > Michigan (0.14)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media > Crowdsourcing (0.47)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization

Cattan, Arie, Eden, Lilach, Kantor, Yoav, Bar-Haim, Roy

arXiv.org Artificial IntelligenceJun-6-2023

Key Point Analysis (KPA) has been recently proposed for deriving fine-grained insights from collections of textual comments. KPA extracts the main points in the data as a list of concise sentences or phrases, termed key points, and quantifies their prevalence. While key points are more expressive than word clouds and key phrases, making sense of a long, flat list of key points, which often express related ideas in varying levels of granularity, may still be challenging. To address this limitation of KPA, we introduce the task of organizing a given set of key points into a hierarchy, according to their specificity. Such hierarchies may be viewed as a novel type of Textual Entailment Graph. We develop ThinkP, a high quality benchmark dataset of key point hierarchies for business and product reviews, obtained by consolidating multiple annotations. We compare different methods for predicting pairwise relations between key points, and for inferring a hierarchy from these pairwise predictions. In particular, for the task of computing pairwise key point relations, we achieve significant gains over existing strong baselines by applying directional distributional similarity methods to a novel distributional representation of key points, and further boost performance via weak supervision.

key point, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2306.03853

Country:

Europe (1.00)
North America > United States > Michigan (0.14)
North America > United States > Maryland (0.14)
North America > United States > Louisiana (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback