Agents
ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
Jin, Tengjun, Zhu, Yuxuan, Kang, Daniel
Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ELT-Bench.
Ctrl-Z: Controlling AI Agents via Resampling
Bhatt, Aryan, Rushing, Cody, Kaufman, Adam, Tracy, Tyler, Georgiev, Vasil, Matolcsi, David, Khan, Akbir, Shlegeris, Buck
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.
Characterizing LLM-driven Social Network: The Chirper.ai Case
Zhu, Yiming, He, Yupeng, Haq, Ehsan-Ul, Tyson, Gareth, Hui, Pan
Large language models (LLMs) demonstrate the ability to simulate human decision-making processes, enabling their use as agents in modeling sophisticated social networks, both offline and online. Recent research has explored collective behavioral patterns and structural characteristics of LLM agents within simulated networks. However, empirical comparisons between LLM-driven and human-driven online social networks remain scarce, limiting our understanding of how LLM agents differ from human users. This paper presents a large-scale analysis of Chirper.ai, an X/Twitter-like social network entirely populated by LLM agents, comprising over 65,000 agents and 7.7 million AI-generated posts. For comparison, we collect a parallel dataset from Mastodon, a human-driven decentralized social network, with over 117,000 users and 16 million posts. We examine key differences between LLM agents and humans in posting behaviors, abusive content, and social network structures. Our findings provide critical insights into the evolving landscape of online social network analysis in the AI era, offering a comprehensive profile of LLM agents in social simulations.
Can Competition Enhance the Proficiency of Agents Powered by Large Language Models in the Realm of News-driven Time Series Forecasting?
Zhang, Yuxuan, Feng, Yangyang, Li, Daifeng, Zhang, Kexin, Chen, Junlan, Deng, Bowen
Multi-agents-based news-driven time series forecasting is considered as a potential paradigm shift in the era of large language models (LLMs). The challenge of this task lies in measuring the influences of different news events towards the fluctuations of time series. This requires agents to possess stronger abilities of innovative thinking and the identifying misleading logic. However, the existing multi-agent discussion framework has limited enhancement on time series prediction in terms of optimizing these two capabilities. Inspired by the role of competition in fostering innovation, this study embeds a competition mechanism within the multi-agent discussion to enhance agents' capability of generating innovative thoughts. Furthermore, to bolster the model's proficiency in identifying misleading information, we incorporate a fine-tuned small-scale LLM model within the reflective stage, offering auxiliary decision-making support. Experimental results confirm that the competition can boost agents' capacity for innovative thinking, which can significantly improve the performances of time series prediction. Similar to the findings of social science, the intensity of competition within this framework can influence the performances of agents, providing a new perspective for studying LLMs-based multi-agent systems.
C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation
Zhang, Xu, Liu, Zhifei, Wang, Jiahao, Zhang, Huixuan, Xu, Fan, Zhang, Junzhe, Wan, Xiaojun
Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.
PestMA: LLM-based Multi-Agent System for Informed Pest Management
Shi, Hongrui, Li, Shunbao, Yuan, Zhipeng, Yang, Po
Effective pest management is complex due to the need for accurate, context-specific decisions. Recent advancements in large language models (LLMs) open new possibilities for addressing these challenges by providing sophisticated, adaptive knowledge acquisition and reasoning. However, existing LLM-based pest management approaches often rely on a single-agent paradigm, which can limit their capacity to incorporate diverse external information, engage in systematic validation, and address complex, threshold-driven decisions. To overcome these limitations, we introduce PestMA, an LLM-based multi-agent system (MAS) designed to generate reliable and evidence-based pest management advice. Building on an editorial paradigm, PestMA features three specialized agents, an Editor for synthesizing pest management recommendations, a Retriever for gathering relevant external data, and a Validator for ensuring correctness. Evaluations on real-world pest scenarios demonstrate that PestMA achieves an initial accuracy of 86.8% for pest management decisions, which increases to 92.6% after validation. These results underscore the value of collaborative agent-based workflows in refining and validating decisions, highlighting the potential of LLM-based multi-agent systems to automate and enhance pest management processes.
Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning
While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.
Adapting Robot's Explanation for Failures Based on Observed Human Behavior in Human-Robot Collaboration
Naoum, Andreas, Khanna, Parag, Yadollahi, Elmira, Bjรถrkman, Mรฅrten, Smith, Christian
Adapting Robot's Explanation for Failures Based on Observed Human Behavior in Human-Robot Collaboration Andreas Naoum 1, Parag Khanna 1, Elmira Y adollahi 2, M arten Bj orkman 1, and Christian Smith 1 Abstract -- This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset [1] that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study [2], we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor to predict human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior . The promising results from this evaluation indicate the potential of this research in adapting a robot's explanations for failures to enhance the collaborative experience. I NTRODUCTION Recent advancements in robotics have enabled robots to collaborate with humans in a variety of tasks [3]. However, the uncertain nature of environments in which robots operate often leads to failures [4], [5]. In instances where robots encounter failures, human intervention in certain cases can easily troubleshoot the problem efficiently and effectively [2], [4], [6]. Therefore, a crucial aspect of this collaboration is the robot's ability to communicate when a failure occurs, explain why the failure happened, and, if possible, suggest a course of action for resolution. This communication ability is not only essential for successful collaboration but also for building rapport and trust [7], [8] in robots.
Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference
Matsui, Yuta, Yamaki, Ryosuke, Ueda, Ryo, Shinagawa, Seitaro, Taniguchi, Tadahiro
We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.
Unification of Consensus-Based Multi-Objective Optimization and Multi-Robot Path Planning
Wozniak Abstract --Multi-agent systems seeking consensus may also have other objective functions to optimize, requiring the research of multi-objective optimization in consensus. Several recent publications have explored this domain using various methods such as weighted-sum optimization and penalization methods. This paper reviews the state of the art for consensus-based multi-objective optimization, poses a multi-agent lunar rover exploration problem seeking consensus and maximization of explored area, and achieves optimal edge weights and steering angles by applying SQP algorithms. I NTRODUCTION AND M OTIVATION A. Background Lunar exploration is an increasingly relevant pursuit in the modern space era. The four phases of Space Development Theory (SDT) are exploration, expansion, exploitation, and exclusion [1]. For private and government-backed space entities alike, all four phases of space development are intertwined with pursuing a long-term presence on the moon. Establishing this presence can enhance the United States' economic position by achieving a net-positive economic benefit from the resources offered by the Moon and beyond. Several autonomy & control challenges are associated with the establishment of an enduring presence on the moon. Autonomy is especially relevant because unmanned exploration offers increased efficiency, enabling cooperative completion of exploration without continuous human intervention. This importance is evidenced by NASA's pursuit of a cooperative trio of rovers that can cooperate without direct input from mission controllers [2]. To this end, further research in autonomous algorithms for unmanned rovers would prove worthwhile for future exploration. The assembly of a rover formation without continuous human input can be made possible by the alignment problem.