AITopics | agent response

How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs

Estornell, Andrew, Ton, Jean-Francois, Taufiq, Muhammad Faaiz, Li, Hang

arXiv.org Artificial Intelligence, Jul-15-2025

Large Language Models (LLMs) have achieved strong performance on a wide range of complex reasoning tasks, yet further gains are often possible by leveraging the complementary strengths of multiple models. While multi-agent frameworks can improve solution quality by leveraging multiple LLMs, existing methods are often computationally expensive, both at training and inference time. In this work, we introduce a hierarchical multi-agent framework that addresses these challenges by training only a single leader LLM to coordinate a team of untrained peer agents. To this end, we propose Multi-agent guided Leader Policy Optimization (MLPO), a novel approach which trains the leader to evaluate and synthesize agent responses without auxiliary value networks or explicit agent feedback. Leaders trained with MLPO exhibit improved performance not only when interacting with the agent team at inference time, but also enjoy improved performance when deployed in single-agent settings without the team. Empirical results on Big-Bench Hard (BBH), MATH, and MMLU demonstrate that our framework achieves substantial performance improvements over both single-agent and multi-agent baselines. Our results highlight the effectiveness and efficiency of training a single, flexible leader for collaborative reasoning in multi-agent LLM systems.

arxiv preprint arxiv, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.0896

Country: Asia (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

Alkhouli, Tamer, Margatina, Katerina, Gung, James, Shu, Raphael, Zaghi, Claudia, Sunkara, Monica, Zhang, Yi

arXiv.org Artificial Intelligence, Jun-3-2025

We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations and leverage more than 20 APIs successfully, other models struggle with longer context or when increasing the number of APIs. We also report that the performance on chained function-calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%) followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
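As a rough illustration of what off-policy turn-level evaluation of function calling can look like, the sketch below scores a predictor one user turn at a time while always feeding it the reference history, so an early mistake does not contaminate later turns. The abstract does not specify the scoring rule; exact match on function name and arguments is an assumption made here, and `FunctionCall`, `make_call`, `turn_level_accuracy`, `toy_predict`, and the sample conversation are all hypothetical names and data, not part of CONFETTI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCall:
    name: str
    arguments: tuple  # sorted (key, value) pairs, so equal calls compare equal


def make_call(name, **arguments):
    """Build a FunctionCall whose arguments are order-independent."""
    return FunctionCall(name, tuple(sorted(arguments.items())))


def turn_level_accuracy(reference_turns, predict):
    """Score each user turn independently against its reference call.

    Off-policy: the history passed to `predict` is built from the reference
    calls, never from the predictor's own earlier outputs.
    Exact match is an assumed scoring rule, not the benchmark's metric.
    """
    history = []
    correct = 0
    for user_utterance, reference_call in reference_turns:
        predicted = predict(list(history), user_utterance)
        correct += int(predicted == reference_call)
        history.append((user_utterance, reference_call))
    return correct / len(reference_turns)


# Hypothetical three-turn conversation with a goal correction in turn 2.
conversation = [
    ("Book a table for two at 7pm", make_call("book_table", party_size=2, time="19:00")),
    ("Actually make it 8pm", make_call("update_booking", time="20:00")),
    ("What's the weather there?", make_call("get_weather", location="restaurant")),
]


def toy_predict(history, user_utterance):
    # Stand-in for an LLM: a keyword lookup that mishandles the goal correction.
    if "Book" in user_utterance:
        return make_call("book_table", party_size=2, time="19:00")
    if "weather" in user_utterance:
        return make_call("get_weather", location="restaurant")
    return make_call("book_table", party_size=2, time="20:00")


score = turn_level_accuracy(conversation, toy_predict)
print(f"turn-level accuracy: {score:.2f}")
```

Because the history is always the reference one, the toy predictor is still scored fairly on the third turn even though it got the second turn wrong.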

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.01859

Country:

Asia (0.28)
North America (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

How Strategic Agents Respond: Comparing Analytical Models with LLM-Generated Responses in Strategic Classification

Xie, Tian, Rauch, Pavan, Zhang, Xueru

arXiv.org Artificial Intelligence, Jan-19-2025

When machine learning (ML) algorithms are used to automate human-related decisions, human agents may gain knowledge of the decision policy and behave strategically to obtain desirable outcomes. Strategic Classification (SC) has been proposed to address the interplay between agents and decision-makers. Prior work on SC has relied on assumptions that agents are perfectly or approximately rational, responding to decision policies by maximizing their utilities. Verifying these assumptions is challenging due to the difficulty of collecting real-world agent responses. Meanwhile, the growing adoption of large language models (LLMs) makes it increasingly likely that human agents in SC settings will seek advice from these tools. We propose using strategic advice generated by LLMs to simulate human agent responses in SC. Specifically, we examine five critical SC scenarios -- hiring, loan applications, school admissions, personal income, and public assistance programs -- and simulate how human agents with diverse profiles seek advice from LLMs. We then compare the resulting agent responses with the best responses generated by existing theoretical models. Our findings reveal that: (i) LLMs and theoretical models generally lead to agent score or qualification changes in the same direction across most settings, with both achieving similar levels of fairness; (ii) state-of-the-art commercial LLMs (e.g., GPT-3.5, GPT-4) consistently provide helpful suggestions, though these suggestions typically do not result in maximal score or qualification improvements; and (iii) LLMs tend to produce more diverse agent responses, often favoring more balanced effort allocation strategies. These results suggest that theoretical models align with LLMs to some extent and that leveraging LLMs to simulate more realistic agent responses offers a promising approach to designing trustworthy ML systems.
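For readers unfamiliar with the "best response" of a rational agent, the following is a minimal sketch of one textbook formulation, not the specific models compared in the paper. It assumes a linear decision policy that accepts when `w . x + b >= 0`, a fixed reward for acceptance, and a quadratic cost `cost_coeff * ||x' - x||^2` for changing features; the function name `best_response` and all parameter values are hypothetical.

```python
import math


def best_response(x, w, b, cost_coeff, reward=1.0):
    """Utility-maximizing feature change under an assumed linear policy.

    The agent is accepted when w . x + b >= 0 and pays
    cost_coeff * ||x' - x||^2 to move from x to x'.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score >= 0:
        return list(x)  # already accepted, so no effort is worthwhile

    w_norm = math.sqrt(sum(wi * wi for wi in w))
    distance = -score / w_norm  # Euclidean distance to the decision boundary
    if cost_coeff * distance ** 2 > reward:
        return list(x)  # crossing costs more than acceptance is worth

    # Cheapest accepted point: project onto the boundary along w.
    return [xi + distance * wi / w_norm for xi, wi in zip(x, w)]


w, b = [1.0, 1.0], -2.0
near = best_response([0.5, 0.5], w, b, cost_coeff=1.0)   # moves to the boundary
far = best_response([-2.0, -2.0], w, b, cost_coeff=1.0)  # stays put
print(near, far)
```

Under these assumptions every agent either stays put or moves exactly to the boundary along `w`, which is the kind of maximal, single-direction adjustment that finding (iii) contrasts with the more balanced effort allocation suggested by LLMs.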

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.16355

Country:

Asia > Middle East > Israel > Southern District > Eilat (0.04)
North America > United States > Ohio (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Setting > Higher Education (0.49)
Banking & Finance > Loans (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

Chi, Jianfeng, Karn, Ujjwal, Zhan, Hongyuan, Smith, Eric, Rando, Javier, Zhang, Yiming, Plawiak, Kate, Coudert, Zacharie Delpierre, Upasani, Kartikeya, Pasupuleti, Mahesh

arXiv.org Artificial Intelligence, Nov-15-2024

The past few years have witnessed an unprecedented improvement in the capabilities of Large Language Models (LLMs), driven by the success in scaling up autoregressive language modeling in terms of data, model size, and the amount of compute used for training (Kaplan et al., 2020). LLMs have demonstrated exceptional linguistic abilities (Brown, 2020; Achiam et al., 2023), general tool use (Schick et al., 2024; Cai et al., 2023), and commonsense reasoning (Wei et al., 2022; OpenAI, 2024), among other impressive capabilities. The success of LLMs as general-purpose assistants motivates research and development to extend instruction-tuning to the vision-language multimodal space (Liu et al., 2023; Gemini Team, 2023). These vision-language multimodal models, which can process and generate both text and images, also achieve human-expert performance on a wide range of tasks, such as (document) visual question answering (Antol et al., 2015; Mathew et al., 2021), image captioning (Lin et al., 2014), and image-text retrieval (Plummer et al., 2015). While these vision-language multimodal models hold tremendous promise for many applications, they should be used along with proper system guardrails to ensure safe and responsible deployment, because they can generate or propagate harmful content when interacting with online users. However, most existing guardrails (Inan et al., 2023; Llama Team, 2024b,a; Yuan et al., 2024; Ghosh et al., 2024) for the interaction (e.g., conversation) between humans and AI agents are text-only: conversation data involving other modalities, such as images, cannot be used as inputs for such guardrails. This calls for a safeguard tool for classifying safety risks in prompts and responses for conversations with multimodal contents involved. 
In this work, we introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involve image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and multimodal LLM responses (response classification). Unlike text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts.

large language model, llama guard 3, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2411.10414

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)

Genre: Research Report (0.82)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

Madmoni, Lior, Zait, Amir, Labzovsky, Ilia, Karmon, Danny

arXiv.org Artificial Intelligence, Sep-22-2024

Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., "design a vegetarian meal plan below 1800 calories". Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, specifically one capable of validating the satisfaction of constraints in the agent's response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint's satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is "satisfied". Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.
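To make concrete why such a constraint has to be validated over the response as a whole, here is a small sketch of the meal-plan example from the abstract. It assumes the items and their calorie counts have already been extracted from the agent's response; the function name `check_calorie_constraint` and the calorie figures are illustrative inventions, not data from the ACS dataset.

```python
def check_calorie_constraint(meals, max_calories):
    """Return (satisfied, total) for a 'below max_calories' constraint.

    No single item decides the outcome: the total over every item in
    the response must be computed before the constraint can be judged.
    """
    total = sum(calories for _, calories in meals)
    return total < max_calories, total


# Illustrative plan; the calorie values are made up for this example.
plan = [
    ("oatmeal with berries", 350),
    ("lentil soup", 450),
    ("vegetable stir-fry with tofu", 600),
    ("yogurt", 150),
]

satisfied, total = check_calorie_constraint(plan, max_calories=1800)
print(satisfied, total)

# Adding one more item flips the label, although each item alone looks harmless.
extended_ok, extended_total = check_calorie_constraint(
    plan + [("fruit smoothie", 300)], max_calories=1800
)
print(extended_ok, extended_total)
```

An LLM evaluator has to perform the same extraction, summation, and comparison implicitly, which is where the abstract locates the reasoning and arithmetic errors.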

arxiv preprint arxiv, constraint, dataset, (14 more...)

arXiv.org Artificial Intelligence

2409.14371

Country:

North America > United States > Nevada > Clark County > Las Vegas (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Consumer Health (1.00)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Multi-Document Grounded Multi-Turn Synthetic Dialog Generation

Lee, Young-Suk, Gunasekara, Chulaka, Contractor, Danish, Astudillo, Ramón Fernandez, Florian, Radu

arXiv.org Artificial Intelligence, Sep-17-2024

As instruction-tuned language models have proven highly effective to generalize to new tasks (Chung et al., 2022; Wei et al., 2021; Ouyang et al., 2022; Mishra et al., 2022; Wang et al., 2022b), there has been growing interest to acquire synthetic data sets generated from pre-trained language models with minimal or no human supervision (Honovich et al., 2022; Wang et al., 2023; Xu et al., 2023; Lee et al., 2023). While there has been an exploration of synthetic data generation for persona-grounded dialog generation (Jang et al., 2022; Bao et al., 2023), ...

For multi-document grounded dialog generation, user queries and agent answers are based on top-k retrieved passages. In particular, we generate an initial user query from a single document source and generate the agent answer from top-k passages retrieved on the initial user query. Subsequent user queries and all agent answers are grounded on the retrieved passages and dialog history. We use a series of carefully designed prompts to ensure generated agent answers continue to remain meaningful in the presence of retrieved passages, often noisier than human generated documents.

dialog, information, query, (16 more...)

arXiv.org Artificial Intelligence

2409.115

Country:

North America > Dominican Republic (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry:

Banking & Finance > Trading (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

PersonaGym: Evaluating Persona Agents and LLMs

Samuel, Vinay, Zou, Henry Peng, Zhou, Yue, Chaudhari, Shreyas, Kalyan, Ashwin, Rajpurohit, Tanmay, Deshpande, Ameet, Narasimhan, Karthik, Murahari, Vishvak

arXiv.org Artificial Intelligence, Jul-28-2024

Persona agents, which are LLM agents that act according to an assigned persona, have demonstrated impressive contextual response capabilities across various applications. These persona agents offer significant enhancements across diverse sectors, such as education, healthcare, and entertainment, where model developers can align agent responses to different user requirements, thereby broadening the scope of agent applications. However, evaluating persona agent performance is incredibly challenging due to the complexity of assessing persona adherence in free-form interactions across various environments that are relevant to each persona agent. We introduce PersonaGym, the first dynamic evaluation framework for assessing persona agents, and PersonaScore, the first automated human-aligned metric grounded in decision theory for comprehensive large-scale evaluation of persona agents. Our evaluation of 6 open and closed-source LLMs, using a benchmark encompassing 200 personas and 10,000 questions, reveals significant opportunities for advancement in persona agent capabilities across state-of-the-art models. For example, Claude 3.5 Sonnet shows only a 2.97% relative improvement in PersonaScore over GPT 3.5 despite being a much more advanced model. Importantly, we find that increased model size and complexity do not necessarily imply enhanced persona agent capabilities, thereby highlighting the pressing need for algorithmic and architectural invention towards faithful and performant persona agents.

agent, persona, persona agent, (15 more...)

arXiv.org Artificial Intelligence

2407.18416

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
South America > Brazil (0.04)
North America > United States > New York (0.04)
(20 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

Non-linear Welfare-Aware Strategic Learning

Xie, Tian, Zhang, Xueru

arXiv.org Artificial Intelligence, May-2-2024

This paper studies algorithmic decision-making in the presence of strategic individual behaviors, where an ML model is used to make decisions about human agents and the latter can adapt their behavior strategically to improve their future data. Existing results on strategic learning have largely focused on the linear setting where agents with linear labeling functions best respond to a (noisy) linear decision policy. Instead, this work focuses on general non-linear settings where agents respond to the decision policy with only "local information" of the policy. Moreover, we simultaneously consider the objectives of maximizing decision-maker welfare (model prediction accuracy), social welfare (agent improvement caused by strategic behaviors), and agent welfare (the extent that ML underestimates the agents). We first generalize the agent best response model in previous works to the non-linear setting, then reveal the compatibility of welfare objectives. We show that the three welfare objectives can attain the optimum simultaneously only under restrictive conditions which are challenging to achieve in non-linear settings. The theoretical results imply that existing works solely maximizing the welfare of a subset of parties inevitably diminish the welfare of the others. We thus claim the necessity of balancing the welfare of each party in non-linear settings and propose an irreducible optimization algorithm suitable for general strategic learning. Experiments on synthetic and real data validate the proposed algorithm.

agent, social welfare, welfare, (15 more...)

arXiv.org Artificial Intelligence

2405.0181

Country:

Asia > Middle East > Israel > Southern District > Eilat (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
North America > United States > Ohio (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Banking & Finance (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Self-Refinement of Language Models from External Proxy Metrics Feedback

Ramji, Keshav, Lee, Young-Suk, Astudillo, Ramón Fernandez, Sultan, Md Arafat, Naseem, Tahira, Munawar, Asim, Florian, Radu, Roukos, Salim

arXiv.org Artificial Intelligence, Feb-27-2024

It is often desirable for Large Language Models (LLMs) to capture multiple objectives when providing a response. In document-grounded response generation, for example, agent responses are expected to be relevant to a user's query while also being grounded in a given document. In this paper, we introduce Proxy Metric-based Self-Refinement (ProMiSe), which enables an LLM to refine its own initial response along key dimensions of quality guided by external metrics feedback, yielding an overall better final response. ProMiSe leverages feedback on response quality through principle-specific proxy metrics, and iteratively refines its response one principle at a time. We apply ProMiSe to open source language models Flan-T5-XXL and Llama-2-13B-Chat, to evaluate its performance on document-grounded question answering datasets, MultiDoc2Dial and QuAC, demonstrating that self-refinement improves response quality. We further show that fine-tuning Llama-2-13B-Chat on the synthetic dialogue data generated by ProMiSe yields significant performance improvements over the zero-shot baseline as well as a supervised fine-tuned model on human annotated data.

proxy metric, refinement, threshold, (12 more...)

arXiv.org Artificial Intelligence

2403.00827

Country:

North America > United States > Utah (0.04)
North America > United States > Pennsylvania (0.04)
North America > Dominican Republic (0.04)

Genre: Research Report (0.40)

Industry: Government (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI

Clarke, Christopher, Krishnamurthy, Karthik, Talamonti, Walter, Kang, Yiping, Tang, Lingjia, Mars, Jason

arXiv.org Artificial Intelligence, Jan-13-2024

Conversational agents have been gaining increasing popularity in recent years. Influenced by the widespread adoption of task-oriented agents such as Apple Siri and Amazon Alexa, these agents are being deployed into various applications to enhance user experience. Although these agents promote "ask me anything" functionality, they are typically built to focus on a single or finite set of expertise. Given that complex tasks often require more than one expertise, this results in the users needing to learn and adopt multiple agents. One approach to alleviate this is to abstract the orchestration of agents in the background. However, this removes the option of choice and flexibility, potentially harming the ability to complete tasks. In this paper, we explore these different interaction experiences (one agent for all) vs (user choice of agents) for conversational AI. We design prototypes for each, systematically evaluating their ability to facilitate task completion. Through a series of conducted user studies, we show that users have a significant preference for abstracting agent orchestration in both system usability and system performance. Additionally, we demonstrate that this mode of interaction is able to provide quality responses that are rated within 1% of human-selected answers.

agent, participant, query, (15 more...)

arXiv.org Artificial Intelligence

2401.07123

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > The Bahamas (0.14)
(12 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.67)

Industry:

Consumer Products & Services (0.93)
Information Technology > Services (0.66)
Automobiles & Trucks > Manufacturer (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
