Goto

Collaborating Authors

 caller


InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Zhu, Zhenghao, Song, Yuanfeng, Chen, Xin, Liu, Chengzhong, Cui, Yakun, Cao, Caleb Chen, Han, Sirui, Guo, Yike

arXiv.org Artificial Intelligence

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent of large language models (LLMs) and multi-agent systems, more and more researchers are making use of these technologies for insight discovery. However, there are few benchmarks for evaluating insight discovery capabilities. As one of the most comprehensive existing frameworks, InsightBench also suffers from many critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues may significantly affect the quality of data and the evaluation of agents. To address these issues, we thoroughly investigate shortcomings in InsightBench and propose essential criteria for a high-quality insight benchmark. Regarding this, we develop a data-curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research in this promising direction.


Towards Leveraging Sequential Structure in Animal Vocalizations

Sarkar, Eklavya, -Doss, Mathew Magimai.

arXiv.org Artificial Intelligence

Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and gumbel-softmax vector quantization of extracted self-supervised speech model representations can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performances, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.


DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services

Li, Xiang, Yu, Huizi, Wang, Wenkong, Wu, Yiran, Zhou, Jiayan, Hua, Wenyue, Lin, Xinxin, Tan, Wenjia, Zhu, Lexuan, Chen, Bingyi, Chen, Guang, Chen, Ming-Li, Zhou, Yang, Li, Zhao, Assimes, Themistocles L., Zhang, Yongfeng, Wu, Qingyun, Ma, Xin, Li, Lingyao, Fan, Lizhou

arXiv.org Artificial Intelligence

Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwe's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94 % contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91 % of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7 % neutral sentiment; 90.4 % neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0 % polite; 0 % impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.


Apple's Best New iOS 26 Feature Has Been on Pixel Phones for Years

WIRED

Apple's Best New iOS 26 Feature Has Been on Pixel Phones for Years The iPhone's new software screens your calls using machine intelligence. Neat, but Google had the feature first--just like so many other features that rely on AI to work. Call Screening on an iPhone. Ever since I was a child, I've despised answering the phone when an unknown number calls. Who could be on the other end?


British 999 caller's voice cloned by Russian network using AI

BBC News

A BBC Verify investigation has revealed that the identities of British public sector workers have been cloned using AI by a Russian-linked disinformation campaign. The BBC's Olga Robinson has tracked down and spoken to an emergency medical advisor from Preston in England, who was shocked to learn his voice had been faked in a video campaign spreading fear ahead of Poland's presidential election earlier this year.


LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration

Chen, Zirong, An, Ziyan, Reynolds, Jennifer, Mullen, Kristin, Martini, Stephen, Ma, Meiyi

arXiv.org Artificial Intelligence

Emergency response services are critical to public safety, with 9-1-1 call-takers playing a key role in ensuring timely and effective emergency operations. To ensure call-taking performance consistency, quality assurance is implemented to evaluate and refine call-takers' skillsets. However, traditional human-led evaluations struggle with high call volumes, leading to low coverage and delayed assessments. We introduce LogiDebrief, an AI-driven framework that automates traditional 9-1-1 call debriefing by integrating Signal-Temporal Logic (STL) with Large Language Models (LLMs) for fully-covered rigorous performance evaluation. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines. It employs a three-step verification process: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure compliance; and (3) automated aggregation of results into quality assurance reports. Beyond its technical contributions, LogiDebrief has demonstrated real-world impact. Successfully deployed at Metro Nashville Department of Emergency Communications, it has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement. Empirical evaluation with real-world data confirms its accuracy, while a case study and extensive user study highlight its effectiveness in enhancing call-taking performance.


Performant LLM Agentic Framework for Conversational AI

Casella, Alex, Wang, Wayne

arXiv.org Artificial Intelligence

The rise of Agentic applications and automation in the Voice AI industry has led to an increased reliance on Large Language Models (LLMs) to navigate graph-based logic workflows composed of nodes and edges. However, existing methods face challenges such as alignment errors in complex workflows and hallucinations caused by excessive context size. To address these limitations, we introduce the Performant Agentic Framework (PAF), a novel system that assists LLMs in selecting appropriate nodes and executing actions in order when traversing complex graphs. PAF combines LLM-based reasoning with a mathematically grounded vector scoring mechanism, achieving both higher accuracy and reduced latency. Our approach dynamically balances strict adherence to predefined paths with flexible node jumps to handle various user inputs efficiently. Experiments demonstrate that PAF significantly outperforms baseline methods, paving the way for scalable, real-time Conversational AI systems in complex business environments.


Multi Agent based Medical Assistant for Edge Devices

Gawade, Sakharam, Akhouri, Shivam, Kulkarni, Chinmay, Samant, Jagdish, Sahu, Pragya, Aastik, null, Pahal, Jai, Meher, Saswat

arXiv.org Artificial Intelligence

Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an ondevice, multi-agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task-specific agents to optimize resources, ensure scalability and high performance. Our proposed system acts as a one-stop solution for health care needs with features like appointment booking, health monitoring, medication reminders, and daily health reporting. Powered by the Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an average RougeL score of 85.5 for planning and 96.5 for calling for our tasks while being lightweight for on-device deployment. This innovative approach combines the benefits of ondevice systems with multi-agent architectures, paving the way for user-centric healthcare solutions.


A formal implementation of Behavior Trees to act in robotics

Ingrand, Felix

arXiv.org Artificial Intelligence

Behavior Trees (BT) are becoming quite popular as an Acting component of autonomous robotic systems. We propose to define a formal semantics to BT by translating them to a formal language which enables us to perform verification of programs written with BT, as well as runtime verification while these BT execute. This allows us to formally verify BT correctness without requiring BT programmers to master formal language and without compromising BT most valuable features: modularity, flexibility and reusability. We present the formal framework we use: Fiacre, its langage and the produced TTS model; Tina, its model checking tools and Hippo, its runtime verification engine. We then show how the translation from BT to Fiacre is automatically done, the type of formal LTL and CTL properties we can check offline and how to execute the formal model online in place of a regular BT engine. We illustrate our approach on two robotics applications, and show how BT could benefit of other features available in the Fiacre formal framework (state variables, time, etc).


Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with an LLM-Enabled Simulation

Chen, Zirong, Chason, Elizabeth, Mladenovski, Noah, Wilson, Erin, Mullen, Kristin, Martini, Stephen, Ma, Meiyi

arXiv.org Artificial Intelligence

Emergency response services are vital for enhancing public safety by safeguarding the environment, property, and human lives. As frontline members of these services, 9-1-1 dispatchers have a direct impact on response times and the overall effectiveness of emergency operations. However, traditional dispatcher training methods, which rely on role-playing by experienced personnel, are labor-intensive, time-consuming, and often neglect the specific needs of underserved communities. To address these challenges, we introduce Sim911, the first training simulation for 9-1-1 dispatchers powered by Large Language Models (LLMs). Sim911 enhances training through three key technical innovations: (1) knowledge construction, which utilizes archived 9-1-1 call data to generate simulations that closely mirror real-world scenarios; (2) context-aware controlled generation, which employs dynamic prompts and vector bases to ensure that LLM behavior aligns with training objectives; and (3) validation with looped correction, which filters out low-quality responses and refines the system performance.