Goto

Collaborating Authors

 Agents


The Most-Cited Computer Scientist Has a Plan to Make AI More Trustworthy

TIME - Tech

On June 3, Yoshua Bengio, the world's most-cited computer scientist, announced the launch of LawZero, a nonprofit that aims to create "safe by design" AI by pursuing a fundamentally different approach to major tech companies. Players like OpenAI and Google are investing heavily in AI agents--systems that not only answer queries and generate images, but can craft plans and take actions in the world. The goal of these companies is to create virtual employees that can do practically any job a human can, known in the tech industry as artificial general intelligence, or AGI. Executives like Google DeepMind's CEO Demis Hassabis point to AGI's potential to solve climate change or cure disease as a motivator for its development. Bengio, however, says we don't need agentic systems to reap AI's rewards--it's a false choice.


EXP-Bench: Can AI Conduct AI Research Experiments?

arXiv.org Artificial Intelligence

Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high-fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.


Geometry Meets Incentives: Sample-Efficient Incentivized Exploration with Linear Contexts

arXiv.org Artificial Intelligence

In the incentivized exploration model, a principal aims to explore and learn over time by interacting with a sequence of self-interested agents. It has been recently understood that the main challenge in designing incentive-compatible algorithms for this problem is to gather a moderate amount of initial data, after which one can obtain near-optimal regret via posterior sampling. With high-dimensional contexts, however, this \emph{initial exploration} phase requires exponential sample complexity in some cases, which prevents efficient learning unless initial data can be acquired exogenously. We show that these barriers to exploration disappear under mild geometric conditions on the set of available actions, in which case incentive-compatibility does not preclude regret-optimality. Namely, we consider the linear bandit model with actions in the Euclidean unit ball, and give an incentive-compatible exploration algorithm with sample complexity that scales polynomially with the dimension and other parameters.


Social Cooperation in Conversational AI Agents

arXiv.org Artificial Intelligence

The development of AI agents based on large, open-domain language models (LLMs) has paved the way for the development of general-purpose AI assistants that can support human in tasks such as writing, coding, graphic design, and scientific research. A major challenge with such agents is that, by necessity, they are trained by observing relatively short-term interactions with humans. Such models can fail to generalize to long-term interactions, for example, interactions where a user has repeatedly corrected mistakes on the part of the agent. In this work, we argue that these challenges can be overcome by explicitly modeling humans' social intelligence, that is, their ability to build and maintain long-term relationships with other agents whose behavior cannot always be predicted. By mathematically modeling the strategies humans use to communicate and reason about one another over long periods of time, we may be able to derive new game theoretic objectives against which LLMs and future AI agents may be optimized.


SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

arXiv.org Artificial Intelligence

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Retrieval (CLIR), highlights the inadequacy of existing monolingual datasets. Although some code-switching datasets exist, most are limited to bilingual mixing within homogeneous ethnic groups, leaving a critical need for a large-scale, diverse benchmark akin to ImageNet in computer vision. To bridge this gap, we introduce \textbf{LinguaMaster}, a multi-agent collaboration framework specifically designed for efficient and scalable multilingual data synthesis. Leveraging this framework, we curate \textbf{SwitchLingua}, the first large-scale multilingual and multi-ethnic code-switching dataset, including: (1) 420K CS textual samples across 12 languages, and (2) over 80 hours of audio recordings from 174 speakers representing 18 countries/regions and 63 racial/ethnic backgrounds, based on the textual data. This dataset captures rich linguistic and cultural diversity, offering a foundational resource for advancing multilingual and multicultural research. Furthermore, to address the issue that existing ASR evaluation metrics lack sensitivity to code-switching scenarios, we propose the \textbf{Semantic-Aware Error Rate (SAER)}, a novel evaluation metric that incorporates semantic information, providing a more accurate and context-aware assessment of system performance.


Risks of AI-driven product development and strategies for their mitigation

arXiv.org Artificial Intelligence

Humanity is progressing towards automated product development, a trend that promises faster creation of better products and thus the acceleration of technological progress. However, increasing reliance on non-human agents for this process introduces many risks. This perspective aims to initiate a discussion on these risks and appropriate mitigation strategies. To this end, we outline a set of principles for safer AI-driven product development which emphasize human oversight, accountability, and explainable design, among others. The risk assessment covers both technical risks which affect product quality and safety, and sociotechnical risks which affect society. While AI-driven product development is still in its early stages, this discussion will help balance its opportunities and risks without delaying essential progress in understanding, norm-setting, and regulation.


Emerging ML-AI Techniques for Analog and RF EDA

arXiv.org Artificial Intelligence

This survey explores the integration of machine learning (ML) into EDA workflows for analog and RF circuits, addressing challenges unique to analog design, which include complex constraints, nonlinear design spaces, and high computational costs. State-of-the-art learning and optimization techniques are reviewed for circuit tasks such as constraint formulation, topology generation, device modeling, sizing, placement, and routing. The survey highlights the capability of ML to enhance automation, improve design quality, and reduce time-to-market while meeting the target specifications of an analog or RF circuit. Emerging trends and cross-cutting challenges, including robustness to variations and considerations of interconnect parasitics, are also discussed.


Human-AI Governance (HAIG): A Trust-Utility Approach

arXiv.org Artificial Intelligence

This paper introduces the HAIG framework for analysing trust dynamics across evolving human-AI relationships. Current categorical frameworks (e.g., "human-in-the-loop" models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi-agent systems exhibit autonomous goal-setting behaviours. As systems advance, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories, though progression may include both gradual shifts and significant step changes. The HAIG framework operates across three levels: dimensions (Decision Authority Distribution, Process Autonomy, and Accountability Configuration), continua (gradual shifts along each dimension), and thresholds (critical points requiring governance adaptation). Unlike risk-based or principle-based approaches, HAIG adopts a trust-utility orientation, focusing on maintaining appropriate trust relationships that maximise utility while ensuring sufficient safeguards. Our analysis reveals how technical advances in self-supervision, reasoning authority, and distributed decision-making drive non-uniform trust evolution across both contextual variation and technological advancement. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for alternative approaches that anticipate governance challenges before they emerge.


Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

arXiv.org Artificial Intelligence

Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution


Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

arXiv.org Artificial Intelligence

Systematic reviews (SRs) are vital for evidence-based practice in high stakes disciplines, such as healthcare, but are often impeded by intensive labors and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionize this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high-quality systematic review.