Goto

Collaborating Authors

 Agents


Scaling Laws of Motion Forecasting and Planning -- Technical Report

arXiv.org Artificial Intelligence

We study the empirical scaling laws of a family of encoder-decoder autoregressive transformer models on the task of joint motion forecasting and planning in the autonomous driving domain. Using a 500 thousand hours driving dataset, we demonstrate that, similar to language modeling, model performance improves as a power-law function of the total compute budget, and we observe a strong correlation between model training loss and model evaluation metrics. Most interestingly, closed-loop metrics also improve with scaling, which has important implications for the suitability of open-loop metrics for model development and hill climbing. We also study the optimal scaling of the number of transformer parameters and the training data size for a training compute-optimal model. We find that as the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size. We also study inference-time compute scaling, where we observe that sampling and clustering the output of smaller models makes them competitive with larger models, up to a crossover point beyond which a larger models becomes more inference-compute efficient. Overall, our experimental results demonstrate that optimizing the training and inference-time scaling properties of motion forecasting and planning models is a key lever for improving their performance to address a wide variety of driving scenarios. Finally, we briefly study the utility of training on general logged driving data of other agents to improve the performance of the ego-agent, an important research area to address the scarcity of robotics data for large capacity models training.


Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

arXiv.org Artificial Intelligence

This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.


Multi-Modal Multi-Task (M3T) Federated Foundation Models for Embodied AI: Potentials and Challenges for Edge Integration

arXiv.org Artificial Intelligence

As embodied AI systems become increasingly multi-modal, personalized, and interactive, they must learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. These challenges expose a pressing need for machine learning models capable of swift, context-aware adaptation while balancing model generalization and personalization. Here, two methods emerge as suitable candidates, each offering parts of these capabilities: multi-modal multi-task foundation models (M3T-FMs) provide a pathway toward generalization across tasks and modalities, whereas federated learning (FL) offers the infrastructure for distributed, privacy-preserving model updates and user-level model personalization. However, when used in isolation, each of these approaches falls short of meeting the complex and diverse capability requirements of real-world embodied AI environments. In this vision paper, we introduce multi-modal multi-task federated foundation models (M3T-FFMs) for embodied AI, a new paradigm that unifies the strengths of M3T-FMs with the privacy-preserving distributed training nature of FL, enabling intelligent systems at the wireless edge. We collect critical deployment dimensions of M3T-FFMs in embodied AI ecosystems under a unified framework, which we name "EMBODY": Embodiment heterogeneity, Modality richness and imbalance, Bandwidth and compute constraints, On-device continual learning, Distributed control and autonomy, and Yielding safety, privacy, and personalization. For each, we identify concrete challenges and envision actionable research directions. We also present an evaluation framework for deploying M3T-FFMs in embodied AI systems, along with the associated trade-offs. Finally, we present a prototype implementation of M3T-FFMs and evaluate their energy and latency performance.


Nanobot Algorithms for Treatment of Diffuse Cancer

arXiv.org Artificial Intelligence

Motile nanosized particles, or "nanobots", promise more effective and less toxic targeted drug delivery because of their unique scale and precision. We consider the case in which the cancer is "diffuse", dispersed such that there are multiple distinct cancer sites. We investigate the problem of a swarm of nanobots locating these sites and treating them by dropping drug payloads at the sites. To improve the success of the treatment, the drug payloads must be allocated between sites according to their "demands"; this requires extra nanobot coordination. We present a mathematical model of the behavior of the nanobot agents and of their colloidal environment. This includes a movement model for agents based upon experimental findings from actual nanoparticles in which bots noisily ascend and descend chemical gradients. We present three algorithms: The first algorithm, called KM, is the most representative of reality, with agents simply following naturally existing chemical signals that surround each cancer site. The second algorithm, KMA, includes an additional chemical payload which amplifies the existing natural signals. The third algorithm, KMAR, includes another additional chemical payload which counteracts the other signals, instead inducing negative chemotaxis in agents such that they are repelled from sites that are already sufficiently treated. We present simulation results for all algorithms across different types of cancer arrangements. For KM, we show that the treatment is generally successful unless the natural chemical signals are weak, in which case the treatment progresses too slowly. For KMA, we demonstrate a significant improvement in treatment speed but a drop in eventual success, except for concentrated cancer patterns. For KMAR, our results show great performance across all types of cancer patterns, demonstrating robustness and adaptability.


VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction

arXiv.org Artificial Intelligence

Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments' complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on Github https://github.com/OpenMOSS/VehicleWorld.


A Survey of Generalization of Graph Anomaly Detection: From Transfer Learning to Foundation Models

arXiv.org Artificial Intelligence

School of ICT Griffith University Gold Coast, Australia junjun.pan@griffithuni.edu.au Abstract--Graph anomaly detection (GAD) has attracted increasing attention in recent years for identifying malicious samples in a wide range of graph-based applications, such as social media and e-commerce. However, most GAD methods assume identical training and testing distributions and are tailored to specific tasks, resulting in limited adaptability to real-world scenarios such as shifting data distributions and scarce training samples in new applications. T o address the limitations, recent work has focused on improving the generalization capability of GAD models through transfer learning that leverages knowledge from related domains to enhance detection performance, or developing "one-for-all" GAD foundation models that generalize across multiple applications. Since a systematic understanding of generalization in GAD is still lacking, in this paper, we provide a comprehensive review of generalization in GAD. We first trace the evolution of generalization in GAD and formalize the problem settings, which further leads to our systematic taxonomy. Rooted in this fine-grained taxonomy, an up-to-date and comprehensive review is conducted for the existing generalized GAD methods. Finally, we identify current open challenges and suggest future directions to inspire future research in this emerging field. With the advances in information technology, graph-structured data has become a ubiquitous data structure in online services, including social media [11], e-commerce [44], and autonomous agents [9], [23], [26], [34].


Several Performance Bounds on Decentralized Online Optimization are Highly Conservative and Potentially Misleading

arXiv.org Artificial Intelligence

We analyze Decentralized Online Optimization algorithms using the Performance Estimation Problem approach which allows, to automatically compute exact worst-case performance of optimization algorithms. Our analysis shows that several available performance guarantees are very conservative, sometimes by multiple orders of magnitude, and can lead to misguided choices of algorithm. Moreover, at least in terms of worst-case performance, some algorithms appear not to benefit from inter-agent communications for a significant period of time. We show how to improve classical methods by tuning their step-sizes, and find that we can save up to 20% on their actual worst-case performance regret.


Orchestrator: Active Inference for Multi-Agent Systems in Long-Horizon Tasks

arXiv.org Artificial Intelligence

Complex, non-linear tasks challenge LLM-enhanced multi-agent systems (MAS) due to partial observability and suboptimal coordination. We propose Orchestrator, a novel MAS framework that leverages attention-inspired self-emergent coordination and reflective benchmarking to optimize global task performance. Orchestrator introduces a monitoring mechanism to track agent-environment dynamics, using active inference benchmarks to optimize system behavior. By tracking agent-to-agent and agent-to-environment interaction, Orchestrator mitigates the effects of partial observability and enables agents to approximate global task solutions more efficiently. We evaluate the framework on a series of maze puzzles of increasing complexity, demonstrating its effectiveness in enhancing coordination and performance in dynamic, non-linear environments with long-horizon objectives.


AI Agents for Web Testing: A Case Study in the Wild

arXiv.org Artificial Intelligence

Automated web testing plays a critical role in ensuring high-quality user experiences and delivering business value. Traditional approaches primarily focus on code coverage and load testing, but often fall short of capturing complex user behaviors, leaving many usability issues undetected. The emergence of large language models (LLM) and AI agents opens new possibilities for web testing by enabling human-like interaction with websites and a general awareness of common usability problems. In this work, we present WebProber, a prototype AI agent-based web testing framework. Given a URL, WebProber autonomously explores the website, simulating real user interactions, identifying bugs and usability issues, and producing a human-readable report. We evaluate WebProber through a case study of 120 academic personal websites, where it uncovered 29 usability issues--many of which were missed by traditional tools. Our findings highlight agent-based testing as a promising direction while outlining directions for developing next-generation, user-centered testing frameworks.


Collective decision-making dynamics in hypernetworks

arXiv.org Artificial Intelligence

Abstract-- This work describes a collective decision-making dynamical process in a multiagent system under the assumption of cooperative higher-order interactions within the community, modeled as a hypernetwork. The nonlinear interconnected system is characterized by saturated nonlinearities that describe how agents transmit their opinion state to their neighbors in the hypernetwork, and by a bifurcation parameter representing the community's social effort. We show that the presence of higher-order interactions leads to the unfolding of a pitchfork bifurcation, introducing an interval for the social effort parameter in which the system exhibits bistability. With equilibrium points representing collective decisions, this implies that, depending on the initial conditions, the community will either remain in a deadlock state (with the origin as the equilibrium point) or reach a nontrivial decision. A numerical example is given to illustrate the results.