Agents
Computer-Use Agents as Judges for Generative User Interface
Lin, Kevin Qinghong, Hu, Siyuan, Li, Linjie, Yang, Zhengyuan, Wang, Lijuan, Torr, Philip, Shou, Mike Zheng
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
Exploring the use of AI authors and reviewers at Agents4Science
Bianchi, Federico, Queen, Owen, Thakkar, Nitya, Sun, Eric, Zou, James
There is growing interest in using AI agents for scientific research, yet fundamental questions remain about their capabilities as scientists and reviewers. To explore these questions, we organized Agents4Science, the first conference in which AI agents serve as both primary authors and reviewers, with humans as co-authors and co-reviewers. Here, we discuss the key learnings from the conference and their implications for human-AI collaboration in science.
Decentralized Gaussian Process Classification and an Application in Subsea Robotics
Gao, Yifei, He, Hans J., Stilwell, Daniel J., McMahon, James
Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem -- communication events are either successful or unsuccessful -- where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.
Terra Nova: A Comprehensive Challenge Environment for Intelligent Agents
We introduce Terra Nova, a new comprehensive challenge environment (CCE) for reinforcement learning (RL) research inspired by Civilization V. A CCE is a single environment in which multiple canonical RL challenges (e.g., partial observability, credit assignment, representation learning, enormous action spaces, etc.) arise simultaneously. Mastery therefore demands integrated, long-horizon understanding across many interacting variables. We emphasize that this definition excludes challenges that only aggregate unrelated tasks in independent, parallel streams (e.g., learning to play all Atari games at once). These aggregated multitask benchmarks primarily asses whether an agent can catalog and switch among unrelated policies rather than test an agent's ability to perform deep reasoning across many interacting challenges.
Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions
Achieving Sustainable Development Goal 7 (Affordable and Clean Energy) requires not only technological innovation but also a deeper understanding of the socioeconomic factors influencing energy access and carbon emissions. While these factors are gaining attention, critical questions remain, particularly regarding how to quantify their impacts on energy systems, model their cross-domain interactions, and capture feedback dynamics in the broader context of energy transitions. To address these gaps, this study introduces ClimateAgents, an AI-based framework that combines large language models with domain-specialized agents to support hypothesis generation and scenario exploration. Leveraging 20 years of socioeconomic and emissions data from 265 economies, countries and regions, and 98 indicators drawn from the World Bank database, the framework applies a machine learning based causal inference approach to identify key determinants of carbon emissions in an evidence-based, data driven manner. The analysis highlights three primary drivers: access to clean cooking fuels in rural areas, access to clean cooking fuels in urban areas, and the percentage of population living in urban areas. These findings underscore the critical role of clean cooking technologies and urbanization patterns in shaping emission outcomes. In line with growing calls for evidence-based AI policy, ClimateAgents offers a modular and reflexive learning system that supports the generation of credible and actionable insights for policy. By integrating heterogeneous data modalities, including structured indicators, policy documents, and semantic reasoning, the framework contributes to adaptive policymaking infrastructures that can evolve with complex socio-technical challenges. This approach aims to support a shift from siloed modeling to reflexive, modular systems designed for dynamic, context-aware climate action.
Opinion Dynamics Models for Sentiment Evolution in Weibo Blogs
He, Yulong, Proskurnikov, Anton V., Sedakov, Artem
Online social media platforms enable influencers to distribute content and quickly capture audience reactions, significantly shaping their promotional strategies and advertising agreements. Understanding how sentiment dynamics and emotional contagion unfold among followers is vital for influencers and marketers, as these processes shape engagement, brand perception, and purchasing behavior. While sentiment analysis tools effectively track sentiment fluctuations, dynamical models explaining their evolution remain limited, often neglecting network structures and interactions both among blogs and between their topic-focused follower groups. In this study, we tracked influential tech-focused Weibo bloggers over six months, quantifying follower sentiment from text-mined feedback. By treating each blogger's audience as a single "macro-agent", we find that sentiment trajectories follow the principle of iterative averaging -- a foundational mechanism in many dynamical models of opinion formation, a theoretical framework at the intersection of social network analysis and dynamical systems theory. The sentiment evolution aligns closely with opinion-dynamics models, particularly modified versions of the classical French-DeGroot model that incorporate delayed perception and distinguish between expressed and private opinions. The inferred influence structures reveal interdependencies among blogs that may arise from homophily, whereby emotionally similar users subscribe to the same blogs and collectively shape the shared sentiment expressed within these communities.
Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation
Chen, Jianming, Wang, Yawen, Wang, Junjie, Xie, Xiaofei, Hu, Yuanzhe, Wang, Qing, Xu, Fanjiang
Evaluating security and reliability for multi-agent systems (MAS) is urgent as they become increasingly prevalent in various applications. As an evaluation technique, existing adversarial attack frameworks face certain limitations, e.g., impracticality due to the requirement of white-box information or high control authority, and a lack of stealthiness or effectiveness as they often target all agents or specific fixed agents. To address these issues, we propose AdapAM, a novel framework for adversarial attacks on black-box MAS. AdapAM incorporates two key components: (1) Adaptive Selection Policy simultaneously selects the victim and determines the anticipated malicious action (the action would lead to the worst impact on MAS), balancing effectiveness and stealthiness. (2) Proxy-based Perturbation to Induce Malicious Action utilizes generative adversarial imitation learning to approximate the target MAS, allowing AdapAM to generate perturbed observations using white-box information and thus induce victims to execute malicious action in black-box settings. We evaluate AdapAM across eight multi-agent environments and compare it with four state-of-the-art and commonly-used baselines. Results demonstrate that AdapAM achieves the best attack performance in different perturbation rates. Besides, AdapAM-generated perturbations are the least noisy and hardest to detect, emphasizing the stealthiness.
Optimizing Robot Positioning Against Placement Inaccuracies: A Study on the Fanuc CRX10iA/L
Gautier, Nicolas, Guillermit, Yves, Porez, Mathieu, Lemoine, David, Chablat, Damien
This study presents a methodology for determining the optimal base placement of a Fanuc CRX10iA/L collaborative robot for a desired trajectory corresponding to an industrial task. The proposed method uses a particle swarm optimization algorithm that explores the search space to find positions for performing the trajectory. An $ฮฑ$-shape algorithm is then used to draw the borders of the feasibility areas, and the largest circle inscribed is calculated from the Voronoi diagrams. The aim of this approach is to provide a robustness criterion in the context of robot placement inaccuracies that may be encountered, for example, if the robot is placed on a mobile base when the system is deployed by an operator. The approach developed uses an inverse kinematics model to evaluate all initial configurations, then moves the robot end-effector along the reference trajectory using the Jacobian matrix and assigns a score to the attempt. For the Fanuc CRX10iA/L robot, there can be up to 16 solutions to the inverse kinematics model. The calculation of these solutions is not trivial and requires a specific study that planning tools such as MoveIt cannot fully take into account. Additionally, the optimization process must consider constraints such as joint limits, singularities, and workspace limitations to ensure feasible and efficient trajectory execution.
Modelling and Model-Checking a ROS2 Multi-Robot System using Timed Rebeca
Trinh, Hiep Hong, Sirjani, Marjan, Ciccozzi, Federico, Masud, Abu Naser, Sjรถdin, Mikael
Model-based development enables quicker prototyping, earlier experimentation and validation of design intents. For a multi-agent system with complex asynchronous interactions and concurrency, formal verification, model-checking in particular, offers an automated mechanism for verifying desired properties. Timed Rebeca is an actor-based modelling language supporting reactive, concurrent and time semantics, accompanied with a model-checking compiler. These capabilities allow using Timed Rebeca to correctly model ROS2 node topographies, recurring physical signals, motion primitives and other timed and time-convertible behaviors. The biggest challenges in modelling and verifying a multi-robot system lie in abstracting complex information, bridging the gap between a discrete model and a continuous system and compacting the state space, while maintaining the model's accuracy. We develop different discretization strategies for different kinds of information, identifying the 'enough' thresholds of abstraction, and applying efficient optimization techniques to boost computations. With this work we demonstrate how to use models to design and verify a multi-robot system, how to discretely model a continuous system to do model-checking efficiently, and the round-trip engineering flow between the model and the implementation. The released Rebeca and ROS2 codes can serve as a foundation for modelling multiple autonomous robots systems.
Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents
Bradland, Henrik, Goodwin, Morten, Zadorozhny, Vladimir I., Andersen, Per-Arne
The performance of machine learning models on tabular data is critically dependent on high-quality feature engineering. While Large Language Models (LLMs) have shown promise in automating feature extraction (AutoFE), existing methods are often limited by monolithic LLM architectures, simplistic quantitative feedback, and a failure to systematically integrate external domain knowledge. This paper introduces Rogue One, a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction. Rogue One operationalizes a decentralized system of three specialized agents-Scientist, Extractor, and Tester-that collaborate iteratively to discover, generate, and validate predictive features. Crucially, the framework moves beyond primitive accuracy scores by introducing a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, allowing it to dynamically balance feature exploration and exploitation. By actively incorporating external knowledge via an integrated retrieval-augmented (RAG) system, Rogue One generates features that are not only statistically powerful but also semantically meaningful and interpretable. We demonstrate that Rogue One significantly outperforms state-of-the-art methods on a comprehensive suite of 19 classification and 9 regression datasets. Furthermore, we show qualitatively that the system surfaces novel, testable hypotheses, such as identifying a new potential biomarker in the myocardial dataset, underscoring its utility as a tool for scientific discovery.