Agents
Alternates, Assemble! Selecting Optimal Alternates for Citizens' Assemblies
Assos, Angelos, Baharav, Carmel, Flanigan, Bailey, Procaccia, Ariel
Citizens' assemblies are an increasingly influential form of deliberative democracy, where randomly selected people discuss policy questions. The legitimacy of these assemblies hinges on their representation of the broader population, but participant dropout often leads to an unbalanced composition. In practice, dropouts are replaced by preselected alternates, but existing methods do not address how to choose these alternates. To address this gap, we introduce an optimization framework for alternate selection. Our algorithmic approach, which leverages learning-theoretic machinery, estimates dropout probabilities using historical data and selects alternates to minimize expected misrepresentation. Our theoretical bounds provide guarantees on sample complexity (with implications for computational efficiency) and on loss due to dropout probability mis-estimation. Empirical evaluation using real-world data demonstrates that, compared to the status quo, our method significantly improves representation while requiring fewer alternates.
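As a toy illustration of "selecting alternates to minimize expected misrepresentation", the sketch below enumerates every dropout pattern for a very small panel and picks the alternate set with the lowest expected quota shortfall. It is not the authors' algorithm. Every name in it is hypothetical, and it makes several simplifying assumptions: a single demographic attribute, a panel that initially meets its quotas exactly, known dropout probabilities per group, alternates who never drop out, and a brute-force search that only works for tiny inputs.

```python
from collections import Counter
from itertools import combinations, product


def expected_shortfall(panel, dropout_prob, alternates):
    """Expected number of unfilled seats after dropouts are replaced.

    panel: list of group labels, one per panelist (assumed to meet quotas).
    dropout_prob: dict mapping group label -> dropout probability (assumed known).
    alternates: iterable of group labels for the chosen alternates.
    """
    alt_counts = Counter(alternates)
    total = 0.0
    # Enumerate every dropout pattern; feasible only for very small panels.
    for pattern in product([False, True], repeat=len(panel)):
        prob = 1.0
        dropped = Counter()
        for group, drops in zip(panel, pattern):
            p = dropout_prob[group]
            prob *= p if drops else 1.0 - p
            if drops:
                dropped[group] += 1
        # A dropout can only be replaced by an alternate from the same group.
        shortfall = sum(max(0, n - alt_counts[g]) for g, n in dropped.items())
        total += prob * shortfall
    return total


def best_alternates(panel, dropout_prob, pool, k):
    """Brute-force choice of k alternates from pool minimizing expected shortfall."""
    return min(
        combinations(pool, k),
        key=lambda alts: expected_shortfall(panel, dropout_prob, alts),
    )
```

With a panel of two "young" and two "old" members, dropout probabilities of 0.5 and 0.1 respectively, and a budget of one alternate, this sketch prefers a "young" alternate, because that group accounts for most of the expected loss.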
A simulation framework for autonomous lunar construction work
Linde, Mattias, Lindmark, Daniel, Ålstig, Sandra, Servin, Martin
We present a simulation framework for lunar construction work involving multiple autonomous machines. The framework supports modelling of construction scenarios and autonomy solutions, execution of the scenarios in simulation, and analysis of work time and energy consumption throughout the construction project. The simulations are based on physics-based models for contacting multibody dynamics and deformable terrain, including vehicle-soil interaction forces and soil flow in real time. A behaviour tree manages the operational logic and error handling, which enables the representation of complex behaviours through a discrete set of simpler tasks in a modular hierarchical structure. High-level decision-making is separated from lower-level control algorithms, with the two connected via ROS2. Excavation movements are controlled through inverse kinematics and tracking controllers. The framework is tested and demonstrated on two different lunar construction scenarios that involve an excavator and dump truck with actively controlled articulated crawlers.
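The behaviour-tree idea mentioned above (complex behaviour and error handling composed from simpler tasks) can be sketched in a few lines. This is a generic, memoryless behaviour tree, not the framework's implementation. The task names (dig, recover, dump) and the state dictionary are hypothetical, and no ROS2 or physics is involved.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Action:
    """Leaf node wrapping a function that takes the shared state and returns a Status."""

    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def tick(self, state):
        return self.fn(state)


class Sequence:
    """Ticks children in order; stops at the first child that does not succeed."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status is not Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback:
    """Ticks children in order; stops at the first child that does not fail."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            status = child.tick(state)
            if status is not Status.FAILURE:
                return status
        return Status.FAILURE


# Hypothetical excavator tasks, for illustration only.
def dig(state):
    if state.get("bucket_stuck"):
        return Status.FAILURE
    state["load"] = state.get("load", 0) + 1
    return Status.SUCCESS


def recover(state):
    state["bucket_stuck"] = False
    return Status.SUCCESS


def dump(state):
    state["dumped"] = state.get("dumped", 0) + state.pop("load", 0)
    return Status.SUCCESS


# Try to dig; if that fails, recover and dig again; then dump the load.
tree = Sequence(
    Fallback(Action("dig", dig), Sequence(Action("recover", recover), Action("dig", dig))),
    Action("dump", dump),
)
```

Ticking the tree with a state of `{"bucket_stuck": True}` exercises the error-handling branch: the first dig fails, the fallback runs the recovery sequence, and the load is then dumped.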
Making Teams and Influencing Agents: Efficiently Coordinating Decision Trees for Interpretable Multi-Agent Reinforcement Learning
Chen, Rex, Milani, Stephanie, Zhang, Zhicheng, Sadeh, Norman, Fang, Fei
Poor interpretability hinders the practical applicability of multi-agent reinforcement learning (MARL) policies. Deploying interpretable surrogates of uninterpretable policies enhances the safety and verifiability of MARL for real-world applications. However, if these surrogates are to interact directly with the environment within human supervisory frameworks, they must be both performant and computationally efficient. Prior work on interpretable MARL has either sacrificed performance for computational efficiency or computational efficiency for performance. To address this issue, we propose HYDRA VIPER, a decision tree-based interpretable MARL algorithm. HYDRA VIPER coordinates training between agents based on expected team performance, and adaptively allocates budgets for environment interaction to improve computational efficiency. Experiments on standard benchmark environments for multi-agent coordination and traffic signal control show that HYDRA VIPER matches the performance of state-of-the-art methods using a fraction of the runtime, and that it maintains a Pareto frontier of performance for different interaction budgets.
SWE-smith: Scaling Data for Software Engineering Agents
Yang, John, Lieret, Kilian, Jimenez, Carlos E., Wettig, Alexander, Khandpur, Kabir, Zhang, Yanzhe, Hui, Binyuan, Press, Ofir, Schmidt, Ludwig, Yang, Diyi
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
Enhancing Explainability and Reliable Decision-Making in Particle Swarm Optimization through Communication Topologies
Gupta, Nitin, Bala, Indu, Dutta, Bapi, Martínez, Luis, Yadav, Anupam
Swarm intelligence effectively optimizes complex systems across fields like engineering and healthcare, yet algorithm solutions often suffer from low reliability due to unclear configurations and hyperparameters. This study analyzes Particle Swarm Optimization (PSO), focusing on how different communication topologies (Ring, Star, and Von Neumann) affect convergence and search behaviors. Using an adapted IOHxplainer, an explainable benchmarking tool, we investigate how these topologies influence information flow, diversity, and convergence speed, clarifying the balance between exploration and exploitation. Through visualization and statistical analysis, the research enhances interpretability of PSO's decisions and provides practical guidelines for choosing suitable topologies for specific optimization tasks. Ultimately, this contributes to making swarm-based optimization more transparent, robust, and trustworthy.
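To make the three topologies concrete, the sketch below computes which particles each particle can exchange information with. It follows common textbook conventions and is not taken from the paper. In particular, it assumes that Star means the global-best topology (every particle sees the whole swarm), that Ring uses one neighbour on each side, and that Von Neumann uses a wrap-around grid. All function names are hypothetical.

```python
def ring_neighbors(i, n):
    """Ring: particle i sees itself and its two adjacent particles."""
    return sorted({(i - 1) % n, i, (i + 1) % n})


def star_neighbors(i, n):
    """Star, assumed here to mean global best: particle i sees the whole swarm."""
    return list(range(n))


def von_neumann_neighbors(i, rows, cols):
    """Von Neumann: particles sit on a wrapped rows x cols grid and see
    themselves plus the cells above, below, left, and right."""
    r, c = divmod(i, cols)
    cells = {
        (r, c),
        ((r - 1) % rows, c),
        ((r + 1) % rows, c),
        (r, (c - 1) % cols),
        (r, (c + 1) % cols),
    }
    return sorted(rr * cols + cc for rr, cc in cells)


def neighborhood_best(neighbors, best_fitness):
    """Index of the neighbour with the lowest personal-best fitness (minimization)."""
    return min(neighbors, key=lambda j: best_fitness[j])
```

The size of the neighbourhood controls how fast a good solution spreads: under Star a new best is visible to every particle at once, while under Ring it propagates one neighbour per iteration, which tends to preserve diversity for longer.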
OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Wang, Weixuan, Han, Dongge, Diaz, Daniel Madrigal, Xu, Jin, Rühle, Victor, Rajmohan, Saravan
Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires an agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation
Doropoulos, Stavros, Vologiannidis, Stavros, Magnisalis, Ioannis
The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model (LLM)-based multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.
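The two reported metrics can be illustrated with a short sketch. The abstract does not define them precisely, so this assumes one plausible reading: a turn counts as an exact match when the predicted multiset of labels equals the gold multiset, and the multiset F1 is micro-averaged over all turns using multiset intersection. The function names and the label strings in the usage note are hypothetical.

```python
from collections import Counter


def exact_match_accuracy(gold_turns, pred_turns):
    """Fraction of turns whose predicted label multiset equals the gold multiset."""
    matches = sum(Counter(g) == Counter(p) for g, p in zip(gold_turns, pred_turns))
    return matches / len(gold_turns)


def multiset_f1(gold_turns, pred_turns):
    """Micro-averaged F1 where true positives are counted by multiset intersection."""
    true_pos = n_pred = n_gold = 0
    for g, p in zip(gold_turns, pred_turns):
        gold_counts, pred_counts = Counter(g), Counter(p)
        # Counter & Counter keeps the minimum count of each label.
        true_pos += sum((gold_counts & pred_counts).values())
        n_pred += sum(pred_counts.values())
        n_gold += sum(gold_counts.values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)
```

For example, with gold labels `[["create_task"], ["summarize", "create_task"], []]` and predictions `[["create_task"], ["summarize"], []]`, two of three turns match exactly, precision is 1, recall is 2/3, and the multiset F1 is 0.8.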
AZRA: Extending the Affective Capabilities of Zoomorphic Robots using Augmented Reality
Macdonald, Shaun, ElSayed, Salma, McGill, Mark
Zoomorphic robots could serve as accessible and practical alternatives for users unable or unwilling to keep pets. However, their affective interactions are often simplistic and short-lived, limiting their potential for domestic adoption. To facilitate more dynamic and nuanced affective interactions and relationships between users and zoomorphic robots, we present AZRA, a novel augmented reality (AR) framework that extends the affective capabilities of these robots without physical modifications. To demonstrate AZRA, we augment a zoomorphic robot, Petit Qoobo, with novel emotional displays (face, light, sound, thought bubbles) and interaction modalities (voice, touch, proximity, gaze). Additionally, AZRA features a computational model of emotion to calculate the robot's emotional responses, daily moods, evolving personality and needs. We highlight how AZRA can be used for rapid participatory prototyping and enhancing existing robots, then discuss implications for future zoomorphic robot development.
Know-How and Expertise: European Companies Hoping to Take the Global Lead in Industrial AI
Rückert's focus, though, is on more proactive AI applications that can make decisions on their own and control processes. Such AI agents, she believes, will give industry a boost comparable to the earlier advances triggered by smartphones and the internet. If a machine breaks down, the agent will check whether the same problem has already occurred in a different Bosch factory, examine handbooks and scan shift logs, and then propose a possible solution within seconds. For more complex tasks, several agents can be combined, which then communicate with each other. Comprehensive use of such tools, says Rückert, can translate into millions in savings for individual factories.
Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning
Yi, Seungjun, Nguyen, Joakim, Xu, Huimin, Lim, Terence, Well, Andrew, Markey, Mia, Ding, Ying
Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.