Agents
Simultaneously Fair Allocation of Indivisible Items Across Multiple Dimensions
Kawase, Yasushi, Roy, Bodhayan, Sanpui, Mohammad Azharuddin
This paper explores the fair allocation of indivisible items in a multidimensional setting, motivated by the need to address fairness in complex environments where agents assess bundles according to multiple criteria. Such multidimensional settings are not merely of theoretical interest but are central to many real-world applications. For example, cloud computing resources are evaluated based on multiple criteria such as CPU cores, memory, and network bandwidth. In such cases, traditional one-dimensional fairness notions fail to capture fairness across multiple attributes. To address these challenges, we study two relaxed variants of envy-freeness: weak simultaneously envy-free up to c goods (weak sEFc) and strong simultaneously envy-free up to c goods (strong sEFc), which accommodate the multidimensionality of agents' preferences. Under the weak notion, for every pair of agents and for each dimension, any perceived envy can be eliminated by removing, if necessary, a different set of goods from the envied agent's allocation. In contrast, the strong version requires selecting a single set of goods whose removal from the envied bundle simultaneously eliminates envy in every dimension. We provide upper and lower bounds on the relaxation parameter c that guarantee the existence of weak or strong sEFc allocations, where these bounds are independent of the total number of items. In addition, we present algorithms for checking whether a weak or strong sEFc allocation exists. Moreover, we establish NP-hardness results for checking the existence of weak sEF1 and strong sEF1 allocations.
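The difference between the weak and strong notions can be made concrete with a small checker. The sketch below is illustrative only and is not the paper's algorithm: it assumes additive, non-negative valuations (the abstract does not fix a valuation class), and the function names and data layout (`vals[i][k][g]` is agent i's value for good g in dimension k) are hypothetical. It verifies a given allocation rather than searching for one.

```python
from itertools import combinations

def bundle_value(dim_vals, bundle):
    """Additive value of a bundle in one dimension (assumed valuation class)."""
    return sum(dim_vals[g] for g in bundle)

def is_weak_sefc(vals, alloc, c):
    """Weak sEFc: per pair and per dimension, some set of at most c goods
    (possibly different in each dimension) removes the envy."""
    n = len(alloc)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for dim_vals in vals[i]:
                own = bundle_value(dim_vals, alloc[i])
                # With non-negative additive values, the best removal in a
                # single dimension drops the c most valuable goods.
                kept = sorted((dim_vals[g] for g in alloc[j]), reverse=True)[c:]
                if own < sum(kept):
                    return False
    return True

def is_strong_sefc(vals, alloc, c):
    """Strong sEFc: per pair, ONE set of at most c goods removes the envy
    in every dimension at once (brute force over candidate sets)."""
    n = len(alloc)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            other = sorted(alloc[j])
            # Removing more goods never hurts when values are non-negative,
            # so sets of size exactly min(c, |bundle|) suffice.
            size = min(c, len(other))
            found = any(
                all(
                    bundle_value(dv, alloc[i])
                    >= bundle_value(dv, set(other) - set(removed))
                    for dv in vals[i]
                )
                for removed in combinations(other, size)
            )
            if not found:
                return False
    return True

# Two agents, two dimensions, three goods (made-up numbers).
vals = [
    [[1, 2, 1], [1, 1, 2]],  # agent 0: dimension 0, dimension 1
    [[0, 1, 1], [0, 1, 1]],  # agent 1: dimension 0, dimension 1
]
alloc = [{0}, {1, 2}]

print(is_weak_sefc(vals, alloc, 1))    # True
print(is_strong_sefc(vals, alloc, 1))  # False
print(is_strong_sefc(vals, alloc, 2))  # True
```

In this made-up instance, agent 0's envy is removed by dropping good 1 in dimension 0 and good 2 in dimension 1, so the allocation is weak sEF1; no single good works for both dimensions, so it is not strong sEF1, although removing both goods makes it strong sEF2.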
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Sun, Qiushi, Liu, Zhoumianze, Ma, Chang, Ding, Zichen, Xu, Fangzhi, Yin, Zhangyue, Zhao, Haiteng, Wu, Zhenyu, Cheng, Kanzhi, Liu, Zhaoyang, Wang, Jianing, Li, Qintong, Tang, Xiangru, Xie, Tianbao, Feng, Xiachong, Li, Xiang, Kao, Ben, Wang, Wenhai, Qi, Biqing, Kong, Lingpeng, Wu, Zhiyong
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.
$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
Yu, Peijie, Yang, Yifan, Li, Jinjian, Zhang, Zelong, Wang, Haorui, Feng, Xiao, Zhang, Feng
Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues. However, it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present an open-source, high-quality benchmark, $C^3$-Bench. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies and frequent policy-type switching. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/TencentHunyuan/C3-Benchmark.
TrajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge
Zhang, Zhiyuan, Jia, Xiaosong, Chen, Guanyu, Li, Qifeng, Yan, Junchi
In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and achieve superior performance, with a realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.
SysTemp: A Multi-Agent System for Template-Based Generation of SysML v2
Bouamra, Yasmine, Yun, Bruno, Poisson, Alexandre, Armetta, Frédéric
The automatic generation of SysML v2 models represents a major challenge in the engineering of complex systems, particularly due to the scarcity of learning corpora and complex syntax. We present SysTemp, a system aimed at facilitating and improving the creation of SysML v2 models from natural language specifications. It is based on a multi-agent system, including a template generator that structures the generation process. We discuss the advantages and challenges of this system through an evaluation, highlighting its potential to improve the quality of the generations in SysML v2 modeling.
I Let AI Agents Plan My Vacation--and It Wasn't Terrible
The worst part of travel is the planning: the faff of finding and booking transport, accommodation, restaurant reservations--the list can feel endless. To help, the latest wave of AI agents, such as OpenAI's Operator and Anthropic's Computer Use, claim they can take these dreary, cumbersome tasks from befuddled travelers and do it all for you. But exactly how good are they at digging out the good stuff? What better way to find out than deciding on a last-minute weekend away? I tasked Operator, which is available to ChatGPT Pro subscribers, with booking me something budget-friendly, with good food and art, and told it that I'd prefer to travel by train.
xChemAgents: Agentic AI for Explainable Quantum Chemistry
Polat, Can, Tuncel, Mehmet, Kurban, Mustafa, Serpedin, Erchin, Kurban, Hasan
Recent progress in multimodal graph neural networks has demonstrated that augmenting atomic XYZ geometries with textual chemical descriptors can enhance predictive accuracy across a range of electronic and thermodynamic properties. However, naively appending large sets of heterogeneous descriptors often degrades performance on tasks sensitive to molecular shape or symmetry, and undermines interpretability. xChemAgents proposes a cooperative agent framework that injects physics-aware reasoning into multimodal property prediction. xChemAgents comprises two language-model-based agents: a Selector, which adaptively identifies a sparse, weighted subset of descriptors relevant to each target, and provides a natural language rationale; and a Validator, which enforces physical constraints such as unit consistency and scaling laws through iterative dialogue. On standard benchmark datasets, xChemAgents achieves up to a 22% reduction in mean absolute error over the state-of-the-art baselines, while producing faithful, human-interpretable explanations. Experiment results highlight the potential of cooperative, self-verifying agents to enhance both accuracy and transparency in foundation-model-driven materials science. The implementation and accompanying dataset are available at https://github.com/KurbanIntelligenceLab/xChemAgents.
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Men, Tianyi, Jin, Zhuoran, Cao, Pengfei, Chen, Yubo, Liu, Kang, Zhao, Jun
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenges, and perform manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available on GitHub.
Artificial Delegates Resolve Fairness Issues in Perpetual Voting with Partial Turnout
Shah, Apurva, Abels, Axel, Nowé, Ann, Lenaerts, Tom
Perpetual voting considers sequences of decisions made by the same electorate, where fairness must be evaluated over time rather than per decision [16]. A central challenge in this setting is ensuring adequate representation for voters who are repeatedly in the minority. Traditional aggregation rules, such as majority voting or Borda count, fail in this regard: they offer no guarantees of long-term fairness or cumulative influence. In response, methods such as Perpetual Phragmén [17] and Perpetual Consensus [16] have been proposed to distribute influence more equitably over time. However, they rely on full knowledge of all voters' approval sets, implicitly requiring consistent voter participation, a condition which can be hard to satisfy in real-world contexts. Real-world elections face various practical constraints--including scheduling conflicts, limited resources, and restricted information access--that inevitably prevent voters from participating consistently.
Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games
This paper introduces a new approach for approximating the learning dynamics of multiple reinforcement learning (RL) agents interacting in a finite-state Markov game. The idea is to rescale the learning process by simultaneously reducing the learning rate and increasing the update frequency, effectively treating the agent's parameters as a slow-evolving variable influenced by the fast-mixing game state. Under mild assumptions (ergodicity of the state process and continuity of the updates), we prove the convergence of this rescaled process to an ordinary differential equation (ODE). This ODE provides a tractable, deterministic approximation of the agent's learning dynamics. An implementation of the framework is available at: https://github.com/yannKerzreho/MarkovGameApproximation
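The slow-parameter, fast-state rescaling can be illustrated with a toy averaging experiment. The sketch below is not the paper's framework or its released code. It assumes a single scalar parameter, a two-state chain whose transitions do not depend on the parameter, and a made-up update rule in which the parameter moves toward a state-dependent target; all names and constants are hypothetical. With learning rate eps and horizon/eps updates, the noisy trajectory should stay close to the ODE obtained by averaging the update over the chain's stationary distribution.

```python
import math
import random

# Toy two-state chain; every number here is an illustrative assumption.
LEAVE_PROB = (0.1, 0.3)  # per-step probability of leaving state 0 / state 1
TARGET = (0.0, 1.0)      # state-dependent target that the parameter chases
THETA0 = 2.0             # initial parameter value

def simulate(eps, horizon=5.0, seed=0):
    """Run round(horizon / eps) updates with learning rate eps.

    A smaller eps means smaller steps but more of them per unit of rescaled
    time, so the state mixes quickly relative to the parameter."""
    rng = random.Random(seed)
    state, theta = 0, THETA0
    for _ in range(round(horizon / eps)):
        theta += eps * (TARGET[state] - theta)  # slow parameter update
        if rng.random() < LEAVE_PROB[state]:    # fast state transition
            state = 1 - state
    return theta

def ode_solution(t):
    """Solution of d(theta)/dt = mean_target - theta, where mean_target
    averages TARGET over the chain's stationary distribution."""
    pi1 = LEAVE_PROB[0] / (LEAVE_PROB[0] + LEAVE_PROB[1])  # P(state = 1)
    mean_target = pi1 * TARGET[1] + (1 - pi1) * TARGET[0]
    return mean_target + (THETA0 - mean_target) * math.exp(-t)

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        gap = abs(simulate(eps) - ode_solution(5.0))
        print(f"eps={eps}: |theta - ODE| = {gap:.4f}")
```

In the paper's setting the state process also depends on the agents' parameters, which this toy ignores; it is only meant to show why shrinking the step size while increasing the number of updates yields a deterministic limit.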