A code testbed

Neural Information Processing Systems

This section gives an overview of our open-source code. Together with the git repository, we include a 'tutorial colab', a Jupyter notebook that can be run in the browser without requiring any local installation. We view this open-source effort as a major contribution of our paper. We present the testbed pseudocode in this section (recall the setup from Section 3.1), describe the other parameters we use in the Testbed, and describe the benchmark agents of Section 3.3. Step 3: compute likelihoods for n = 1, 2, . . .
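The "compute likelihoods" step reads as evaluating an agent's predictive class probabilities against held-out labels. A minimal sketch of that computation follows; the function name, array shapes, and the toy numbers are illustrative assumptions, not the testbed's actual API.

```python
import numpy as np

def average_negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-likelihood that predictive class probabilities
    `probs` (shape [n, num_classes]) assign to integer `labels` (shape [n])."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

# Toy predictive distribution over two classes for three test points.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
labels = np.array([0, 1, 1])
nll = average_negative_log_likelihood(probs, labels)  # mean NLL over the three points
```

Lower values indicate a predictive distribution that concentrates more mass on the observed labels; the `eps` clip guards against `log(0)` when a model assigns zero probability to the true class.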


Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

Happe, Andreas, Cito, Jürgen

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. Due to the opaque nature of LLMs, empirical methods are typically used to analyze their efficacy. The quality of this analysis is highly dependent on the chosen testbed, the metrics captured, and the analysis methods employed. This paper analyzes the methodology and benchmarking practices used for evaluating LLM-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 19 research papers detailing 18 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration testing scenarios.


What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Neural Information Processing Systems

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real-time, are an open challenge. In this work, we present the QEVD benchmark and dataset, which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching – a task which intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real-time.


Simulation to Reality: Testbeds and Architectures for Connected and Automated Vehicles

Klüner, David, Schäfer, Simon, Hegerath, Lucas, Xu, Jianye, Kahle, Julius, Ibrahim, Hazem, Kampmann, Alexandru, Alrifaee, Bassam

arXiv.org Artificial Intelligence

Ensuring the safe and efficient operation of CAVs relies heavily on the software framework used. A software framework needs to ensure real-time properties, reliable communication, and efficient resource utilization. Furthermore, a software framework needs to enable seamless transition between testing stages, from simulation to small-scale to full-scale experiments. In this paper, we survey prominent software frameworks used for in-vehicle and inter-vehicle communication in CAVs. We analyze these frameworks regarding opportunities and challenges, such as their real-time properties and transitioning capabilities. Additionally, we delve into the tooling requirements necessary for addressing the associated challenges. We illustrate the practical implications of these challenges through case studies focusing on critical areas such as perception, motion planning, and control. Furthermore, we identify research gaps in the field, highlighting areas where further investigation is needed to advance the development and deployment of safe and efficient CAV systems.


Resource Utilization Optimized Federated Learning

Zhang, Zihan (University of St Andrews, UK), Wong, Leon (Rakuten Mobile, Inc., Japan), Varghese, Blesson (University of St Andrews, UK)

arXiv.org Artificial Intelligence

Federated learning (FL) systems facilitate distributed machine learning across a server and multiple devices. However, FL systems have low resource utilization, limiting their practical use in the real world. This inefficiency primarily arises from two types of idle time: (i) task dependency between the server and devices, and (ii) stragglers among heterogeneous devices. This paper introduces FedOptima, a resource-optimized FL system designed to simultaneously minimize both types of idle time; existing systems do not eliminate or reduce both at the same time. First, devices operate independently of each other using asynchronous aggregation to eliminate straggler effects, and independently of the server by utilizing auxiliary networks to minimize idle time caused by task dependency. Second, the server performs centralized training using a task scheduler that ensures balanced contributions from all devices, improving model accuracy. Four state-of-the-art offloading-based and asynchronous FL methods are chosen as baselines. Experimental results show that, compared to the best results of the baselines on convolutional neural networks and transformers on multiple lab-based testbeds, FedOptima (i) achieves higher or comparable accuracy, (ii) accelerates training by 1.9× to 21.8×, (iii) reduces server and device idle time by up to 93.9% and 81.8%, respectively, and (iv) increases throughput by 1.1× to 2.0×.

Index Terms: federated learning, distributed system, resource utilization, idle time, edge computing

Federated learning (FL) [1]-[3] offers distributed training across user devices as an alternative to traditional centralized machine learning. Devices train a deep neural network (DNN) on their data and send model parameters to the server. The server aggregates these into a global model, which is then distributed to the devices for the next round. Thus, FL utilizes insight from user data via local models to train a global model without sharing the original data. Sub-optimal resource utilization is a critical problem in FL that results in two types of idle time on the server and devices (see Section II-A). The first is due to task dependency between server and devices: the server is idle for considerable periods when aggregating local models, as it waits for on-device training to complete, which is usually time-consuming. The second is due to hardware heterogeneity: stragglers, or slower devices, require more time to train than faster devices, which idle while waiting for the stragglers. Two categories of methods are considered in the existing literature for reducing idle time.
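The round structure described here (on-device training, parameter upload, weighted server-side aggregation, redistribution) can be sketched as follows. This is a minimal synchronous FedAvg-style sketch, not FedOptima's asynchronous design; the model size, client count, and the toy "training" step are illustrative assumptions.

```python
import numpy as np

def local_update(global_params, data, lr=0.1):
    """Stand-in for on-device DNN training: one gradient step that
    pulls the parameters toward the device's local data mean."""
    grad = global_params - data.mean(axis=0)
    return global_params - lr * grad

def fedavg(client_params, client_sizes):
    """Server-side aggregation: average local models weighted by
    the number of samples each device holds."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return np.average(np.stack(client_params), axis=0, weights=weights)

rng = np.random.default_rng(0)
global_params = np.zeros(4)
# Three heterogeneous devices holding different amounts of local data.
client_data = [rng.normal(loc=i, size=(20 + 10 * i, 4)) for i in range(3)]

for _ in range(5):  # five synchronous FL rounds
    local_models = [local_update(global_params, d) for d in client_data]
    global_params = fedavg(local_models, [len(d) for d in client_data])
```

In this synchronous form the server idles while every device trains, and fast devices idle on stragglers between rounds, which is exactly the idle time the paper targets.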


Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms: Challenges and a Roadmap

Xu, Jianye, Alrifaee, Bassam, Betz, Johannes, Mokhtarian, Armin, Mittal, Archak, Cai, Mengchi, Mangharam, Rahul, Shehata, Omar M., Elias, Catherine M., Zaech, Jan-Nico, Scheffe, Patrick, Jahncke, Felix, Ulhas, Sangeet Sankaramangalam, Arfvidsson, Kaj Munhoz

arXiv.org Artificial Intelligence

This article proposes a roadmap to address the current challenges in small-scale testbeds for Connected and Automated Vehicles (CAVs) and robot swarms. The roadmap is a joint effort of participants in the "1st Workshop on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms," held on June 2 at the IEEE Intelligent Vehicles Symposium (IV) 2024 in Jeju, South Korea. The roadmap contains three parts: 1) enhancing accessibility and diversity, especially for underrepresented communities, 2) sharing best practices for the development and maintenance of testbeds, and 3) connecting testbeds through an abstraction layer to support collaboration. The workshop featured eight invited speakers, four contributed papers [1]-[4], and a presentation of a survey paper on testbeds [5]. The survey paper provides an online comparative table of more than 25 testbeds, available at https://bassamlab.github.io/testbeds-survey. The workshop's website is available at https://cpm-remote.lrt.unibw-muenchen.de/iv24-workshop.


VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

Xu, Zelai, Yu, Chao, Zhang, Ruize, Yuan, Huining, Yi, Xiangmin, Ji, Shilong, Wang, Chuqi, Tang, Wenhao, Wang, Yu

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) has made significant progress, largely fueled by the development of specialized testbeds that enable systematic evaluation of algorithms in controlled yet challenging scenarios. However, existing testbeds often focus on purely virtual simulations or limited robot morphologies such as robotic arms, quadrupeds, and humanoids, leaving high-mobility platforms with real-world physical constraints like drones underexplored. To bridge this gap, we present VolleyBots, a new MARL testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots features a turn-based interaction model under volleyball rules, a hierarchical decision-making process that combines motion control and strategic play, and a high-fidelity simulation for seamless sim-to-real transfer. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative MARL and game-theoretic algorithms. Results in simulation show that while existing algorithms handle simple tasks effectively, they encounter difficulty in complex tasks that require both low-level control and high-level strategy. We further demonstrate zero-shot deployment of a simulation-learned policy to real-world drones, highlighting VolleyBots' potential to propel MARL research involving agile robotic platforms. The project page is at https://sites.google.com/view/thu-volleybots/home.


Past, Present, Future: A Comprehensive Exploration of AI Use Cases in the UMBRELLA IoT Testbed

Li, Peizheng, Mavromatis, Ioannis, Khan, Aftab

arXiv.org Artificial Intelligence

UMBRELLA is a large-scale, open-access Internet of Things (IoT) ecosystem incorporating over 200 multi-sensor multi-wireless nodes, 20 collaborative robots, and edge-intelligence-enabled devices. This paper provides a guide to the implemented and prospective artificial intelligence (AI) capabilities of UMBRELLA in real-world IoT systems. Four existing UMBRELLA applications are presented in detail: 1) an automated streetlight monitoring application that detects issues and triggers maintenance alerts; 2) a digital twin of building environments that provides enhanced air quality sensing at reduced cost; 3) a large-scale federated learning framework that reduces communication overhead; and 4) an intrusion detection system for containerised applications that identifies malicious activities. Additionally, the potential of UMBRELLA is outlined for future smart city and multi-robot crowdsensing applications enhanced by semantic communications and multi-agent planning. Finally, to realise the above use cases, we discuss the need for a tailored MLOps platform to automate UMBRELLA model pipelines and establish trust.


A Testbed for Automating and Analysing Mobile Devices and their Applications

Simpson, Lachlan, Millar, Kyle, Cheng, Adriel, Chew, Hong Gunn, Lim, Cheng-Chew

arXiv.org Artificial Intelligence

The need for improved network situational awareness has been highlighted by the growing complexity and severity of cyber-attacks. Mobile phones pose a significant risk to network situational awareness due to their dynamic behaviour and lack of visibility on a network. Machine learning techniques enhance situational awareness by providing administrators insight into the devices and activities which form their network. Developing machine learning techniques for situational awareness requires a testbed to generate and label network traffic. Current testbeds, however, are unable to automate the generation and labelling of realistic network traffic. To address this, we describe a testbed which automates applications on mobile devices to generate and label realistic traffic. From this testbed, two labelled datasets of network traffic have been created. We provide an analysis of the testbed's automation reliability and benchmark the datasets for the task of application classification.