 middleware


xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

Shi, Jiabo, Pezaros, Dimitrios, Elkhatib, Yehia

arXiv.org Artificial Intelligence

The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or historical data with machine learning often fail to accurately capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements a priori. We conducted a thorough evaluation of xMem against state-of-the-art solutions using workloads from 25 different models, including architectures like Convolutional Neural Networks and Transformers. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits: it decreases the median relative error by 91% and significantly reduces the probability of estimation failure as safe OOM thresholds by 75%, meaning that the estimated value can often be used directly without causing OOM. Ultimately, these improvements lead to a 368% increase in memory conservation potential over current solutions.
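As a rough illustration of the kind of metric quoted above, the sketch below computes a median relative error over (estimated, actual) peak-memory pairs and flags estimates that would lead to an OOM. The abstract does not give its exact definitions, so the formulas, the function names, and the sample values here are assumptions for illustration only, not xMem's evaluation code.

```python
from statistics import median
from typing import Iterable, Tuple


def relative_error(estimated: float, actual: float) -> float:
    """Assumed definition: |estimated - actual| / actual."""
    if actual <= 0:
        raise ValueError("actual peak memory must be positive")
    return abs(estimated - actual) / actual


def median_relative_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Median of the per-run relative errors over (estimated, actual) pairs."""
    return median(relative_error(est, act) for est, act in pairs)


def would_oom(estimated: float, actual: float) -> bool:
    """Assumption: reserving less than the true peak leads to an OOM."""
    return estimated < actual


# Hypothetical runs: (estimated GiB, measured peak GiB).
runs = [(9.0, 10.0), (12.0, 10.0), (10.0, 10.0)]
print(median_relative_error(runs))
print([would_oom(est, act) for est, act in runs])
```

Under these assumed definitions, an underestimate and an overestimate of the same size give the same relative error, but only the underestimate risks an OOM. That may be why the abstract reports estimation failure separately from error.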


Design Process of a Self Adaptive Smart Serious Games Ecosystem

Tao, X., Chen, P., Tsami, M., Khayati, F., Eckert, M.

arXiv.org Artificial Intelligence

This paper outlines the design vision and planned evolution of Blexer v3, a modular and AI-driven rehabilitation ecosystem based on serious games. Building on insights from previous versions of the system, we propose a new architecture that aims to integrate multimodal sensing, real-time reasoning, and intelligent control. The envisioned system will include distinct modules for data collection, user state inference, and gameplay adaptation. Key features such as dynamic difficulty adjustment (DDA) and procedural content generation (PCG) are also considered to support personalized interventions. We present the complete conceptual framework of Blexer v3, which defines the modular structure and data flow of the system. This serves as the foundation for the next phase: the development of a functional prototype and its integration into clinical rehabilitation scenarios. Video games have evolved significantly since their inception in the 1960s, becoming a cultural force in the late 1980s and early 1990s [1]. With the growth of the videogame industry, games have expanded into fields such as education, military, and healthcare, known as Serious Games (SGs) [2]. In healthcare, SGs have shown promise in screening [3] and rehabilitation [4].


Service Discovery-Based Hybrid Network Middleware for Efficient Communication in Distributed Robotic Systems

Sang, Shiyao, Ling, Yinggang

arXiv.org Artificial Intelligence

Robotic middleware is fundamental to ensuring reliable communication among system components and is crucial for intelligent robotics, autonomous vehicles, and smart manufacturing. However, existing robotic middleware often struggles to meet the diverse communication demands, optimize data transmission efficiency, and maintain scheduling determinism between Orin computing units in large-scale L4 autonomous vehicle deployments. This paper presents RIMAOS2C, a service discovery-based hybrid network communication middleware designed to tackle these challenges. By leveraging multi-level service discovery multicast, RIMAOS2C supports a wide variety of communication modes, including multiple cross-chip Ethernet protocols and PCIe communication capabilities. Its core mechanism, the Message Bridge, optimizes data flow forwarding and employs shared memory for centralized message distribution, reducing message redundancy and minimizing transmission delay uncertainty. Tested on L4 vehicles and Jetson Orin domain controllers, RIMAOS2C leverages TCP-based ZeroMQ to overcome the large-message transmission bottleneck in native CyberRT. In scenarios with two cross-chip subscribers, it eliminates message redundancy and improves large-data transmission efficiency by 36 to 40 percent while reducing callback latency variation by 42 to 906 percent. This research advances the communication capabilities of robotic operating systems and proposes a novel approach to optimizing communication in distributed computing architectures for autonomous driving.


ROS 2 Agnocast: Supporting Unsized Message Types for True Zero-Copy Publish/Subscribe IPC

Ishikawa-Aso, Takahiro, Kato, Shinpei

arXiv.org Artificial Intelligence

Robot applications, comprising independent components that mutually publish/subscribe messages, are built on inter-process communication (IPC) middleware such as Robot Operating System 2 (ROS 2). In large-scale ROS 2 systems like autonomous driving platforms, true zero-copy communication -- eliminating serialization and deserialization -- is crucial for efficiency and real-time performance. However, existing true zero-copy middleware solutions lack widespread adoption as they fail to meet three essential requirements: 1) Support for all ROS 2 message types including unsized ones; 2) Minimal modifications to existing application code; 3) Selective implementation of zero-copy communication between specific nodes while maintaining conventional communication mechanisms for other inter-node communications including inter-host node communications. This first requirement is critical, as production-grade ROS 2 projects like Autoware rely heavily on unsized message types throughout their codebase to handle diverse use cases (e.g., various sensors), and depend on the broader ROS 2 ecosystem, where unsized message types are pervasive in libraries. The remaining requirements facilitate seamless integration with existing projects. While IceOryx middleware, a practical true zero-copy solution, meets all but the first requirement, other studies achieving the first requirement fail to satisfy the remaining criteria. This paper presents Agnocast, a true zero-copy IPC framework applicable to ROS 2 C++ on Linux that fulfills all these requirements. Our evaluation demonstrates that Agnocast maintains constant IPC overhead regardless of message size, even for unsized message types. In Autoware PointCloud Preprocessing, Agnocast achieves a 16% improvement in average response time and a 25% improvement in worst-case response time.


UnifyFL: Enabling Decentralized Cross-Silo Federated Learning

S, Sarang, Dhakshinamoorthy, Druva, Sharma, Aditya Shiva, Bhadauria, Yuvraj Singh, Vivek, Siddharth Chaitra, Bansal, Arihant, Paul, Arnab K.

arXiv.org Artificial Intelligence

Federated Learning (FL) is a decentralized machine learning (ML) paradigm in which models are trained on private data across several devices called clients and combined at a single node called an aggregator rather than aggregating the data itself. Many organizations employ FL to have better privacy-aware ML-driven decision-making capabilities. However, organizations often operate independently rather than collaborate to enhance their FL capabilities due to the lack of an effective mechanism for collaboration. The challenge lies in balancing trust and resource efficiency. One approach relies on trusting a third-party aggregator to consolidate models from all organizations (multilevel FL), but this requires trusting an entity that may be biased or unreliable. Alternatively, organizations can bypass a third party by sharing their local models directly, which requires significant computational resources for validation. Both approaches reflect a fundamental trade-off between trust and resource constraints, with neither offering an ideal solution. In this work, we develop a trust-based cross-silo FL framework called UnifyFL, which uses decentralized orchestration and distributed storage. UnifyFL provides flexibility to the participating organizations and presents synchronous and asynchronous modes to handle stragglers. Our evaluation on a diverse testbed shows that UnifyFL achieves a performance comparable to the ideal multilevel centralized FL while allowing trust and optimal use of resources.
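The aggregation step described at the start of this abstract, where clients train locally and an aggregator combines their models rather than their data, can be sketched as a sample-weighted average of client parameters. This is the generic federated averaging idea, not UnifyFL's own orchestration, which the abstract does not detail. The function name and the flat-list model representation are assumptions for illustration.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine client models into one, weighting each by its sample count.

    Each model is represented as a flat list of parameters (an assumption
    made to keep the sketch dependency-free).
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client model")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all client models must have the same shape")
        for i, value in enumerate(weights):
            # Only model parameters reach the aggregator, never raw data.
            averaged[i] += value * size / total
    return averaged


# Two hypothetical clients holding 1 and 3 training samples respectively.
print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In this single-aggregator form, every participant has to trust the node that runs `federated_average`. That is the trust concern the abstract raises for multilevel FL.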


Simulation to Reality: Testbeds and Architectures for Connected and Automated Vehicles

Klüner, David, Schäfer, Simon, Hegerath, Lucas, Xu, Jianye, Kahle, Julius, Ibrahim, Hazem, Kampmann, Alexandru, Alrifaee, Bassam

arXiv.org Artificial Intelligence

Ensuring the safe and efficient operation of CAVs relies heavily on the software framework used. A software framework needs to ensure real-time properties, reliable communication, and efficient resource utilization. Furthermore, a software framework needs to enable seamless transition between testing stages, from simulation to small-scale to full-scale experiments. In this paper, we survey prominent software frameworks used for in-vehicle and inter-vehicle communication in CAVs. We analyze these frameworks regarding opportunities and challenges, such as their real-time properties and transitioning capabilities. Additionally, we delve into the tooling requirements necessary for addressing the associated challenges. We illustrate the practical implications of these challenges through case studies focusing on critical areas such as perception, motion planning, and control. Furthermore, we identify research gaps in the field, highlighting areas where further investigation is needed to advance the development and deployment of safe and efficient CAV systems.


CrowdHMTware: A Cross-level Co-adaptation Middleware for Context-aware Mobile DL Deployment

Liu, Sicong, Guo, Bin, Luo, Shiyan, Wang, Yuzhan, Luo, Hao, Fang, Cheng, Xu, Yuan, Ma, Ke, Li, Yao, Yu, Zhiwen

arXiv.org Artificial Intelligence

There are many deep learning (DL) powered mobile and wearable applications today continuously and unobtrusively sensing the ambient surroundings to enhance all aspects of human lives. To enable robust and private mobile sensing, DL models are often deployed locally on resource-constrained mobile devices using techniques such as model compression or offloading. However, existing methods, whether at the front-end algorithm level (i.e. DL model compression/partitioning) or the back-end scheduling level (i.e. operator/resource scheduling), cannot operate locally online because they require offline retraining to ensure accuracy or rely on manually pre-defined strategies, and so struggle with dynamic adaptability. The primary challenge lies in feeding back runtime performance from the back-end level to the front-end level optimization decision. Moreover, adaptive mobile DL model porting middleware with cross-level co-adaptation is less explored, particularly in mobile environments with diversity and dynamics. In response, we introduce CrowdHMTware, a dynamic context-adaptive DL model deployment middleware for heterogeneous mobile devices. It establishes an automated adaptation loop between cross-level functional components, i.e. elastic inference, scalable offloading, and model-adaptive engine, enhancing scalability and adaptability. Experiments with four typical tasks across 15 platforms and a real-world case study demonstrate that CrowdHMTware can effectively scale DL model, offloading, and engine actions across diverse platforms and tasks. It hides run-time system issues from developers, reducing the required developer expertise.


Open-Source Autonomous Driving Software Platforms: Comparison of Autoware and Apollo

Jung, Hee-Yang, Paek, Dong-Hee, Kong, Seung-Hyun

arXiv.org Artificial Intelligence

A full-stack autonomous driving system spans diverse technological domains, including perception, planning, and control, each of which requires in-depth research. Moreover, validating the technologies of such a system necessitates extensive supporting infrastructure, from simulators and sensors to high-definition maps. This complexity raises the barrier to entry and poses substantial limitations for individual developers and research groups. Recently, open-source autonomous driving software platforms have emerged to address this challenge by providing autonomous driving technologies and practical supporting infrastructure for implementing and evaluating autonomous driving functionalities. Among the prominent open-source platforms, Autoware and Apollo are frequently adopted in both academia and industry. While previous studies have assessed each platform independently, few have offered a quantitative and detailed head-to-head comparison of their capabilities. In this paper, we systematically examine the core modules of Autoware and Apollo and evaluate their middleware performance to highlight key differences. These insights serve as a practical reference for researchers and engineers, guiding them in selecting the most suitable platform for their specific development environments and advancing the field of full-stack autonomous driving systems.


FastDDS-Based Middleware System for Remote X-Ray Image Classification Using Raspberry Pi

Khater, Omar H., Almadani, Basem, Aliyu, Farouq

arXiv.org Artificial Intelligence

Internet of Things (IoT) based healthcare systems offer significant potential for improving the delivery of healthcare services in humanitarian engineering, providing essential healthcare services to millions of underserved people in remote areas worldwide. However, these areas have poor network infrastructure, making communications difficult for traditional IoT. This paper presents a real-time chest X-ray classification system for hospitals in remote areas using FastDDS real-time middleware, offering reliable real-time communication. We fine-tuned a ResNet50 neural network to an accuracy of 88.61%, a precision of 88.76%, and a recall of 88.49%. Our system results mark an average throughput of 3.2 KB/s and an average latency of 65 ms. The proposed system demonstrates how middleware-based systems can assist doctors in remote locations.


Modern Middlewares for Automated Vehicles: A Tutorial

Klüner, David Philipp, Molz, Marius, Kampmann, Alexandru, Kowalewski, Stefan, Alrifaee, Bassam

arXiv.org Artificial Intelligence

This paper offers a tutorial on current middlewares in automated vehicles. Our aim is to provide the reader with an overview of current middlewares and to identify open challenges in this field. We start by explaining the fundamentals of software architecture in distributed systems and the distinguishing requirements of Automated Vehicles. We then distinguish between communication middlewares and architecture platforms and highlight their key principles and differences. Next, we present five state-of-the-art middlewares as well as their capabilities and functions. We explore how these middlewares could be applied in the design of future vehicle software and their role in the automotive domain. Finally, we compare the five middlewares presented and discuss open research challenges.