Optimization
CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference
Xu, Guanyu, Hao, Zhiwei, Shen, Li, Luo, Yong, Sun, Fuhui, Wang, Xiaoyan, Hu, Han, Wen, Yonggang
The impressive performance of transformer models has spurred the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformers: an off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. The proposed DeBo algorithm first solves this optimization problem to derive the decomposition policy and then progressively calibrates the decomposed models to restore performance. We demonstrate that CoFormer supports a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1× inference speedup with large transformer models. Notably, CoFormer enables efficient inference of GPT2-XL, with 1.6 billion parameters, on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.
Guanyu Xu, Zhiwei Hao, and Han Hu are with the School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. Li Shen is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China. Yong Luo is with the School of Computer Science, National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan 430072, China. Fuhui Sun and Xiaoyan Wang are with the Information Technology Service Center of People's Court, Beijing 100745, China. Yonggang Wen is with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798.
CoFormer significantly outperforms other methods. Specifically, CoFormer accelerates inference by 3.1× compared to Swin-L [4] with only a 1.7% accuracy sacrifice.
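To make the "divisibility and integrability" idea concrete, here is a minimal sketch, assuming a single transformer feed-forward (FFN) block with an element-wise ReLU activation. Its hidden units are split across hypothetical devices in proportion to assumed relative speeds (`speeds`); each device computes a partial output, and the partials are summed. All function names are illustrative, not CoFormer's API. This tensor-parallel-style split is exact, whereas the abstract indicates that CoFormer's decomposition into independent smaller models costs some accuracy that DeBo recovers through calibration; that part is not modeled here.

```python
def ffn(x, W1, b1, W2, b2):
    """Full FFN block: ReLU(x @ W1 + b1) @ W2 + b2, using plain lists."""
    hidden = [max(0.0, sum(x[i] * W1[i][k] for i in range(len(x))) + b1[k])
              for k in range(len(b1))]
    return [sum(hidden[k] * W2[k][j] for k in range(len(hidden))) + b2[j]
            for j in range(len(b2))]


def split_hidden(n_hidden, speeds):
    """Assign contiguous hidden-unit ranges to devices, proportional to speed
    (largest-remainder rounding so the counts sum to n_hidden)."""
    total = sum(speeds)
    raw = [n_hidden * s / total for s in speeds]
    counts = [int(r) for r in raw]
    leftovers = sorted(range(len(speeds)), key=lambda m: raw[m] - counts[m], reverse=True)
    for m in leftovers[: n_hidden - sum(counts)]:
        counts[m] += 1
    slices, start = [], 0
    for c in counts:
        slices.append(range(start, start + c))
        start += c
    return slices


def partial_ffn(x, W1, b1, W2, idx):
    """What one device computes: only its hidden units, without the output bias."""
    out = [0.0] * len(W2[0])
    for k in idx:
        h = max(0.0, sum(x[i] * W1[i][k] for i in range(len(x))) + b1[k])
        for j in range(len(out)):
            out[j] += h * W2[k][j]
    return out


def collaborative_ffn(x, W1, b1, W2, b2, speeds):
    """Aggregate per-device partial outputs by summation, then add the bias once."""
    partials = [partial_ffn(x, W1, b1, W2, idx) for idx in split_hidden(len(b1), speeds)]
    return [sum(p[j] for p in partials) + b2[j] for j in range(len(b2))]
```

Because ReLU acts element-wise, each hidden unit contributes to the output independently of the others, so summing the per-device partials reproduces the full block's output up to floating-point rounding.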
Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services
Tan, Renxuan, Li, Rongpeng, Yu, Xiaoxue, Chen, Xianfu, Xu, Xing, Zhao, Zhifeng
Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework's superiority is validated through extensive simulations: PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
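Expectile regression, which the abstract pairs with Pareto Actor-Critic to model heterogeneous risk profiles, replaces the symmetric squared loss with the asymmetric loss rho_tau(u) = |tau - 1(u < 0)| * u^2. The sketch below computes the tau-expectile of a sample by iteratively reweighted averaging. It is a generic illustration of that loss, not the PAC-MCoFL critic: tau = 0.5 recovers the mean, while for a reward signal tau > 0.5 gives an optimistic (risk-seeking) statistic and tau < 0.5 a pessimistic (risk-averse) one.

```python
def expectile(samples, tau, tol=1e-12, max_iter=1000):
    """tau-expectile: the m minimizing sum_i |tau - 1(x_i < m)| * (x_i - m)**2.

    Setting the derivative to zero gives m = sum(w_i * x_i) / sum(w_i) with
    w_i = tau if x_i >= m else 1 - tau, which is iterated to a fixed point.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    m = sum(samples) / len(samples)  # the 0.5-expectile (the mean) is a natural start
    for _ in range(max_iter):
        weights = [tau if x >= m else 1.0 - tau for x in samples]
        m_new = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        converged = abs(m_new - m) < tol
        m = m_new
        if converged:
            break
    return m
```

For the sample `[1.0, 2.0, 3.0, 10.0]`, tau = 0.5 returns the mean 4.0, tau = 0.9 returns approximately 8.0, and tau = 0.1 returns approximately 2.0, so a single tau parameter shifts an agent's value estimate between cautious and optimistic.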
UAV-UGV Cooperative Trajectory Optimization and Task Allocation for Medical Rescue Tasks in Post-Disaster Environments
Chen, Kaiyuan, Zhao, Wanpeng, Liu, Yongxi, Xia, Yuanqing, Liang, Wannian, Wang, Shuo
In post-disaster scenarios, rapid and efficient delivery of medical resources is critical and challenging due to severe damage to infrastructure. To provide an optimized solution, we propose a cooperative trajectory optimization and task allocation framework leveraging unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). This study integrates a Genetic Algorithm (GA) for efficient task allocation among multiple UAVs and UGVs, and employs an informed-RRT* (Rapidly-exploring Random Tree Star) algorithm for collision-free trajectory generation. Further optimization of task sequencing and path efficiency is conducted using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Simulation experiments conducted in a realistic post-disaster environment demonstrate that our proposed approach significantly improves the overall efficiency of medical rescue operations compared to traditional strategies. Specifically, our method reduces the total mission completion time to 26.7 minutes for a 15-task scenario, outperforming K-Means clustering and random allocation by over 73%. Furthermore, the framework achieves a substantial 15.1% reduction in total traveled distance after CMA-ES optimization. The cooperative utilization of UAVs and UGVs effectively balances their complementary advantages, highlighting the system's scalability and practicality for real-world deployment.
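The step that distinguishes informed-RRT* from plain RRT* is its sampling rule. Once a path of cost c_best has been found, new samples are drawn only from the ellipse whose foci are the start and goal and whose major axis has length c_best, since only points there can lie on a shorter path. The sketch below shows that 2-D sampling step in isolation, assuming Euclidean path cost. Tree growth, rewiring, collision checking, and the GA and CMA-ES layers are omitted, and the function name is hypothetical.

```python
import math
import random


def sample_informed_ellipse(start, goal, c_best, rng):
    """Draw one point uniformly from the 2-D ellipse of all points p with
    dist(p, start) + dist(p, goal) <= c_best (the informed sampling set)."""
    c_min = math.dist(start, goal)
    if c_best < c_min:
        raise ValueError("c_best cannot be shorter than the straight-line distance")
    # 1) uniform point in the unit disk (the sqrt gives uniform area density)
    r = math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    ux, uy = r * math.cos(theta), r * math.sin(theta)
    # 2) stretch to the semi-axes: a along start->goal, b perpendicular to it
    a = c_best / 2.0
    b = math.sqrt(c_best ** 2 - c_min ** 2) / 2.0
    ex, ey = a * ux, b * uy
    # 3) rotate onto the start->goal direction and translate to the midpoint
    phi = math.atan2(goal[1] - start[1], goal[0] - start[0])
    cx, cy = (start[0] + goal[0]) / 2.0, (start[1] + goal[1]) / 2.0
    return (cx + ex * math.cos(phi) - ey * math.sin(phi),
            cy + ex * math.sin(phi) + ey * math.cos(phi))
```

Because a^2 - b^2 = c_min^2 / 4, the foci of this ellipse sit exactly at the start and goal. Before any solution exists (c_best is infinite), informed-RRT* samples the whole free space instead, and the ellipse shrinks each time a shorter path is found.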
Enhanced Trust Region Sequential Convex Optimization for Multi-Drone Thermal Screening Trajectory Planning in Urban Environments
Chen, Kaiyuan, Hu, Zhengjie, Zhang, Shaolin, Xia, Yuanqing, Liang, Wannian, Wang, Shuo
The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significant challenges, including collision avoidance, coverage efficiency, and constrained flight environments. In this study, we propose an enhanced trust region sequential convex optimization (TR-SCO) algorithm for optimal trajectory planning of multiple drones performing thermal screening tasks. Our improved algorithm integrates a refined convex optimization formulation within a trust region framework, effectively balancing trajectory smoothness, obstacle avoidance, altitude constraints, and maximum screening coverage. Simulation results demonstrate that our approach significantly improves trajectory optimality and computational efficiency compared to conventional convex optimization methods. This research provides critical insights and practical contributions toward deploying efficient multi-drone systems for real-time thermal screening in urban areas. This work is funded by the National Natural Science Foundation of China.
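Two ingredients recur in trust region sequential convex optimization for trajectories: nonconvex keep-out constraints are convexified around the previous iterate, and the trust region radius is adapted by comparing the actual cost reduction with the reduction the convex subproblem predicted. The sketch below shows both for a circular obstacle in 2-D. It is a generic textbook version, not the paper's enhanced formulation; the thresholds 0.25/0.75 and the factors 0.5/2 are assumed conventional values, and the convex subproblem solver itself is omitted.

```python
import math


def linearize_obstacle(p_prev, center, radius):
    """Convexify the keep-out constraint ||p - center|| >= radius around p_prev.

    Returns (n, b) for the half-plane n . p >= b, where n is the unit vector from
    the obstacle center toward p_prev. Since ||p - c|| >= n . (p - c) by
    Cauchy-Schwarz, every p in the half-plane also satisfies the original
    constraint (a conservative inner approximation).
    """
    dx, dy = p_prev[0] - center[0], p_prev[1] - center[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("linearization point coincides with the obstacle center")
    n = (dx / norm, dy / norm)
    b = n[0] * center[0] + n[1] * center[1] + radius
    return n, b


def update_trust_region(radius, actual_reduction, predicted_reduction,
                        low=0.25, high=0.75, shrink=0.5, grow=2.0):
    """Ratio test: compare the true cost decrease with the decrease the convex
    subproblem predicted, then shrink or enlarge the trust region."""
    if predicted_reduction > 0:
        rho = actual_reduction / predicted_reduction
    else:
        rho = -math.inf
    accept = rho > 0.0  # keep the new trajectory only if the true cost fell
    if rho < low:
        radius *= shrink
    elif rho > high:
        radius *= grow
    return radius, accept
```

Many practical variants enlarge the radius only when the step actually reached the trust region boundary; this sketch grows it whenever the model prediction was accurate, which keeps the logic short.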
Staircase Recognition and Location Based on Polarization Vision
Staircase perception is critical for humanoid robots and mobility-impaired individuals, yet existing methods suffer from low accuracy, sensitivity to lighting, and dependence on texture. To address this, we propose a novel polarization-visual fusion framework that achieves robust staircase detection and high-precision three-dimensional (3D) reconstruction, establishing a three-stage paradigm: staircase recognition, heterogeneous sensor calibration (monocular and TOF cameras), and polarization 3D reconstruction. First, we improve a YOLOv11-based staircase recognition algorithm by integrating a polarization-intensity contrast enhancement algorithm and point cloud segmentation; by suppressing reflections and correcting errors with redundant point cloud information, it reaches a recognition accuracy of 98.7% ± 0.10%. Then, an improved gray wolf optimizer with Lévy flight and dynamic weights is employed to achieve accurate calibration (0.33 ± 0.04 mm error) between the heterogeneous-resolution cameras. Finally, we propose a method that fuses polarized binocular and TOF depth information to reconstruct the staircase in 3D. To handle the ambiguity in polarization reconstruction and the data holes in binocular reconstruction, binocular vision is used to correct the polarization azimuth ambiguity, and TOF is used to fill the data holes left by stereo matching. Experiments show that our method achieves <0.2% reconstruction error at 0.5 m, significantly outperforming binocular (surface distortion) and polarization-based (normal vector ambiguity) approaches. This technology provides accurate terrain adaptation for robotic foothold planning.
INTRODUCTION
As a common scene, the staircase hinders the traversal of humanoid robots, legged robots, and individuals with lower-limb disabilities or visual impairments because of its special physical structure.
Accurate staircase recognition is a prerequisite for navigation and control, and it has attracted the attention of many scholars [1],[2],[3]. Staircase recognition is of great significance for mode switching and foothold position calculation in robots, and it can improve their overall performance in staircase scenes. As a common terrain, stairs are very difficult for humanoid robots and for people with lower-limb disabilities or visual impairments. Therefore, it is of great significance to design a staircase scene perception algorithm. At present, staircase recognition is mainly applied in the fields of rehabilitation medicine and humanoid robots [4].
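The calibration step relies on a gray wolf optimizer (GWO) enhanced with Lévy flight. The sketch below shows the standard GWO update (each wolf moves toward an average of positions estimated from the alpha, beta, and delta leaders, with the coefficient a decaying from 2 to 0) plus an additive Lévy-flight perturbation drawn with Mantegna's algorithm. Several choices are assumptions, not the paper's method: the form and scale of the Lévy term, greedy acceptance, and the equal 1/3 leader weights (the paper's dynamic weighting is not reproduced). In a real calibration, `f` would be an alignment or reprojection error between the two cameras; here a sphere function stands in for it.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """One Levy-stable step via Mantegna's algorithm (occasional long jumps)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return u / max(abs(v), 1e-12) ** (1 / beta)


def gwo_levy(f, dim, lo, hi, n_wolves=20, iters=200, levy_scale=0.01, seed=0):
    """Grey wolf optimizer with an additive Levy perturbation and greedy acceptance."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=f)
        leaders = [w[:] for w in wolves[:3]]  # alpha, beta, delta
        a = 2.0 - 2.0 * t / iters             # decreases linearly from 2 toward 0
        new_pack = []
        for w in wolves:
            pos = []
            for d in range(dim):
                est = 0.0
                for lead in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    est += lead[d] - A * abs(C * lead[d] - w[d])
                x = est / 3.0 + levy_scale * levy_step(rng) * (w[d] - leaders[0][d])
                pos.append(min(max(x, lo), hi))  # keep within the search bounds
            new_pack.append(pos if f(pos) < f(w) else w)  # greedy: never get worse
        wolves = new_pack
    best = min(wolves, key=f)
    return best, f(best)
```

The heavy-tailed Lévy jumps help wolves escape local minima early on, while greedy acceptance guarantees that the best error found never increases between iterations.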
Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise
Yang, Xiaoxuan, Belakaria, Syrine, Joardar, Biresh Kumar, Yang, Huanrui, Doppa, Janardhan Rao, Pande, Partha Pratim, Chakrabarty, Krishnendu, Li, Hai
Resistive random-access memory (ReRAM) is a promising technology for designing hardware accelerators for deep neural network (DNN) inferencing. We propose the design and optimization of a high-performance, area- and energy-efficient ReRAM-based hardware accelerator to achieve robust DNN inferencing in the presence of stochastic noise. We make two key technical contributions. First, we propose a stochastic-noise-aware training method, referred to as ReSNA, to improve the accuracy of DNN inferencing on ReRAM crossbars with stochastic noise. Second, we propose an information-theoretic algorithm, referred to as CF-MESMO, to identify the Pareto set of solutions that trade off multiple objectives, including inferencing accuracy, area overhead, execution time, and energy consumption. The main challenge in this context is that executing the ReSNA method to evaluate each candidate ReRAM design is prohibitively expensive. To address this challenge, we use continuous-fidelity evaluation of ReRAM designs, varying the number of training epochs to trade off accuracy against computation cost. CF-MESMO iteratively selects the candidate ReRAM design and fidelity pair that maximizes the information gained per unit computation cost about the optimal Pareto front. Our experiments on benchmark DNNs show that the proposed algorithms efficiently uncover high-quality Pareto fronts. On average, ReSNA achieves a 2.57% inferencing accuracy improvement for ResNet20 on the CIFAR-10 dataset with respect to the baseline configuration. Moreover, the CF-MESMO algorithm achieves 90.
Resistive random access memory (ReRAM) has emerged as a promising nonvolatile memory technology due to its multi-level cell capability, small cell size, and low access time and energy consumption. Prior work has shown that the crossbar structure of ReRAM arrays can efficiently execute matrix-vector multiplication [1], [2], the predominant computational kernel in deep neural networks (DNNs).
ReRAM-based accelerators for fast and efficient DNN training and inferencing have been extensively studied [3]-[8]. However, a key challenge in executing DNN inferencing [9]-[11] on ReRAM-based architectures arises from the nonidealities of ReRAM devices, which can degrade inferencing accuracy.
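The crossbar matrix-vector multiplication and the effect of stochastic noise can be illustrated with a small simulation. Each cell's conductance acts as a weight, each row voltage as an input, and each column current is the dot product (Ohm's law per cell, Kirchhoff's current law per column). The multiplicative Gaussian noise model below is an assumption chosen for illustration, not the device model used for ReSNA; noise-aware training in this spirit would run the forward pass through such a noisy read so the network learns weights that tolerate it.

```python
import random


def crossbar_mvm(G, v, noise_std=0.0, rng=None):
    """Analog matrix-vector product on an idealized ReRAM crossbar.

    G[i][j] is the conductance at the crossing of input row i and output
    column j; applying voltage v[i] to row i yields column current
    I_j = sum_i v[i] * G[i][j]. With noise_std > 0, each conductance is
    scaled by an assumed random factor (1 + N(0, noise_std**2)) on every read.
    """
    if rng is None:
        rng = random.Random(0)
    currents = [0.0] * len(G[0])
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            if noise_std > 0.0:
                g *= 1.0 + rng.gauss(0.0, noise_std)  # stochastic read noise
            currents[j] += v[i] * g
    return currents
```

With `noise_std=0.0` the result is the exact product, for example `crossbar_mvm([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])` gives `[4.0, 6.0]`; with noise enabled, individual reads scatter around that value, which is the source of the inferencing accuracy loss described above.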
InSQuAD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity
Nanda, Souradeep, Majee, Anay, Iyer, Rishabh
In this paper, we introduce InSQuAD, designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI), enforcing Quality and Diversity among in-context exemplars. InSQuAD achieves this through two principal strategies. First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMIs that mines relevant yet diverse in-context examples, encapsulating the notions of quality and diversity. Second, we address a common pitfall of existing retrieval models, which model query relevance but often overlook diversity, a property critical for ICL. InSQuAD introduces a combinatorial training paradigm that learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process, we augment an existing multi-hop question answering dataset with synthetically generated paraphrases. Using the retrieval model trained with this strategy alongside the novel targeted selection formulation for ICL yields significant improvements on nine benchmark datasets, validating the efficacy of our approach.
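The abstract does not state which SMI function InSQuAD uses. As an illustration of targeted selection, the sketch below greedily maximizes one common facility-location variant of submodular mutual information, I(A; Q) = sum over candidates i of min(max over j in A of s(i, j), eta * max over q in Q of s(i, q)). It rewards selecting exemplars that represent query-relevant candidates (quality) and gives no credit for near-duplicates of what is already selected (diversity). The similarity values, `eta`, and function names are assumptions, and InSQuAD's learned SMI parameters are not modeled. Each term is a truncated facility-location function, so the objective is monotone submodular and greedy selection enjoys the usual (1 - 1/e) approximation guarantee.

```python
def flvmi(selected, sim_vv, sim_vq, eta=1.0):
    """Facility-location SMI: sum_i min(max_{j in A} s(i, j), eta * max_q s(i, q))."""
    total = 0.0
    for i, row in enumerate(sim_vv):
        coverage = max((row[j] for j in selected), default=0.0)  # how well A represents i
        relevance_cap = eta * max(sim_vq[i])                     # how query-relevant i is
        total += min(coverage, relevance_cap)
    return total


def greedy_select(k, sim_vv, sim_vq, eta=1.0):
    """Pick up to k exemplars by repeatedly adding the largest marginal SMI gain."""
    selected = []
    for _ in range(min(k, len(sim_vv))):
        current = flvmi(selected, sim_vv, sim_vq, eta)
        best_j, best_gain = None, float("-inf")
        for j in range(len(sim_vv)):
            if j in selected:
                continue
            gain = flvmi(selected + [j], sim_vv, sim_vq, eta) - current
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
    return selected
```

In a toy pool where candidates 0 and 1 are near-duplicates (similarity 0.95) and both highly relevant, candidate 2 is relevant but distinct, and candidate 3 is irrelevant, `greedy_select(2, ...)` picks 0 and then 2: adding 1 after 0 yields zero marginal gain.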
COMETH: Convex Optimization for Multiview Estimation and Tracking of Humans
Martini, Enrico, Choi, Ho Jin, Figueroa, Nadia, Bombieri, Nicola
In the era of Industry 5.0, monitoring human activity is essential for ensuring both ergonomic safety and overall well-being. While multi-camera centralized setups improve pose estimation accuracy, they often suffer from high computational costs and bandwidth requirements, limiting scalability and real-time applicability. Distributing processing across edge devices can reduce network bandwidth and computational load. On the other hand, the constrained resources of edge devices lead to accuracy degradation, and the distribution of computation leads to temporal and spatial inconsistencies. We address this challenge by proposing COMETH (Convex Optimization for Multiview Estimation and Tracking of Humans), a lightweight algorithm for real-time multi-view human pose fusion that relies on three concepts: it integrates kinematic and biomechanical constraints to increase the joint positioning accuracy; it employs convex optimization-based inverse kinematics for spatial fusion; and it implements a state observer to improve temporal consistency. We evaluate COMETH on both public and industrial datasets, where it outperforms state-of-the-art methods in localization, detection, and tracking accuracy. The proposed fusion pipeline enables accurate and scalable human motion tracking, making it well-suited for industrial and safety-critical applications. The code is publicly available at https://github.com/PARCO-LAB/COMETH.
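The abstract does not specify COMETH's state observer. As a hedged stand-in, the sketch below shows one of the simplest state observers, a fixed-gain alpha-beta filter under a constant-velocity model, applied to a single joint coordinate. The gains are illustrative assumptions that trade noise suppression against lag; a full pipeline would run one such observer per joint axis on the spatially fused positions.

```python
def alpha_beta_track(measurements, dt, alpha=0.5, beta=0.1, v0=0.0):
    """Fixed-gain observer for one coordinate under a constant-velocity model."""
    x, v = measurements[0], v0
    estimates = []
    for z in measurements:
        x_pred = x + v * dt          # predict position from the current velocity
        residual = z - x_pred        # innovation: measurement minus prediction
        x = x_pred + alpha * residual
        v = v + (beta / dt) * residual
        estimates.append(x)
    return estimates
```

On a noise-free ramp (a joint moving at constant speed), the velocity estimate converges to the true speed and the position error decays to zero; on noisy input, smaller gains give smoother but slower-reacting trajectories.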
CoCoL: A Communication Efficient Decentralized Collaborative Method for Multi-Robot Systems
Huang, Jiaxi, Huang, Yan, Zhao, Yixian, Meng, Wenchao, Xu, Jinming
Collaborative learning enhances the performance and adaptability of multi-robot systems in complex tasks but faces significant challenges due to the high communication overhead and data heterogeneity inherent in multi-robot tasks. To this end, we propose CoCoL, a Communication-efficient decentralized Collaborative Learning method tailored for multi-robot systems with heterogeneous local datasets. Leveraging a mirror descent framework, CoCoL achieves remarkable communication efficiency with approximate Newton-type updates that capture the similarity between the robots' objective functions, and it reduces computational costs through inexact sub-problem solutions. Furthermore, the integration of a gradient tracking scheme ensures its robustness against data heterogeneity. Experimental results on three representative multi-robot collaborative learning tasks show that the proposed CoCoL can significantly reduce both the number of communication rounds and total bandwidth consumption while maintaining state-of-the-art accuracy. These benefits are particularly evident in challenging scenarios involving non-IID (non-independent and identically distributed) data distribution, streaming data, and time-varying network topologies.
I. INTRODUCTION
Multi-robot systems offer the ability to tackle complex tasks through proper collaboration with enhanced efficiency, robustness, and flexibility compared to single-robot systems [1]. By sharing information, a team of robots can leverage collective knowledge to make more informed decisions and accomplish tasks in a coordinated manner.
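CoCoL's mirror-descent, approximate-Newton updates are not reproduced here. The sketch isolates only the gradient tracking idea that the abstract credits with robustness to data heterogeneity, in its plain first-order form (as in DIGing-style methods): each robot mixes its neighbors' iterates through a doubly stochastic matrix `W` and maintains a tracker `y[i]` of the network-average gradient. The local objectives are assumed to be scalar quadratics f_i(x) = a_i (x - b_i)^2 / 2 with different minimizers b_i, mimicking non-IID data; the ring topology and step size are also assumptions.

```python
def gradient_tracking(a, b, W, alpha=0.05, iters=2000):
    """Decentralized minimization of sum_i a[i] * (x - b[i])**2 / 2 over a graph."""
    n = len(a)

    def grad(i, x):
        return a[i] * (x - b[i])  # gradient of robot i's local objective

    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # trackers start at the local gradients
    for _ in range(iters):
        # consensus on iterates, then a step along the tracked average gradient
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i] for i in range(n)]
        # tracker update: mix neighbors' trackers and add the local gradient change
        y = [sum(W[i][j] * y[j] for j in range(n)) + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x
```

Because `W` is doubly stochastic, the trackers always sum to the sum of the current local gradients, so every robot converges to the global minimizer sum(a_i * b_i) / sum(a_i) rather than drifting toward its own b_i, which is what plain decentralized gradient descent with a constant step would do under heterogeneous data.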
GPT-FT: An Efficient Automated Feature Transformation Using GPT for Sequence Reconstruction and Performance Enhancement
Gao, Yang, Wang, Dongjie, Piersall, Scott, Zhang, Ye, Wang, Liqiang
Feature transformation plays a critical role in enhancing machine learning model performance by optimizing data representations. Recent state-of-the-art approaches address this task as a continuous embedding optimization problem, converting discrete search into a learnable process. Although effective, these methods often rely on sequential encoder-decoder structures that incur high computational costs and large parameter counts, limiting scalability and efficiency. To address these limitations, we propose a novel framework that accomplishes automated feature transformation through four steps: transformation records collection, embedding space construction with a revised Generative Pre-trained Transformer (GPT) model, gradient-ascent search, and autoregressive reconstruction. In our approach, the revised GPT model serves two primary functions: (a) feature transformation sequence reconstruction and (b) model performance estimation and enhancement for downstream tasks by constructing the embedding space. Such a multi-objective optimization framework reduces parameter size and accelerates transformation processes. Experimental results on benchmark datasets show that the proposed framework matches or exceeds baseline performance, with significant gains in computational efficiency. This work highlights the potential of transformer-based architectures for scalable, high-performance automated feature transformation.
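The gradient-ascent search step can be sketched generically. Starting from the embedding of a good transformation sequence, the search moves the embedding in the direction that increases the estimator's predicted downstream performance. In GPT-FT the estimator is part of the revised GPT model and gradients would come from backpropagation; here `score` is a hypothetical black-box stand-in, and the gradient is approximated by central finite differences. Decoding the improved embedding back into a transformation sequence (the autoregressive reconstruction step) is not shown.

```python
def gradient_ascent(score, e0, lr=0.1, steps=100, h=1e-5):
    """Climb a black-box score over an embedding vector using central differences."""
    e = list(e0)
    for _ in range(steps):
        grad = []
        for d in range(len(e)):
            up, down = e[:], e[:]
            up[d] += h
            down[d] -= h
            grad.append((score(up) - score(down)) / (2.0 * h))
        e = [ei + lr * gi for ei, gi in zip(e, grad)]  # ascent: move along +gradient
    return e
```

With a toy concave score such as `-((e[0] - 1)**2 + (e[1] + 2)**2)`, starting from `[0.0, 0.0]` the embedding converges to the maximizer `[1, -2]`. A real estimator is non-concave, so in practice one would run several short ascents from different high-scoring starting embeddings.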