PSformer: Parameter-efficient Transformer with Segment Attention for Time Series Forecasting
Wang, Yanlong, Xu, Jian, Ma, Fei, Huang, Shao-Lun, Sun, Danny Dongning, Zhang, Xiao-Ping
Time series forecasting remains a critical challenge across various domains, often complicated by high-dimensional data and long-term dependencies. This paper presents a novel transformer architecture for time series forecasting that incorporates two key innovations: parameter sharing (PS) and Spatial-Temporal Segment Attention (SegAtt). We define a time series segment as the concatenation of sequence patches taken from the same positions across different variables. The proposed model, PSformer, reduces the number of training parameters through the parameter sharing mechanism, thereby improving model efficiency and scalability. SegAtt enhances the capture of local spatio-temporal dependencies by computing attention over segments, and improves the global representation by integrating information across segments. Together, parameter sharing and SegAtt significantly improve forecasting performance. Extensive experiments on benchmark datasets demonstrate that PSformer outperforms popular baselines and other transformer-based approaches in terms of accuracy and scalability, establishing itself as an accurate and scalable tool for time series forecasting.
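To make the segment construction concrete, here is a minimal PyTorch sketch of how segment attention might look, assuming an input of shape (batch, variables, length); the names (SegAtt, patch_len) and shapes are illustrative assumptions based on our reading of the abstract, not the authors' implementation. The final loop illustrates parameter sharing by reusing a single block across stacked layers.

```python
import torch
import torch.nn as nn

class SegAtt(nn.Module):
    """Attention over segments: a segment concatenates the patches that sit at
    the same positions across all variables (per the abstract's definition)."""
    def __init__(self, num_vars: int, patch_len: int, num_heads: int = 4):
        super().__init__()
        self.patch_len = patch_len
        self.attn = nn.MultiheadAttention(num_vars * patch_len, num_heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_vars, seq_len), seq_len divisible by patch_len
        B, C, L = x.shape
        N = L // self.patch_len
        # pack into N segment tokens of size C * patch_len
        seg = x.reshape(B, C, N, self.patch_len).transpose(1, 2).reshape(B, N, -1)
        out, _ = self.attn(seg, seg, seg)      # attention across segments
        # unpack to the input shape so layers with shared weights can be stacked
        return out.reshape(B, N, C, self.patch_len).transpose(1, 2).reshape(B, C, L)

shared = SegAtt(num_vars=7, patch_len=16)      # 7 variables, 96-step window
x = torch.randn(32, 7, 96)
for _ in range(3):                             # three layers reuse one parameter set
    x = shared(x)
```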
When Vision Meets Touch: A Contemporary Review for Visuotactile Sensors from the Signal Processing Perspective
Li, Shoujie, Wang, Zihan, Wu, Changsheng, Li, Xiang, Luo, Shan, Fang, Bin, Sun, Fuchun, Zhang, Xiao-Ping, Ding, Wenbo
Tactile sensors, which provide information about the physical properties of objects, are an essential component of robotic systems. Visuotactile sensing technology, with its merits of high resolution and low cost, has facilitated the development of robotics from environment exploration to dexterous operation. Over the years, several reviews on visuotactile sensors for robots have been presented, but few of them discuss the significance of signal processing methods for visuotactile sensors. Beyond ingenious hardware design, the full potential of a sensory system for designated tasks can only be realized with appropriate signal processing methods. Therefore, this paper provides a comprehensive review of visuotactile sensors from the perspective of signal processing methods and outlines possible future research directions for visuotactile sensors.
A Survey on Efficient Inference for Large Language Models
Zhou, Zixuan, Ning, Xuefei, Hong, Ke, Fu, Tianyu, Xu, Jiaming, Li, Shiyao, Lou, Yuming, Wang, Luning, Yuan, Zhihang, Li, Xiuhong, Yan, Shengen, Dai, Guohao, Zhang, Xiao-Ping, Dong, Yuhan, Wang, Yu
Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed toward developing techniques that enhance the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of inefficient LLM inference: the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimizations. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Finally, we summarize the key findings and discuss future research directions.
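As a rough illustration of why the last two causes dominate long-sequence decoding, the Python sketch below compares per-token attention FLOPs with and without a KV cache; the formulas are standard back-of-the-envelope approximations and the numbers are hypothetical, not drawn from the survey's experiments.

```python
# Quadratic attention + auto-regressive decoding, in FLOP terms.

def attention_flops(seq_len: int, d_model: int) -> int:
    # QK^T and the attention-weighted sum over V each cost ~seq_len^2 * d_model
    return 2 * seq_len ** 2 * d_model

def decode_flops(prompt_len: int, new_tokens: int, d_model: int,
                 kv_cache: bool) -> int:
    total = 0
    for t in range(new_tokens):
        n = prompt_len + t + 1
        if kv_cache:
            total += 2 * n * d_model              # one query row vs. cached K/V
        else:
            total += attention_flops(n, d_model)  # naive: recompute full attention
    return total

d = 4096  # hypothetical hidden size
print(decode_flops(1024, 256, d, kv_cache=False))  # grows ~cubically with length
print(decode_flops(1024, 256, d, kv_cache=True))   # grows ~quadratically
```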
HDC: Hierarchical Semantic Decoding with Counting Assistance for Generalized Referring Expression Segmentation
Luo, Zhuoyan, Wu, Yinghao, Liu, Yong, Xiao, Yicheng, Zhang, Xiao-Ping, Yang, Yujiu
The newly proposed Generalized Referring Expression Segmentation (GRES) task extends the formulation of classic RES by involving multiple-target and non-target scenarios. Recent approaches focus on optimizing the final modality-fused feature, which is directly used for segmentation and object-existence identification. However, attempting to integrate all-grained information into a single joint representation is impractical in GRES due to the increased complexity of the spatial relationships among instances and deceptive text descriptions. Furthermore, the subsequent binary target justification across all referent scenarios fails to capture their inherent differences, leading to ambiguity in object understanding. To address these weaknesses, we propose a Hierarchical Semantic Decoding with Counting Assistance framework (HDC). It hierarchically transfers complementary modality information across granularities and then aggregates each well-aligned semantic correspondence for multi-level decoding. Moreover, with complete semantic context modeling, we endow HDC with explicit counting capability to facilitate comprehensive object perception in multiple-, single-, and non-target settings. Experimental results on the gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of HDC, which outperforms state-of-the-art GRES methods by a remarkable margin. Code will be available here.
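The following is a speculative PyTorch sketch of the two ideas named above: hierarchical decoding across granularities and an explicit counting head. The module names, shapes, and fusion scheme are our assumptions for illustration, not HDC's actual architecture.

```python
import torch
import torch.nn as nn

class HDCSketch(nn.Module):
    """Toy hierarchical decoder: cross-attend language into each visual
    granularity, pass the aligned summary down the hierarchy, and predict
    both a segmentation mask and an object count (0 covers non-target)."""
    def __init__(self, dim: int = 256, levels: int = 3):
        super().__init__()
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(levels))
        self.count_head = nn.Linear(dim, 1)   # expected number of referred targets
        self.mask_head = nn.Linear(dim, 1)    # per-token mask logit

    def forward(self, feats, text):
        # feats: list of (B, N_l, dim) visual tokens, coarse -> fine
        # text:  (B, T, dim) referring-expression tokens
        carry = text
        for feat, attn in zip(feats, self.cross):
            aligned, _ = attn(feat, carry, carry)  # inject language at this level
            # append this level's summary so the next level sees it
            carry = torch.cat([text, aligned.mean(1, keepdim=True)], dim=1)
        count = self.count_head(carry.mean(1))              # (B, 1)
        mask_logits = self.mask_head(aligned).squeeze(-1)   # finest-level mask
        return mask_logits, count

dec = HDCSketch()
feats = [torch.randn(2, n, 256) for n in (64, 256, 1024)]
mask, count = dec(feats, torch.randn(2, 12, 256))
```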
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models
Yu, Zhaojian, Wu, Yinghao, Deng, Zhuotao, Tang, Yansong, Zhang, Xiao-Ping
In recent years, large-scale auto-regressive models have made significant progress in various tasks, such as text or video generation. However, the environmental impact of these models has been largely overlooked, with little assessment or analysis of their carbon footprint. To address this gap, we introduce OpenCarbonEval, a unified framework for integrating large-scale models across diverse modalities to predict carbon emissions, providing AI service providers and users with a means to estimate emissions beforehand and helping to mitigate the environmental pressure associated with these models. In OpenCarbonEval, we propose a dynamic throughput modeling approach that captures workload and hardware fluctuations during training for more precise emission estimates. Our evaluation results demonstrate that OpenCarbonEval predicts training emissions more accurately than previous methods and can be seamlessly applied to tasks across different modalities. Specifically, we show that OpenCarbonEval achieves superior performance in predicting carbon emissions for both visual models and language models. By promoting sustainable AI development and deployment, OpenCarbonEval can help reduce the environmental impact of large-scale models and contribute to a more environmentally responsible future for the AI community.
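For intuition, the sketch below shows the generic arithmetic behind such an estimate: convert total training compute and measured (fluctuating) throughput into energy, then into emissions. The constants (GPU power, PUE, grid intensity) and the simple averaging scheme are placeholder assumptions, not OpenCarbonEval's actual model.

```python
# Generic training-emissions arithmetic: compute -> time -> energy -> CO2.

def training_emissions_kg(total_flops: float,
                          throughput_samples: list[float],  # achieved FLOP/s
                          gpu_power_w: float = 400.0,
                          num_gpus: int = 64,
                          pue: float = 1.2,
                          grid_kgco2_per_kwh: float = 0.4) -> float:
    # "Dynamic throughput" flavour: use measured achieved throughput instead
    # of assuming peak hardware FLOP/s, reflecting workload/hardware fluctuation.
    avg_tp = sum(throughput_samples) / len(throughput_samples)
    hours = total_flops / avg_tp / 3600.0
    energy_kwh = hours * gpu_power_w * num_gpus / 1000.0 * pue
    return energy_kwh * grid_kgco2_per_kwh

# e.g. a hypothetical 1e21-FLOP run with fluctuating measured throughput
print(training_emissions_kg(1e21, [2.1e14, 1.8e14, 2.4e14]))
```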
FL-TAC: Enhanced Fine-Tuning in Federated Learning via Low-Rank, Task-Specific Adapter Clustering
Ping, Siqi, Mao, Yuzhu, Liu, Yang, Zhang, Xiao-Ping, Ding, Wenbo
Although large-scale pre-trained models hold great potential for adapting to downstream tasks through fine-tuning, the performance of such fine-tuned models is often limited by the difficulty of collecting sufficient high-quality, task-specific data. Federated Learning (FL) offers a promising solution by enabling fine-tuning across large-scale clients with a variety of task data, but it is bottlenecked by significant communication overhead due to the extensive size of pre-trained models. This paper addresses the high communication cost of fine-tuning large pre-trained models within FL frameworks through low-rank fine-tuning. Specifically, we train a low-rank adapter for each individual task on the client side, followed by server-side clustering of similar adapters into groups to achieve task-specific aggregation. Extensive experiments on various language and vision tasks, such as GLUE and CIFAR-10/100, reveal the evolution of task-specific adapters throughout the FL training process and verify the effectiveness of the proposed low-rank task-specific adapter clustering (TAC) method.
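A minimal sketch of the server-side step might look as follows, assuming each client uploads a flattened low-rank update and the server clusters them with scikit-learn's KMeans; the rank, shapes, and cluster count are illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_aggregate(adapters: list[np.ndarray], n_tasks: int):
    # adapters: one flattened low-rank update (e.g. A @ B) per client
    X = np.stack([a.ravel() for a in adapters])
    labels = KMeans(n_clusters=n_tasks, n_init=10, random_state=0).fit_predict(X)
    # task-specific aggregation: average the adapters within each cluster
    agg = {k: X[labels == k].mean(axis=0).reshape(adapters[0].shape)
           for k in range(n_tasks)}
    return agg, labels

# e.g. 12 clients fine-tuning rank-limited adapters for ~3 underlying tasks
rng = np.random.default_rng(0)
adapters = [rng.normal(size=(64, 64)) for _ in range(12)]
agg, labels = cluster_and_aggregate(adapters, n_tasks=3)
```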
TransformLoc: Transforming MAVs into Mobile Localization Infrastructures in Heterogeneous Swarms
Wang, Haoyang, Xu, Jingao, Zhao, Chenyu, Lu, Zihong, Cheng, Yuhan, Chen, Xuecheng, Zhang, Xiao-Ping, Liu, Yunhao, Chen, Xinlei
A heterogeneous micro aerial vehicle (MAV) swarm consists of resource-rich but expensive advanced MAVs (AMAVs) and resource-limited but cost-effective basic MAVs (BMAVs), offering opportunities in diverse fields. Accurate and real-time localization is crucial for MAV swarms, but current practice lacks a low-cost, high-precision, and real-time solution, especially for lightweight BMAVs. We find an opportunity to accomplish this task by transforming AMAVs into mobile localization infrastructure for BMAVs. However, turning this insight into a practical system is non-trivial due to the challenges of estimating locations under the BMAVs' unknown and diverse localization errors and of allocating AMAV resources given coupled influential factors. This study proposes TransformLoc, a new framework that transforms AMAVs into mobile localization infrastructure, specifically designed for low-cost and resource-constrained BMAVs. We first design an error-aware joint location estimation model to perform intermittent joint location estimation for BMAVs, and then design a proximity-driven adaptive grouping-scheduling strategy to dynamically allocate the resources of AMAVs. TransformLoc achieves a collaborative, adaptive, and cost-effective localization system suitable for large-scale heterogeneous MAV swarms. We implement TransformLoc on industrial drones and validate its performance. Results show that TransformLoc outperforms baselines, including the state of the art, by up to 68% in localization performance, yielding up to a 60% improvement in navigation success rate.
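As a toy illustration of the "error-aware" idea, the sketch below fuses two position estimates by inverse-variance weighting so that noisier sources contribute less; this is our simplification for intuition, not TransformLoc's estimation model.

```python
import numpy as np

def fuse(est_a: np.ndarray, var_a: float, est_b: np.ndarray, var_b: float):
    # inverse-variance weighting: the noisier source gets the smaller weight
    w_a = var_b / (var_a + var_b)
    fused = w_a * est_a + (1 - w_a) * est_b
    fused_var = var_a * var_b / (var_a + var_b)  # always below either input
    return fused, fused_var

# BMAV dead-reckoning (drifting, high variance) vs. an AMAV-provided fix
pos, var = fuse(np.array([10.2, 4.9]), 4.0, np.array([9.6, 5.3]), 0.5)
```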
Dual-modal Tactile E-skin: Enabling Bidirectional Human-Robot Interaction via Integrated Tactile Perception and Feedback
Mu, Shilong, Zhao, Runze, Lin, Zenan, Huang, Yan, Li, Shoujie, Li, Chenchang, Zhang, Xiao-Ping, Ding, Wenbo
To foster immersive and natural human-robot interaction, implementing tactile perception and feedback becomes imperative, effectively bridging the conventional sensory gap. In this paper, we propose a dual-modal electronic skin (e-skin) that integrates magnetic tactile sensing and vibration feedback for enhanced human-robot interaction. The dual-modal tactile e-skin offers multi-functional tactile sensing and programmable haptic feedback, underpinned by a layered structure composed of flexible magnetic films, soft silicone, a Hall sensor and actuator array, and a microcontroller unit. The e-skin captures the magnetic field changes caused by subtle deformations through the Hall sensors, employing deep learning for accurate tactile perception. Simultaneously, the actuator array generates mechanical vibrations for haptic feedback, delivering diverse mechanical stimuli. Notably, the dual-modal e-skin can transmit tactile information bidirectionally, enabling object recognition and fine weighing operations. This bidirectional tactile interaction framework will enhance the immersion and efficiency of interactions between humans and robots.
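A hypothetical perception pipeline matching this description could feed the Hall-sensor array's three-axis field readings into a small network; the taxel count, layer sizes, and class count below are assumptions for illustration only, not the paper's design.

```python
import torch
import torch.nn as nn

class TactileNet(nn.Module):
    """Classify contact/objects from an array of 3-axis Hall-sensor readings."""
    def __init__(self, taxels: int = 16, classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(taxels * 3, 128),  # Bx, By, Bz per taxel
            nn.ReLU(),
            nn.Linear(128, classes))

    def forward(self, b_field: torch.Tensor) -> torch.Tensor:
        # b_field: (batch, taxels * 3) flattened magnetic-field readings
        return self.net(b_field)

logits = TactileNet()(torch.randn(8, 48))  # 16 taxels x 3 axes
```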
Social Physics Informed Diffusion Model for Crowd Simulation
Chen, Hongyi, Ding, Jingtao, Li, Yong, Wang, Yue, Zhang, Xiao-Ping
Crowd simulation has crucial applications in various domains, such as urban planning, architectural design, and traffic arrangement. In recent years, physics-informed machine learning methods have achieved state-of-the-art performance in crowd simulation but fail to comprehensively model the heterogeneity and multi-modality of human movement. In this paper, we propose a social physics-informed diffusion model named SPDiff to mitigate this gap. SPDiff takes both the interactive and historical information of crowds in the current timeframe to reverse the diffusion process, thereby generating the distribution of pedestrian movement in the subsequent timeframe. Inspired by the well-known social physics model of crowd dynamics, i.e., the Social Force model, we design a crowd interaction module to guide the denoising process and further enhance this module with the equivariant properties of crowd interactions. To mitigate error accumulation in long-term simulations, we propose a multi-frame rollout training algorithm for diffusion modeling. Experiments conducted on two real-world datasets demonstrate the superior performance of SPDiff in terms of both macroscopic and microscopic evaluation metrics. Code and appendix are available at https://github.com/tsinghua-fib-lab/SPDiff.
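To sketch how a social-physics term might condition the reverse diffusion process, the toy code below computes a Social-Force-style pairwise repulsion and feeds it as an extra input to the denoiser inside a standard DDPM step; eps_model is a stand-in network and the whole snippet is a simplification for intuition, not the SPDiff implementation.

```python
import torch

def social_force(pos: torch.Tensor, strength: float = 2.0,
                 radius: float = 0.5) -> torch.Tensor:
    # pairwise repulsion decaying with distance, Social Force flavour
    diff = pos[:, None, :] - pos[None, :, :]             # (N, N, 2)
    dist = diff.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    force = strength * torch.exp(-dist / radius) * diff / dist
    return force.sum(dim=1)                              # (N, 2) net force

def reverse_step(x_t, t, eps_model, alpha, alpha_bar):
    # standard DDPM posterior mean, with interaction features as conditioning
    guidance = social_force(x_t)
    eps = eps_model(x_t, t, guidance)                    # interaction-conditioned
    mean = (x_t - (1 - alpha) / (1 - alpha_bar).sqrt() * eps) / alpha.sqrt()
    return mean + (1 - alpha).sqrt() * torch.randn_like(x_t)

# e.g. with a stand-in denoiser that ignores its conditioning
eps_model = lambda x, t, g: torch.zeros_like(x)
x = reverse_step(torch.randn(5, 2), 10, eps_model,
                 torch.tensor(0.99), torch.tensor(0.5))
```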
Few-Shot Class-Incremental Learning with Prior Knowledge
Jiang, Wenhao, Li, Duo, Hu, Menghan, Zhai, Guangtao, Yang, Xiaokang, Zhang, Xiao-Ping
To tackle the issues of catastrophic forgetting and overfitting in few-shot class-incremental learning (FSCIL), previous work has primarily concentrated on preserving the memory of old knowledge during the incremental phase. The role of the pre-trained model in shaping the effectiveness of incremental learning is frequently underestimated in these studies. Therefore, to enhance the generalization ability of the pre-trained model, we propose Learning with Prior Knowledge (LwPK), which introduces nearly free prior knowledge from a few unlabeled data points of subsequent incremental classes. We cluster unlabeled incremental-class samples to produce pseudo-labels, then train these jointly with labeled base-class samples, effectively allocating embedding space for both old and new class data. Experimental results indicate that LwPK effectively enhances the model's resilience against catastrophic forgetting, with theoretical analysis based on empirical risk minimization and class distance measurement corroborating its operational principles. The source code of LwPK is publicly available at https://github.com/StevenJ308/LwPK.
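The pseudo-labeling step could look like the sketch below: cluster features of the unlabeled samples from upcoming classes and offset the cluster ids past the base-label range before joint training. The feature extractor, cluster count, and sizes are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_label(unlabeled_feats: np.ndarray, n_new_classes: int,
                 n_base: int) -> np.ndarray:
    # cluster unlabeled features; offset ids so they never collide with
    # the base-class label range [0, n_base)
    km = KMeans(n_clusters=n_new_classes, n_init=10, random_state=0)
    return km.fit_predict(unlabeled_feats) + n_base

# e.g. 60 base classes, 5 upcoming classes glimpsed via unlabeled features
feats = np.random.default_rng(0).normal(size=(200, 512))
pseudo = pseudo_label(feats, n_new_classes=5, n_base=60)
# ...then train jointly on (base_feats, base_labels) and (feats, pseudo)
```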