Shen, Liang
EDENet: Echo Direction Encoding Network for Place Recognition Based on Ground Penetrating Radar
Zhang, Pengyu, Chen, Xieyuanli, Chen, Yuwei, Bi, Beizhen, Xu, Zhuo, Jin, Tian, Huang, Xiaotao, Shen, Liang
Ground penetrating radar (GPR) based localization has gained significant recognition in robotics due to its ability to detect stable subsurface features, offering advantages in environments where traditional sensors like cameras and LiDAR may struggle. However, existing methods focus primarily on small-scale place recognition (PR), leaving the challenges of PR in large-scale maps unaddressed. These challenges include the inherent sparsity of underground features and the variability of underground dielectric constants, which complicate robust localization. In this work, we investigate the geometric relationship between GPR echo sequences and underground scenes, leveraging the robustness of directional features to inform our network design. We introduce learnable Gabor filters for the precise extraction of directional responses, coupled with a direction-aware attention mechanism for effective geometric encoding. To further enhance performance, we incorporate a shift-invariant unit and a multi-scale aggregation strategy to better accommodate variations in dielectric constants. Experiments on public datasets demonstrate that the proposed EDENet not only surpasses existing solutions in PR performance but also offers advantages in model size and computational efficiency.
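The abstract only names the learnable Gabor filters; as a rough, self-contained illustration of that building block, the PyTorch sketch below treats each filter's orientation, scale, wavelength, and phase as trainable parameters and rebuilds the convolution kernel on every forward pass. The class name `LearnableGabor2d`, the kernel size, and all initial values are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a learnable Gabor filter bank (illustrative; not the
# authors' code). Each output channel owns trainable Gabor parameters, and
# the convolution kernel is rebuilt from them on every forward pass.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGabor2d(nn.Module):
    def __init__(self, out_channels: int, kernel_size: int = 11):
        super().__init__()
        self.kernel_size = kernel_size
        # Trainable Gabor parameters, one set per filter.
        self.theta = nn.Parameter(torch.rand(out_channels) * math.pi)  # orientation
        self.sigma = nn.Parameter(torch.full((out_channels,), 3.0))    # Gaussian scale
        self.lambd = nn.Parameter(torch.full((out_channels,), 6.0))    # wavelength
        self.psi = nn.Parameter(torch.zeros(out_channels))             # phase offset
        # Fixed sampling grid centred on the kernel.
        half = kernel_size // 2
        ys, xs = torch.meshgrid(
            torch.arange(-half, half + 1, dtype=torch.float32),
            torch.arange(-half, half + 1, dtype=torch.float32),
            indexing="ij",
        )
        self.register_buffer("xs", xs)
        self.register_buffer("ys", ys)

    def build_kernels(self) -> torch.Tensor:
        # Rotate the grid by each filter's orientation, then evaluate the
        # Gabor function: Gaussian envelope times an oriented cosine carrier.
        cos_t = torch.cos(self.theta)[:, None, None]
        sin_t = torch.sin(self.theta)[:, None, None]
        x_rot = self.xs * cos_t + self.ys * sin_t
        y_rot = -self.xs * sin_t + self.ys * cos_t
        sigma = self.sigma[:, None, None]
        envelope = torch.exp(-(x_rot**2 + y_rot**2) / (2 * sigma**2))
        carrier = torch.cos(2 * math.pi * x_rot / self.lambd[:, None, None]
                            + self.psi[:, None, None])
        return (envelope * carrier).unsqueeze(1)  # (out, 1, k, k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, H, W) single-channel GPR echo image.
        return F.conv2d(x, self.build_kernels(), padding=self.kernel_size // 2)

# Usage: directional responses for a batch of echo images.
echoes = torch.randn(4, 1, 64, 64)
responses = LearnableGabor2d(out_channels=8)(echoes)  # (4, 8, 64, 64)
```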
Reinforced Decoder: Towards Training Recurrent Neural Networks for Time Series Forecasting
Sima, Qi, Zhang, Xinze, Bao, Yukun, Yang, Siyue, Shen, Liang
Recurrent neural network-based sequence-to-sequence models have been extensively applied to multi-step-ahead time series forecasting. These models typically involve a decoder trained using either its previous forecasts or the actual observed values as the decoder inputs. However, relying on self-generated predictions can lead to the rapid accumulation of errors over multiple steps, while using the actual observations introduces exposure bias, as these values are unavailable during the extrapolation stage. In this regard, this study proposes a novel training approach, called the reinforced decoder, which introduces auxiliary models to generate alternative decoder inputs that remain accessible when extrapolating. Additionally, a reinforcement learning algorithm is utilized to dynamically select the optimal inputs to improve accuracy.
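As a rough illustration of the training idea described above (not the authors' implementation), the sketch below pairs a GRU decoder with a hypothetical auxiliary model and lets a small policy network choose, at each decoding step, which candidate becomes the next decoder input; the policy is then updated with a REINFORCE-style term that uses the negative forecast error as reward. The two-candidate setup, the module names, and the reward shaping are all assumptions for illustration.

```python
# Illustrative sketch of a "reinforced decoder" training step (not the
# authors' code): a policy picks each step's decoder input from candidate
# sources, and REINFORCE rewards choices that reduce the forecast error.
import torch
import torch.nn as nn

hidden_size, horizon = 32, 6
decoder = nn.GRUCell(1, hidden_size)    # one-step-ahead decoder
head = nn.Linear(hidden_size, 1)        # maps hidden state to forecast
aux = nn.GRUCell(1, hidden_size)        # hypothetical auxiliary model
aux_head = nn.Linear(hidden_size, 1)
policy = nn.Linear(hidden_size, 2)      # scores the two candidate inputs

def train_step(h0, first_input, target, opt):
    # h0: encoder state (batch, hidden); target: (batch, horizon).
    h, ha, x = h0, h0.clone(), first_input
    log_probs, preds = [], []
    for t in range(horizon):
        h = decoder(x, h)
        y_hat = head(h)                 # decoder's own forecast
        ha = aux(x, ha)
        y_aux = aux_head(ha)            # auxiliary candidate input
        dist = torch.distributions.Categorical(logits=policy(h))
        choice = dist.sample()          # 0 -> own forecast, 1 -> auxiliary
        log_probs.append(dist.log_prob(choice))
        preds.append(y_hat)
        # Feed the selected candidate forward; detach so gradients flow
        # through the forecasts, not through the recursively fed inputs.
        x = torch.where(choice.unsqueeze(-1).bool(), y_aux, y_hat).detach()
    preds = torch.cat(preds, dim=1)               # (batch, horizon)
    mse = ((preds - target) ** 2).mean(dim=1)     # per-sequence error
    reward = -mse.detach()                        # lower error, higher reward
    loss = mse.mean() - (reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

params = [*decoder.parameters(), *head.parameters(), *aux.parameters(),
          *aux_head.parameters(), *policy.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
train_step(torch.zeros(4, hidden_size), torch.zeros(4, 1),
           torch.randn(4, horizon), opt)
```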
SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
Shen, Liang, Wu, Zhihua, Gong, WeiBao, Hao, Hongxiang, Bai, Yangfan, Wu, HuaChao, Wu, Xinxuan, Bian, Jiang, Xiong, Haoyi, Yu, Dianhai, Ma, Yanjun
With the increasing diversity of ML infrastructures nowadays, distributed training over heterogeneous computing systems is desired to facilitate the production of big models. Mixture-of-Experts (MoE) models have been proposed to lower the cost of training, relative to the overall size of models and data, through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present SE-MoE, which offers elastic MoE training with 2D prefetching and fused communication over hierarchical storage, so as to enable efficient parallelism of various types. For scalable inference on a single node, especially when the model size is larger than GPU memory, SE-MoE organizes the CPU and GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate SE-MoE: it successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state of the art shows that SE-MoE outperforms DeepSpeed, with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks such as UFO, SE-MoE achieves 64% higher throughput with an 18% lower memory footprint. The code of the framework will be released at https://github.com/PaddlePaddle/Paddle.
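The ring-of-sections inference described above can be sketched, very loosely, in PyTorch: model layers are grouped into sections that live in CPU memory, and each section is staged onto the GPU, executed, and evicted in turn, so the GPU holds only a fraction of the model at any moment. The real system overlaps loading with computation (the 2D prefetching mentioned above), which this sequential sketch omits; every name in it is illustrative.

```python
# Minimal sketch of round-robin sectioned inference (illustrative, not the
# SE-MoE implementation): only the active section resides on the GPU.
import torch
import torch.nn as nn

def sectioned_forward(sections, x, device="cuda"):
    # Visit the sections in order; over repeated requests they cycle
    # through the device round-robin style.
    for section in sections:
        section.to(device)    # stage this section's weights onto the GPU
        x = section(x)        # run its share of the computation
        section.to("cpu")     # evict it to make room for the next section
    return x

# Usage: eight sections whose combined size could exceed GPU memory.
sections = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
if torch.cuda.is_available():
    out = sectioned_forward(sections, torch.randn(4, 1024, device="cuda"))
```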
End-to-end Adaptive Distributed Training on PaddlePaddle
Ao, Yulong, Wu, Zhihua, Yu, Dianhai, Gong, Weibao, Kui, Zhiqing, Zhang, Minxu, Ye, Zilingfeng, Shen, Liang, Ma, Yanjun, Wu, Tian, Wang, Haifeng, Zeng, Wei, Yang, Chao
Distributed training has become a pervasive and effective approach for training large neural network (NN) models on massive data. However, it is very challenging to satisfy the requirements of various NN models and diverse computing resources, as well as their dynamic changes during a training job. In this study, we design our distributed training framework from a systematic end-to-end view to provide built-in adaptivity for different scenarios, especially for industrial applications and production environments, by fully considering resource allocation, model partitioning, task placement, and distributed execution. Based on a unified distributed graph and a unified cluster object, our adaptive framework is equipped with a global cost model and a global planner, which enable arbitrary parallelism, resource-aware placement, multi-mode execution, fault tolerance, and elastic distributed training. The experiments demonstrate that our framework can satisfy various requirements arising from the diversity of applications and the heterogeneity of resources with highly competitive performance. The ERNIE language model with 260 billion parameters is efficiently trained on thousands of AI processors with 91.7% weak scalability. The throughput of a model from a recommender system can be increased to up to 2.1 times and 3.3 times that of GPU-only and CPU-only training, respectively, by employing heterogeneous pipeline asynchronous execution. Moreover, fault-tolerant and elastic distributed training have been successfully applied to online industrial applications, yielding a 34.49% reduction in the number of failed long-term training jobs and a 33.91% increase in global scheduling efficiency in the production environment.
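As a toy illustration of what a global cost model and planner might do (the paper's actual models are far richer), the sketch below enumerates (data, tensor, pipeline) parallel degrees that match the device count, rejects plans whose parameter shards exceed per-device memory, scores the rest with a crude analytic cost, and returns the cheapest plan. The cost formula and all coefficients are invented for illustration.

```python
# Toy cost-model-based planner (illustrative; not the framework's planner).
from itertools import product

def plan(num_devices, mem_per_device_gb, model_gb, step_flops, flops_per_dev):
    best = None
    for dp, tp, pp in product((1, 2, 4, 8), repeat=3):
        if dp * tp * pp != num_devices:
            continue                              # degrees must use all devices
        shard_gb = model_gb / (tp * pp)           # parameter shard per device
        if shard_gb > mem_per_device_gb:
            continue                              # violates memory capacity
        compute = step_flops / (flops_per_dev * num_devices)
        comm = 0.1 * compute * (dp - 1 + 2 * (tp - 1))  # crude comm penalty
        bubble = compute * (pp - 1) / pp                # pipeline bubble cost
        cost = compute + comm + bubble
        if best is None or cost < best[0]:
            best = (cost, {"data": dp, "tensor": tp, "pipeline": pp})
    return best

# Usage: pick a plan for a 120 GB model on 8 devices with 40 GB each.
print(plan(num_devices=8, mem_per_device_gb=40, model_gb=120,
           step_flops=1e15, flops_per_dev=1e14))
```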
Low-Power Computer Vision: Status, Challenges, Opportunities
Alyamkin, Sergei, Ardi, Matthew, Berg, Alexander C., Brighton, Achille, Chen, Bo, Chen, Yiran, Cheng, Hsin-Pai, Fan, Zichen, Feng, Chen, Fu, Bo, Gauen, Kent, Goel, Abhinav, Goncharenko, Alexander, Guo, Xuyang, Ha, Soonhoi, Howard, Andrew, Hu, Xiao, Huang, Yuanjun, Kang, Donghyun, Kim, Jaeyoun, Ko, Jong Gook, Kondratyev, Alexander, Lee, Junhyeok, Lee, Seungjae, Lee, Suwoong, Li, Zichao, Liang, Zhiyu, Liu, Juzheng, Liu, Xin, Lu, Yang, Lu, Yung-Hsiang, Malik, Deeptanshu, Nguyen, Hong Hanh, Park, Eunbyung, Repin, Denis, Shen, Liang, Sheng, Tao, Sun, Fei, Svitov, David, Thiruvathukal, George K., Zhang, Baiwu, Zhang, Jingchi, Zhang, Xiaopeng, Zhuo, Shaojie
Computer vision has achieved impressive progress in recent years. Meanwhile, mobile phones have become the primary computing platforms for millions of people. In addition to mobile phones, many autonomous systems rely on visual data for making decisions, and some of these systems, such as unmanned aerial vehicles (also called drones) and mobile robots, have limited energy. These systems rely on batteries, and energy efficiency is critical. This article serves two main purposes: (1) to examine the state of the art in low-power solutions for detecting objects in images. Since 2015, the IEEE Annual International Low-Power Image Recognition Challenge (LPIRC) has been held to identify the most energy-efficient computer vision solutions; this article summarizes the 2018 winners' solutions. (2) To suggest directions for research as well as opportunities for low-power computer vision.