Wang, Jiahao
Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape
Xu, Jiacong, Zhang, Yi, Peng, Jiawei, Ma, Wufei, Jesslen, Artur, Ji, Pengliang, Hu, Qixin, Zhang, Jiehua, Liu, Qihao, Wang, Jiahao, Ji, Wei, Wang, Chen, Yuan, Xiaoding, Kaushik, Prakhar, Zhang, Guofeng, Liu, Jie, Xie, Yushan, Cui, Yawen, Yuille, Alan, Kortylewski, Adam
Accurately estimating 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dataset for 3D pose and shape estimation of mammals. Animal3D consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and, importantly, the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure the highest-quality results. Based on the Animal3D dataset, we benchmark representative shape and pose estimation models in three settings: (1) supervised learning from only the Animal3D data, (2) synthetic-to-real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models. Our experimental results demonstrate that predicting the 3D shape and pose of animals across species remains a very challenging task, despite significant advances in human pose estimation. Our results further demonstrate that synthetic pre-training is a viable strategy to boost model performance. Overall, Animal3D opens new directions for future research in animal 3D pose and shape estimation, and is publicly available.
4D Millimeter-Wave Radar in Autonomous Driving: A Survey
Han, Zeyu, Wang, Jiahao, Xu, Zikun, Yang, Shuocheng, He, Lei, Xu, Shaobing, Wang, Jianqiang
The 4D millimeter-wave (mmWave) radar, capable of measuring the range, azimuth, elevation, and velocity of targets, has attracted considerable interest in the autonomous driving community. This is attributed to its robustness in extreme environments and its outstanding velocity and elevation measurement capabilities. However, despite the rapid development of research related to its sensing theory and application, there is a notable lack of surveys on the topic of 4D mmWave radar. To address this gap and foster future research in this area, this paper presents a comprehensive survey on the use of 4D mmWave radar in autonomous driving. The theoretical background and progress of 4D mmWave radars are reviewed first, including the signal processing flow, resolution improvement methods, the extrinsic calibration process, and point cloud generation methods. The paper then introduces related datasets and application algorithms for perception, localization, and mapping tasks in autonomous driving. Finally, it concludes by predicting future trends in the field of 4D mmWave radar. To the best of our knowledge, this is the first survey specifically on 4D mmWave radar.
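The four measured quantities named in this abstract (range, azimuth, elevation, velocity) map onto a point cloud by a standard spherical-to-Cartesian conversion. The sketch below is illustrative only and is not taken from the survey: the function names are hypothetical, and the axis convention (x forward, y left, z up, azimuth measured in the x-y plane, elevation measured from that plane) is an assumption that differs between sensors.

```python
import math


def radar_to_cartesian(range_m, azimuth_rad, elevation_rad):
    """Convert one detection from spherical to Cartesian coordinates.

    Assumed convention (sensor-specific in practice): x forward, y left,
    z up; azimuth is measured in the x-y plane from the x axis, and
    elevation is measured upward from the x-y plane.
    """
    horizontal = range_m * math.cos(elevation_rad)  # projection onto the x-y plane
    x = horizontal * math.cos(azimuth_rad)
    y = horizontal * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return x, y, z


def to_point_cloud(detections):
    """Map (range, azimuth, elevation, radial_velocity) tuples to (x, y, z, v)."""
    return [(*radar_to_cartesian(r, az, el), v) for r, az, el, v in detections]


if __name__ == "__main__":
    # Two made-up detections, used only to exercise the functions.
    sample = [(10.0, 0.0, 0.0, -1.5), (20.0, math.pi / 6, math.pi / 18, 3.0)]
    for point in to_point_cloud(sample):
        print(point)
```

The velocity is carried through unchanged because a radar measures it along the line of sight; turning it into a full velocity vector needs additional assumptions that this sketch does not make.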
RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer
Wang, Jiahao, Zhang, Songyang, Liu, Yong, Wu, Taiqiang, Yang, Yujiu, Liu, Xihui, Chen, Kai, Luo, Ping, Lin, Dahua
This paper studies how to keep a vision backbone effective while removing the token mixers from its basic building blocks. Token mixers, such as self-attention in vision transformers (ViTs), are intended to exchange information between different spatial tokens but incur considerable computational cost and latency. However, directly removing them leads to an incomplete model structure prior and thus causes a significant accuracy drop. To this end, we first develop RepIdentityFormer, based on the re-parameterizing idea, to study the token-mixer-free model architecture. We then explore an improved learning paradigm to overcome the limitations of a simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance and high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. Project page: https://techmonsterwang.github.io/RIFormer/.
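The re-parameterizing idea mentioned in this abstract can be illustrated generically: a layer used during training is merged algebraically into a neighboring layer, so it costs nothing at inference. The sketch below folds a per-channel affine transform into the weight and bias of a preceding layer normalization. It is an assumption-laden illustration of the general technique, not the paper's implementation; all function names are hypothetical.

```python
import math


def layer_norm(x, weight, bias, eps=1e-5):
    """Normalize a feature vector, then apply a per-channel weight and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * inv_std + b for v, w, b in zip(x, weight, bias)]


def affine(x, scale, shift):
    """Per-channel affine transform, standing in for a layer kept only in training."""
    return [s * v + t for v, s, t in zip(x, scale, shift)]


def fold_affine_into_norm(weight, bias, scale, shift):
    """Merge the affine into the norm parameters.

    Uses s * (w * n + b) + t == (s * w) * n + (s * b + t), where n is the
    normalized activation.
    """
    new_weight = [s * w for w, s in zip(weight, scale)]
    new_bias = [s * b + t for b, s, t in zip(bias, scale, shift)]
    return new_weight, new_bias


if __name__ == "__main__":
    # Made-up values, used only to compare the two computation paths.
    x = [0.5, -1.0, 2.0, 0.25]
    weight, bias = [1.0, 0.9, 1.1, 1.2], [0.0, 0.1, -0.1, 0.2]
    scale, shift = [0.8, 1.5, -0.5, 2.0], [0.3, 0.0, 0.1, -0.2]

    two_step = affine(layer_norm(x, weight, bias), scale, shift)
    merged = layer_norm(x, *fold_affine_into_norm(weight, bias, scale, shift))
    print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(two_step, merged)))
```

Both paths compute the same function, so the extra layer can be dropped after training without changing the model's outputs.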
Edge-free but Structure-aware: Prototype-Guided Knowledge Distillation from GNNs to MLPs
Wu, Taiqiang, Zhao, Zhe, Wang, Jiahao, Bai, Xingyu, Wang, Lei, Wong, Ngai, Yang, Yujiu
Distilling high-accuracy Graph Neural Networks (GNNs) to low-latency multilayer perceptrons (MLPs) on graph tasks has become a hot research topic. However, MLPs rely exclusively on the node features and fail to capture the graph structural information. Previous methods address this issue by processing graph edges into extra inputs for MLPs, but such graph structures may be unavailable in various scenarios. To this end, we propose a Prototype-Guided Knowledge Distillation (PGKD) method, which does not require graph edges (edge-free) yet learns structure-aware MLPs. Specifically, we analyze the graph structural information in GNN teachers, and distill such information from GNNs to MLPs via prototypes in an edge-free setting. Experimental results on popular graph benchmarks demonstrate the effectiveness and robustness of the proposed PGKD.
The Multi-Modal Video Reasoning and Analyzing Competition
Peng, Haoran, Huang, He, Xu, Li, Li, Tianjiao, Liu, Jun, Rahmani, Hossein, Ke, Qiuhong, Guo, Zhicheng, Wu, Cong, Li, Rongchang, Ye, Mang, Wang, Jiahao, Zhang, Jiaxu, Liu, Yuanzhong, He, Tao, Zhang, Fuwei, Liu, Xianbin, Lin, Tao
In this paper, we introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop, held in conjunction with ICCV 2021. The competition is composed of four tracks, namely video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification, which are based on two datasets: SUTD-TrafficQA and UAV-Human. We summarize the top-performing methods submitted by the participants and report the results they achieved in the competition.