Ye, Yaozu
Rethinking Event-based Human Pose Estimation with 3D Event Representations
Yin, Xiaoting, Shi, Hao, Chen, Jiaan, Wang, Ze, Ye, Yaozu, Ni, Huajian, Yang, Kailun, Wang, Kaiwei
Human pose estimation is a fundamental and appealing task in computer vision. Traditional frame-based cameras and videos are commonly applied, yet they become less reliable in scenarios with high dynamic range or heavy motion blur. In contrast, event cameras offer a robust solution for navigating these challenging contexts. Predominant methodologies incorporate event cameras into learning frameworks by accumulating events into event frames. However, such methods tend to marginalize the intrinsic asynchronous and high-temporal-resolution characteristics of events. This disregard leads to a loss of essential temporal information that is crucial for discerning distinct actions. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC collates events within concise temporal slices at identical positions, preserving their 3D attributes with statistical cues and markedly mitigating memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. On the public real-world DHP19 dataset, our event point cloud technique excels in real-time mobile predictions, while the decoupled event voxel method achieves the highest accuracy. Experiments on EV-3DPW demonstrate the robustness of our proposed 3D representation methods compared to traditional RGB images and event frame techniques under the same backbones. Our code and dataset have been made publicly available at https://github.com/MasterHow/EventPointPose.
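To make the rasterization idea concrete, the following is a minimal sketch of building a rasterized event point cloud from (x, y, t, p) arrays: events in each temporal slice that share a pixel are merged into one point carrying statistical cues (event count, mean timestamp, polarity sum). Slice count, the chosen statistics, and the normalization are illustrative assumptions, not the exact EventPointPose implementation.

import numpy as np

def rasterize_events(x, y, t, p, num_slices=4, sensor_size=(260, 346)):
    """Merge events sharing a pixel within each temporal slice into one point
    with per-pixel statistics (mean timestamp, polarity sum, event count)."""
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)            # normalize time to [0, 1]
    slice_idx = np.minimum((t_norm * num_slices).astype(int), num_slices - 1)
    H, W = sensor_size
    points = []
    for s in range(num_slices):
        m = slice_idx == s
        if not m.any():
            continue
        # Linear key per (y, x) so duplicate pixels inside the slice can be merged.
        key = y[m].astype(np.int64) * W + x[m].astype(np.int64)
        uniq, inv = np.unique(key, return_inverse=True)
        count = np.bincount(inv)                                      # events per pixel
        mean_t = np.bincount(inv, weights=t_norm[m]) / count          # average timestamp
        pol_sum = np.bincount(inv, weights=p[m].astype(float))        # signed polarity sum
        points.append(np.stack([uniq % W, uniq // W,                  # x, y
                                mean_t, pol_sum, count], axis=1))
    return np.concatenate(points, axis=0)                             # (N, 5) rasterized cloud

# Example with random events:
rng = np.random.default_rng(0)
n = 10000
cloud = rasterize_events(rng.integers(0, 346, n), rng.integers(0, 260, n),
                         np.sort(rng.random(n)), rng.choice([-1, 1], n))
print(cloud.shape)

Because duplicate pixels are collapsed per slice, the resulting point set is far smaller than the raw event stream while still retaining coarse temporal ordering, which is what enables lightweight point-cloud backbones on mobile hardware.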
Towards Anytime Optical Flow Estimation with Event Cameras
Ye, Yaozu, Shi, Hao, Yang, Kailun, Wang, Ze, Yin, Xiaoting, Lin, Yining, Liu, Mao, Wang, Yaonan, Wang, Kaiwei
Optical flow estimation is a fundamental task in the field of autonomous driving. Event cameras can respond to log-brightness changes within microseconds, and their characteristic of producing responses only in changing regions makes them particularly suitable for optical flow estimation. In contrast to the ultra-low-latency response of event cameras, however, existing datasets collected via event cameras provide optical flow ground truth only at limited frame rates (e.g., 10 Hz), greatly restricting the potential of event-driven optical flow. To address this challenge, we put forward a high-frame-rate, low-latency event representation, the Unified Voxel Grid, which is fed into the network sequentially, bin by bin. We then propose EVA-Flow, an EVent-based Anytime Flow estimation network that produces high-frame-rate event optical flow with only low-frame-rate optical flow ground truth for supervision. The key component of EVA-Flow is the stacked Spatiotemporal Motion Refinement (SMR) module, which predicts temporally dense optical flow and enhances accuracy via spatiotemporal motion refinement. The time-dense feature warping utilized in the SMR module provides implicit supervision for the intermediate optical flow. Additionally, we introduce the Rectified Flow Warp Loss (RFWL) for the unsupervised evaluation of intermediate optical flow in the absence of ground truth. To the best of our knowledge, this is the first work focusing on anytime optical flow estimation via event cameras. A comprehensive variety of experiments on MVSEC, DSEC, and our EVA-FlowSet demonstrates that EVA-Flow achieves competitive performance, super-low latency (5 ms), the fastest inference (9.2 ms), time-dense motion estimation (200 Hz), and strong generalization. Our code will be available at https://github.com/Yaozhuwa/EVA-Flow.
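As an illustration of the bin-by-bin idea, the sketch below builds a standard event voxel grid with bilinear temporal weighting and streams it to a network one bin at a time; the exact Unified Voxel Grid formulation in EVA-Flow may differ, and flow_net.step is a hypothetical per-bin recurrent update.

import torch

def event_voxel_grid(x, y, t, p, bins=5, sensor_size=(480, 640)):
    """Accumulate signed polarity into a (bins, H, W) grid with bilinear temporal weighting."""
    H, W = sensor_size
    voxel = torch.zeros(bins, H, W)
    t_norm = (t - t.min()) / (t.max() - t.min()).clamp(min=1e-9) * (bins - 1)
    t0 = t_norm.floor().long().clamp(max=bins - 1)
    frac = t_norm - t0.float()
    # Each event contributes to its two neighboring bins, weighted by temporal distance.
    for b_idx, w in ((t0, 1.0 - frac), ((t0 + 1).clamp(max=bins - 1), frac)):
        voxel.index_put_((b_idx, y.long(), x.long()), p * w, accumulate=True)
    return voxel

# Streaming the representation bin by bin, as a low-latency front end:
x = torch.randint(0, 640, (20000,))
y = torch.randint(0, 480, (20000,))
t = torch.rand(20000).sort().values
p = torch.randint(0, 2, (20000,)).float() * 2 - 1
grid = event_voxel_grid(x, y, t, p)
for b in range(grid.shape[0]):
    bin_slice = grid[b:b + 1]        # (1, H, W) slice handed to the network at this step
    # flow_net.step(bin_slice)       # hypothetical per-bin recurrent update

Feeding bins sequentially is what allows intermediate flow predictions between two ground-truth timestamps, which the SMR module then refines.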
FocusFlow: Boosting Key-Points Optical Flow Estimation for Autonomous Driving
Yi, Zhonghua, Shi, Hao, Yang, Kailun, Jiang, Qi, Ye, Yaozu, Wang, Ze, Ni, Huajian, Wang, Kaiwei
Key-point-based scene understanding is fundamental for autonomous driving applications. At the same time, optical flow plays an important role in many vision tasks. However, due to the implicit bias of equal attention on all points, classic data-driven optical flow estimation methods yield less satisfactory performance on key points, limiting their deployment in key-point-critical, safety-relevant scenarios. To address these issues, we introduce a points-based modeling method that requires the model to learn key-point-related priors explicitly. Based on this modeling method, we present FocusFlow, a framework consisting of 1) a mixed loss function that combines a classic photometric loss with our proposed Conditional Point Control Loss (CPCL) for diverse point-wise supervision; 2) a conditioned controlling model that replaces the conventional feature encoder with our proposed Condition Control Encoder (CCE). CCE incorporates a Frame Feature Encoder (FFE) that extracts features from frames, a Condition Feature Encoder (CFE) that learns to control the feature extraction behavior of FFE from input masks containing key-point information, and fusion modules that transfer the controlling information between FFE and CFE. Our FocusFlow framework shows outstanding performance with up to +44.5% precision improvement on various key points such as ORB, SIFT, and even the learning-based SiLK, along with exceptional scalability for most existing data-driven optical flow methods like PWC-Net, RAFT, and FlowFormer. Notably, FocusFlow yields performance competitive with or superior to the original models on the whole frame. The source code will be available at https://github.com/ZhonghuaYi/FocusFlow_official.
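The following is a hedged sketch of point-conditioned supervision in the spirit of CPCL: the end-point error is weighted more heavily at key-point locations given by an input mask. The exact FocusFlow loss, its conditioning scheme, and the weight value are assumptions for illustration only.

import torch

def point_conditioned_epe(flow_pred, flow_gt, keypoint_mask, point_weight=10.0):
    """flow_pred, flow_gt: (B, 2, H, W); keypoint_mask: (B, 1, H, W) in {0, 1}."""
    epe = torch.norm(flow_pred - flow_gt, dim=1, keepdim=True)   # per-pixel end-point error
    weight = 1.0 + (point_weight - 1.0) * keypoint_mask          # emphasize key-point pixels
    return (weight * epe).sum() / weight.sum()                   # weighted mean EPE

# Example usage with random tensors and a sparse key-point mask:
B, H, W = 2, 64, 96
pred, gt = torch.randn(B, 2, H, W), torch.randn(B, 2, H, W)
mask = (torch.rand(B, 1, H, W) > 0.99).float()                   # ~1% of pixels are key points
loss = point_conditioned_epe(pred, gt, mask)
print(loss.item())

Conditioning the supervision on a per-sample mask is what lets the same framework target ORB, SIFT, or learned detectors without retraining the base flow architecture.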
FishDreamer: Towards Fisheye Semantic Completion via Unified Image Outpainting and Segmentation
Shi, Hao, Li, Yu, Yang, Kailun, Zhang, Jiaming, Peng, Kunyu, Roitberg, Alina, Ye, Yaozu, Ni, Huajian, Wang, Kaiwei, Stiefelhagen, Rainer
This paper raises the new task of Fisheye Semantic Completion (FSC), where dense texture, structure, and semantics of a fisheye image are inferred even beyond the sensor field-of-view (FoV). Fisheye cameras have a larger FoV than ordinary pinhole cameras, yet their unique imaging model naturally leads to a blind area at the edge of the image plane. This is suboptimal for safety-critical applications, since important perception tasks such as semantic segmentation become very challenging within the blind zone. Previous works considered out-of-FoV outpainting and in-FoV segmentation separately. However, we observe that these two tasks are actually closely coupled. To jointly estimate the tightly intertwined complete fisheye image and scene semantics, we introduce FishDreamer, which relies on successful ViTs enhanced with a novel Polar-aware Cross Attention (PCA) module to leverage dense context and guide semantically consistent content generation while considering different polar distributions. In addition to the contribution of the novel task and architecture, we also derive the Cityscapes-BF and KITTI360-BF datasets to facilitate training and evaluation on this new track. Our experiments demonstrate that the proposed FishDreamer outperforms methods solving each task in isolation and surpasses alternative approaches on Fisheye Semantic Completion. Code and datasets are publicly available at https://github.com/MasterHow/FishDreamer.
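To make the in-FoV / out-of-FoV split concrete, a minimal sketch of the circular fisheye mask and a normalized polar-radius map follows; the radius handling and polar encoding are illustrative assumptions, not the exact FishDreamer preprocessing.

import numpy as np

def fisheye_fov_mask(height, width, fov_radius_ratio=0.5):
    """Return a boolean in-FoV mask and a normalized polar-radius map for one image."""
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    radius = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    max_radius = fov_radius_ratio * min(height, width)
    in_fov = radius <= max_radius          # valid fisheye pixels; the rest is the blind zone
    polar_r = radius / max_radius          # >1 outside the FoV, usable as a polar prior
    return in_fov, polar_r

mask, polar = fisheye_fov_mask(512, 512)
print(mask.mean(), polar.max())            # fraction of valid pixels, corner radius

A polar map of this kind is the sort of positional signal that a polar-aware attention module can condition on, since image statistics in fisheye images vary strongly with radius.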
PanoFlow: Learning 360° Optical Flow for Surrounding Temporal Understanding
Shi, Hao, Zhou, Yifan, Yang, Kailun, Yin, Xiaoting, Wang, Ze, Ye, Yaozu, Yin, Zhe, Meng, Shi, Li, Peng, Wang, Kaiwei
Optical flow estimation is a basic task in self-driving and robotics systems, enabling the temporal interpretation of traffic scenes. Autonomous vehicles clearly benefit from the ultra-wide Field of View (FoV) offered by 360° panoramic sensors. However, due to the unique imaging process of panoramic cameras, models designed for pinhole images do not generalize satisfactorily to 360° panoramic images. In this paper, we put forward a novel network framework, PanoFlow, to learn optical flow for panoramic images. To overcome the distortions introduced by equirectangular projection in panoramic transformation, we design a Flow Distortion Augmentation (FDA) method, which contains radial flow distortion (FDA-R) or equirectangular flow distortion (FDA-E). We further look into the definition and properties of cyclic optical flow for panoramic videos, and hereby propose a Cyclic Flow Estimation (CFE) method that leverages the cyclicity of spherical images to infer 360° optical flow and convert large displacements into relatively small displacements. PanoFlow is applicable to any existing flow estimation method and benefits from the progress of narrow-FoV flow estimation. In addition, we create and release FlowScape, a synthetic panoramic dataset based on CARLA, to facilitate training and quantitative analysis. PanoFlow achieves state-of-the-art performance on the public OmniFlowNet benchmark and the established FlowScape benchmark. Our proposed approach reduces the End-Point Error (EPE) on FlowScape by 27.3%. On OmniFlowNet, PanoFlow achieves a 55.5% error reduction from the best published result. We also qualitatively validate our method on a data-collection vehicle and the public real-world OmniPhotos dataset, indicating strong potential and robustness for real-world navigation applications. Code and dataset are publicly available at https://github.com/MasterHow/PanoFlow.
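A hedged sketch of the cyclic idea: because a 360° panorama wraps around horizontally, the target frame can also be rolled by half its width, and for each pixel the smaller of the two candidate displacements is kept. The per-pixel selection rule and wrapping below are assumptions for illustration; flow_net is a placeholder for any narrow-FoV flow estimator.

import torch

def cyclic_flow(flow_net, img1, img2):
    """img1, img2: (B, 3, H, W) equirectangular frames; returns a (B, 2, H, W) flow field."""
    W = img1.shape[-1]
    flow_a = flow_net(img1, img2)                          # direct estimate
    flow_b = flow_net(img1, torch.roll(img2, W // 2, -1))  # estimate against the rolled target
    flow_b[:, 0] += W // 2                                 # undo the roll in x
    flow_b[:, 0] = (flow_b[:, 0] + W / 2) % W - W / 2      # wrap x-flow into [-W/2, W/2)
    mag_a = flow_a.norm(dim=1, keepdim=True)
    mag_b = flow_b.norm(dim=1, keepdim=True)
    return torch.where(mag_a <= mag_b, flow_a, flow_b)     # keep the smaller displacement

# Usage with a trivial stand-in network:
dummy_net = lambda a, b: torch.zeros(a.shape[0], 2, *a.shape[-2:])
flow = cyclic_flow(dummy_net, torch.rand(1, 3, 128, 256), torch.rand(1, 3, 128, 256))
print(flow.shape)

Because the estimator only ever sees displacements of at most half the panorama width, any off-the-shelf narrow-FoV flow network can be plugged in unchanged, which is the sense in which CFE is method-agnostic.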