equirectangular image
Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation
Zayene, Mehdi, Endres, Jannik, Havolli, Albias, Corbière, Charles, Cherkaoui, Salim, Kontouli, Alexandre, Alahi, Alexandre
Despite considerable progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, consisting of 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with diverse lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with a significantly increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. The results show that while recent stereo methods perform decently, a significant challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, achieving improved performance.
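As a rough illustration of what projecting a 3D point onto an equirectangular image involves, the sketch below maps a point in the camera frame to pixel coordinates and a range value. The function name `project_to_equirect`, the axis convention (z up, longitude measured from +x), and the image layout are assumptions made for this example; they are not taken from the Helvipad paper or its code.

```python
import math


def project_to_equirect(x, y, z, width, height):
    """Map a 3D point in the camera frame to (u, v, range).

    Assumed convention (illustrative only):
      - z points up; longitude is measured in the x-y plane from +x.
      - u spans longitude [-pi, pi) from left to right.
      - v spans latitude +pi/2 (top row) to -pi/2 (bottom edge).
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("a point at the camera center has no direction")
    lon = math.atan2(y, x)   # longitude in [-pi, pi]
    lat = math.asin(z / r)   # latitude in [-pi/2, pi/2]
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    # Longitude is periodic, so wrap u back into [0, width).
    return u % width, v, r


# A point straight ahead lands at the image center.
u, v, depth = project_to_equirect(1.0, 0.0, 0.0, 2048, 1024)
print(u, v, depth)
```

In a real pipeline the LiDAR points would first be transformed into the camera frame with the calibrated extrinsics, and `v` would be clamped to the valid row range before indexing into the image.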
Reviews: Learning Spherical Convolution for Fast Features from 360 Imagery
This paper describes a method to transform networks learned on perspective images to take spherical images as input. This is an important problem as fisheye and 360-degree sensors become increasingly ubiquitous but training data is relatively scarce. The method first transforms the network architecture to adapt the filter sizes and pooling operations to convolutions on an equirectangular representation/projection. Next, the filters are learned to match the feature responses of the original network when considering the projections to the tangent plane of the respective feature response. The filters are pre-learned layer-by-layer and fine-tuned to output features as similar as possible to the original network projected to the tangent planes. Detection experiments on Pano2Vid and PASCAL demonstrate that the technique performs slightly below the optimal performance obtained with per-pixel tangent projections (while being significantly faster) and outperforms several baselines, including cube map projections.
Geometry Fidelity for Spherical Images
Christensen, Anders, Mojab, Nooshin, Patel, Khushman, Ahuja, Karan, Akata, Zeynep, Winther, Ole, Gonzalez-Franco, Mar, Colaco, Andrea
Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.
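To make the idea of continuity across the border of a 2D representation concrete, here is a deliberately simplified proxy: the mean absolute difference between the leftmost and rightmost columns of an equirectangular image, which are neighbors on the sphere. This is not the kernel-based Discontinuity Score defined in the paper; the function name `seam_gap` and the grayscale list-of-rows input are assumptions made for this sketch.

```python
def seam_gap(image):
    """Mean absolute difference between the first and last column.

    `image` is a grayscale image given as a list of rows (assumed format).
    On a valid equirectangular image these two columns are adjacent on
    the sphere, so a large value hints at a visible seam.
    """
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    diffs = [abs(row[0] - row[-1]) for row in image]
    return sum(diffs) / len(diffs)


seamless = [[1, 2, 1], [3, 0, 3]]     # left and right edges match
broken = [[0, 5, 10], [0, 5, 20]]     # edges disagree
print(seam_gap(seamless), seam_gap(broken))
```

A raw column difference like this also reacts to genuine image texture at the border, which is one reason a more careful, kernel-based formulation would be preferable in practice.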
FindView: Precise Target View Localization Task for Look Around Agents
Ishikawa, Haruya, Aoki, Yoshimitsu
The field of research aims to create agents that use visual sensors for solving complex tasks or aiding humans by learning to perceive, communicate, and act in their environment. Humans in the loop make the goal very difficult since the dynamics of the environment are changeable, and human interactions can lead to unexpected events. Towards better collaboration between agents and humans, agents must be able to localize any point in space in a way that reflects the characteristics of humans' perception of 3D space (Cirik et al. [2020]). Since the visual sensors for the agents are commonly RGB sensors with a partial Field-of-View (FoV), we would need to train these agents to perceive how humans see from these views. Communication with these agents will almost always require the agents to navigate to view a common referential FoV in the scene so that the human can instruct the agents with the shared context. A challenge arises since the point of interest could be any point in the scene, and many points in the scene will not correspond to easily named objects. So far, many embodied agents being researched either use partial FoVs or directly use panoramic images that are hard for human observers to understand. We believe that embodied agents should be able to look around and localize in various views that human observers might be looking at. We approach this problem by introducing a new task, namely the FindView task, to evaluate and benchmark the agents (Figure 1).
DIGITOUR: Automatic Digital Tours for Real-Estate Properties
Chhikara, Prateek, Kuhar, Harshul, Goyal, Anil, Sharma, Chirag
A virtual or digital tour is a form of virtual reality technology which allows a user to experience a specific location remotely. Currently, these virtual tours are created by following a 2-step strategy. First, a photographer captures 360-degree equirectangular images; then, a team of annotators manually links these images for the "walkthrough" user experience. The major challenge in the mass adoption of virtual tours is the time and cost involved in manual annotation/linking of images. Therefore, this paper presents an end-to-end pipeline to automate the generation of 3D virtual tours using equirectangular images for real-estate properties. We propose a novel HSV-based coloring scheme for paper tags that need to be placed at different locations before capturing the equirectangular images using 360-degree cameras. These tags have two characteristics: i) they are numbered, which helps the photographer place the tags in sequence; and ii) they are bi-colored, which allows better learning of the tag detection (using the YOLOv5 architecture) and digit recognition (using a custom MobileNet architecture) tasks. Finally, we link/connect all the equirectangular images based on detected tags. We show the efficiency of the proposed pipeline on a real-world equirectangular image dataset collected from the Housing.com database.
Attention-Enhanced Cross-modal Localization Between 360 Images and Point Clouds
Zhao, Zhipeng, Yu, Huai, Lyv, Chenwei, Yang, Wen, Scherer, Sebastian
Visual localization plays an important role for intelligent robots and autonomous driving, especially when the accuracy of GNSS is unreliable. Recently, camera localization in LiDAR maps has attracted more and more attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow Field-of-View, thus leading to limited information compared with the omni-directional LiDAR data. To overcome this limitation, we focus on correlating the information of 360 equirectangular images to point clouds, proposing an end-to-end learnable network to conduct cross-modal visual localization by establishing similarity in high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture the salient feature for comparing images and point clouds. We construct several sequences containing 360 equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.
PanoFlow: Learning 360° Optical Flow for Surrounding Temporal Understanding
Shi, Hao, Zhou, Yifan, Yang, Kailun, Yin, Xiaoting, Wang, Ze, Ye, Yaozu, Yin, Zhe, Meng, Shi, Li, Peng, Wang, Kaiwei
Optical flow estimation is a basic task in self-driving and robotics systems, enabling temporal interpretation of traffic scenes. Autonomous vehicles clearly benefit from the ultra-wide Field of View (FoV) offered by 360° panoramic sensors. However, due to the unique imaging process of panoramic cameras, models designed for pinhole images do not directly generalize satisfactorily to 360° panoramic images. In this paper, we put forward a novel network framework, PanoFlow, to learn optical flow for panoramic images. To overcome the distortions introduced by equirectangular projection in panoramic transformation, we design a Flow Distortion Augmentation (FDA) method, which contains radial flow distortion (FDA-R) or equirectangular flow distortion (FDA-E). We further look into the definition and properties of cyclic optical flow for panoramic videos, and propose a Cyclic Flow Estimation (CFE) method that leverages the cyclicity of spherical images to infer 360° optical flow and converts large displacements into relatively small displacements. PanoFlow is applicable to any existing flow estimation method and benefits from the progress of narrow-FoV flow estimation. In addition, we create and release a synthetic panoramic dataset, FlowScape, based on CARLA to facilitate training and quantitative analysis. PanoFlow achieves state-of-the-art performance on the public OmniFlowNet and the established FlowScape benchmarks. Our proposed approach reduces the End-Point-Error (EPE) on FlowScape by 27.3%. On OmniFlowNet, PanoFlow achieves a 55.5% error reduction from the best published result. We also qualitatively validate our method via a collection vehicle and a public real-world OmniPhotos dataset, indicating strong potential and robustness for real-world navigation applications. Code and dataset are publicly available at https://github.com/MasterHow/PanoFlow.
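The cyclicity mentioned above can be illustrated with a small sketch: because the left and right borders of an equirectangular image meet on the sphere, a horizontal displacement of almost one image width is equivalent to a short displacement in the opposite direction. The helper `wrap_horizontal_flow` below is a hypothetical name introduced for this example and only demonstrates that equivalence; it is not the CFE method from the paper.

```python
def wrap_horizontal_flow(du, width):
    """Return the shortest equivalent horizontal displacement.

    `du` is a horizontal flow component in pixels and `width` is the
    panorama width in pixels. The result lies in [-width/2, width/2).
    """
    half = width / 2.0
    return (du + half) % width - half


# A 1900-pixel shift to the right on a 2048-pixel-wide panorama is the
# same as a 148-pixel shift to the left across the seam.
print(wrap_horizontal_flow(1900, 2048))
print(wrap_horizontal_flow(10, 2048))
```

Vertical flow has no such wrap-around in the equirectangular layout, so only the horizontal component is treated this way here.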