pose estimation network
No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation
Sung, Mingyu, Choe, Hyeonmin, Kim, Il-Min, Yun, Sangseok, Kang, Jae Mo
Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a crucial and pivotal role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In the real-world scenarios, MDE models often need to be deployed in environments with different conditions from those for training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address the issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods are still ineffective and problematic when applied to diverse and dynamic environments. To break through this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) pose-agnostic TTA paradigm for MDE and (ii) instance-aware image masking. Specifically, PITTA enables highly effective TTA on a pretrained MDE network in a pose-agnostic manner without resorting to any camera pose information. Besides, our instance-aware masking strategy extracts instance-wise masks for dynamic objects (e.g., vehicles, pedestrians, etc.) from a segmentation mask produced by a pretrained panoptic segmentation network, by removing static objects including background components. To further boost performance, we also present a simple yet effective edge extraction methodology for the input image (i.e., a single monocular image) and depth map. Extensive experimental evaluations on DrivingStereo and Waymo datasets with varying environmental conditions demonstrate that our proposed framework, PITTA, surpasses the existing state-of-the-art techniques with remarkable performance improvements in MDE during TTA.
Semi-SD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving
Xie, Yusen, Huang, Zhengmin, Shen, Shaojie, Ma, Jun
In this paper, we introduce Semi-SD, a novel metric depth estimation framework tailored for surrounding cameras equipment in autonomous driving. In this work, the input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively address the scale ambiguity in multi-camera setups. Moreover, semantic world model and monocular depth estimation world model are integrated to supervised the depth estimation, which improve the quality of depth estimation. We evaluate our algorithm on DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding camera based depth estimation quality. The source code will be available on https://github.com/xieyuser/Semi-SD.
Increasing the Task Flexibility of Heavy-Duty Manipulators Using Visual 6D Pose Estimation of Objects
Mäkinen, Petri, Mustalahti, Pauli, Kivelä, Tuomo, Mattila, Jouni
Recent advances in visual 6D pose estimation of objects using deep neural networks have enabled novel ways of vision-based control for heavy-duty robotic applications. In this study, we present a pipeline for the precise tool positioning of heavy-duty, long-reach (HDLR) manipulators using advanced machine vision. A camera is utilized in the so-called eye-in-hand configuration to estimate directly the poses of a tool and a target object of interest (OOI). Based on the pose error between the tool and the target, along with motion-based calibration between the camera and the robot, precise tool positioning can be reliably achieved using conventional robotic modeling and control methods prevalent in the industry. The proposed methodology comprises orientation and position alignment based on the visually estimated OOI poses, whereas camera-to-robot calibration is conducted based on motion utilizing visual SLAM. The methods seek to avert the inaccuracies resulting from rigid-body--based kinematics of structurally flexible HDLR manipulators via image-based algorithms. To train deep neural networks for OOI pose estimation, only synthetic data are utilized. The methods are validated in a real-world setting using an HDLR manipulator with a 5 m reach. The experimental results demonstrate that an image-based average tool positioning error of less than 2 mm along the non-depth axes is achieved, which facilitates a new way to increase the task flexibility and automation level of non-rigid HDLR manipulators.
Towards real-time 6D pose estimation of objects in single-view cone-beam X-ray
Viviers, Christiaan G. A., de Bruijn, Joel, Filatova, Lena, de With, Peter H. N., van der Sommen, Fons
Deep learning-based pose estimation algorithms can successfully estimate the pose of objects in an image, especially in the field of color images. 6D Object pose estimation based on deep learning models for X-ray images often use custom architectures that employ extensive CAD models and simulated data for training purposes. Recent RGB-based methods opt to solve pose estimation problems using small datasets, making them more attractive for the X-ray domain where medical data is scarcely available. We refine an existing RGB-based model (SingleShotPose) to estimate the 6D pose of a marked cube from grayscale X-ray images by creating a generic solution trained on only real X-ray data and adjusted for X-ray acquisition geometry. The model regresses 2D control points and calculates the pose through 2D/3D correspondences using Perspective-n-Point(PnP), allowing a single trained model to be used across all supporting cone-beam-based X-ray geometries. Since modern X-ray systems continuously adjust acquisition parameters during a procedure, it is essential for such a pose estimation network to consider these parameters in order to be deployed successfully and find a real use case. With a 5-cm/5-degree accuracy of 93% and an average 3D rotation error of 2.2 degrees, the results of the proposed approach are comparable with state-of-the-art alternatives, while requiring significantly less real training examples and being applicable in real-time applications.
Sim-to-Real 6D Object Pose Estimation via Iterative Self-training for Robotic Bin Picking
Chen, Kai, Cao, Rui, James, Stephen, Li, Yichuan, Liu, Yun-Hui, Abbeel, Pieter, Dou, Qi
However, it is burdensome to annotate a customized dataset associated with each specific bin-picking scenario for training pose estimation models. In this paper, we propose an iterative self-training framework for sim-to-real 6D object pose estimation to facilitate cost-effective robotic grasping. Given a bin-picking scenario, we establish a photo-realistic simulator to synthesize abundant virtual data, and use this to train an initial pose estimation network. This network then takes the role of a teacher model, which generates pose predictions for unlabeled real data. With these predictions, we further design a comprehensive adaptive selection scheme to distinguish reliable results, and leverage them as pseudo labels to update a student model for pose estimation on real data. To continuously improve the quality of pseudo labels, we iterate the above steps by taking the trained student model as a new teacher and re-label real data using the refined teacher model. We evaluate our method on a public benchmark and our newly-released dataset, achieving an ADD(-S) improvement of 11.49% and 22.62% respectively. Our method is also able to improve robotic bin-picking success by 19.54%, demonstrating the potential of iterative sim-to-real solutions for robotic applications.
Adversarial Attacks on Monocular Pose Estimation
Chawla, Hemang, Varma, Arnav, Arani, Elahe, Zonooz, Bahram
Advances in deep learning have resulted in steady progress in computer vision with improved accuracy on tasks such as object detection and semantic segmentation. Nevertheless, deep neural networks are vulnerable to adversarial attacks, thus presenting a challenge in reliable deployment. Two of the prominent tasks in 3D scene-understanding for robotics and advanced drive assistance systems are monocular depth and pose estimation, often learned together in an unsupervised manner. While studies evaluating the impact of adversarial attacks on monocular depth estimation exist, a systematic demonstration and analysis of adversarial perturbations against pose estimation are lacking. We show how additive imperceptible perturbations can not only change predictions to increase the trajectory drift but also catastrophically alter its geometry. We also study the relation between adversarial perturbations targeting monocular depth and pose estimation networks, as well as the transferability of perturbations to other networks with different architectures and losses. Our experiments show how the generated perturbations lead to notable errors in relative rotation and translation predictions and elucidate vulnerabilities of the networks.
Sim2Real Instance-Level Style Transfer for 6D Pose Estimation
Ikeda, Takuya, Tanishige, Suomi, Amma, Ayako, Sudano, Michael, Audren, Hervé, Nishiwaki, Koichi
In recent years, synthetic data has been widely used in the training of 6D pose estimation networks, in part because it automatically provides perfect annotation at low cost. However, there are still non-trivial domain gaps, such as differences in textures/materials, between synthetic and real data. These gaps have a measurable impact on performance. To solve this problem, we introduce a simulation to reality (sim2real) instance-level style transfer for 6D pose estimation network training. Our approach transfers the style of target objects individually, from synthetic to real, without human intervention. This improves the quality of synthetic data for training pose estimation networks. We also propose a complete pipeline from data collection to the training of a pose estimation network and conduct extensive evaluation on a real-world robotic platform. Our evaluation shows significant improvement achieved by our method in both pose estimation performance and the realism of images adapted by the style transfer.
Bangla sign language recognition using concatenated BdSL network
Abedin, Thasin, Prottoy, Khondokar S. S., Moshruba, Ayana, Hakim, Safayat Bin
Sign language is the only medium of communication for the hearing impaired and the deaf and dumb community. Communication with the general mass is thus always a challenge for this minority group. Especially in Bangla sign language (BdSL), there are 38 alphabets with some having nearly identical symbols. As a result, in BdSL recognition, the posture of hand is an important factor in addition to visual features extracted from traditional Convolutional Neural Network (CNN). In this paper, a novel architecture "Concatenated BdSL Network" is proposed which consists of a CNN based image network and a pose estimation network. While the image network gets the visual features, the relative positions of hand keypoints are taken by the pose estimation network to obtain the additional features to deal with the complexity of the BdSL symbols. A score of 91.51% was achieved by this novel approach in test set and the effectiveness of the additional pose estimation network is suggested by the experimental results.
Geometric Change Detection in Digital Twins using 3D Machine Learning
Sundby, Tiril, Graham, Julia Maria, Rasheed, Adil, Tabib, Mandar, San, Omer
Digital twins are meant to bridge the gap between real-world physical systems and virtual representations. Both stand-alone and descriptive digital twins incorporate 3D geometric models, which are the physical representations of objects in the digital replica. Digital twin applications are required to rapidly update internal parameters with the evolution of their physical counterpart. Due to an essential need for having high-quality geometric models for accurate physical representations, the storage and bandwidth requirements for storing 3D model information can quickly exceed the available storage and bandwidth capacity. In this work, we demonstrate a novel approach to geometric change detection in the context of a digital twin. We address the issue through a combined solution of Dynamic Mode Decomposition (DMD) for motion detection, YOLOv5 for object detection, and 3D machine learning for pose estimation. DMD is applied for background subtraction, enabling detection of moving foreground objects in real-time. The video frames containing detected motion are extracted and used as input to the change detection network. The object detection algorithm YOLOv5 is applied to extract the bounding boxes of detected objects in the video frames. Furthermore, the rotational pose of each object is estimated in a 3D pose estimation network. A series of convolutional neural networks conducts feature extraction from images and 3D model shapes. Then, the network outputs the estimated Euler angles of the camera orientation with respect to the object in the input image. By only storing data associated with a detected change in pose, we minimize necessary storage and bandwidth requirements while still being able to recreate the 3D scene on demand.