intermediate representation
Residual Alignment: Uncovering the Mechanisms of Residual Networks
The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions.
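The linearization step described above can be illustrated with a small sketch. The abstract does not define a Residual Jacobian precisely, so this example assumes it is the Jacobian of the residual branch F with respect to the block input x, where the block computes x + F(x). The toy branch, the weight names `W1` and `W2`, and the dimension are all hypothetical and chosen only for illustration; this is not the authors' code.

```python
import numpy as np

def residual_branch(x, W1, W2):
    """Toy residual branch F(x) = W2 @ tanh(W1 @ x); the block output would be x + F(x)."""
    return W2 @ np.tanh(W1 @ x)

def residual_jacobian(x, W1, W2):
    """Analytic Jacobian dF/dx of the toy branch, evaluated at x."""
    pre = W1 @ x
    # d tanh(u)/du = 1 - tanh(u)^2, which scales each row of W1
    return W2 @ ((1.0 - np.tanh(pre) ** 2)[:, None] * W1)

# Hypothetical small block: random weights and input, fixed seed for repeatability
rng = np.random.default_rng(0)
d = 4
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)
x = rng.normal(size=d)

# Linearize the branch at x, then decompose the linear map
J = residual_jacobian(x, W1, W2)
U, S, Vt = np.linalg.svd(J)  # J = U @ diag(S) @ Vt, with S sorted in descending order
```

In a real network the Jacobian would typically come from automatic differentiation rather than a hand-derived formula; the analytic form is used here only to keep the sketch dependency-light.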
PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points
Detecting 3D objects from a single RGB image is intrinsically ambiguous, thus requiring appropriate prior knowledge and intermediate representations as constraints to reduce the uncertainties and improve the consistency between the 2D image plane and 3D world coordinates. To address this challenge, we propose to adopt perspective points as a new intermediate representation for 3D object detection, defined as the 2D projections of local Manhattan 3D keypoints to locate an object; these perspective points satisfy geometric constraints imposed by the perspective projection. We further devise PerspectiveNet, an end-to-end trainable model that simultaneously detects the 2D bounding box, 2D perspective points, and 3D object bounding box for each object from a single RGB image. PerspectiveNet yields three unique advantages: (i) 3D object bounding boxes are estimated based on perspective points, bridging the gap between 2D and 3D bounding boxes without the need for category-specific 3D shape priors.
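The idea of 2D points obtained by perspective projection of 3D keypoints can be sketched with a standard pinhole camera model. This is a minimal illustration, not PerspectiveNet itself: the intrinsic matrix `K`, the helper `project_points`, and the choice of box corners as keypoints are all assumptions made for the example, and the points are taken to be already expressed in the camera frame.

```python
import numpy as np

def project_points(points_3d, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates with a pinhole model."""
    points_3d = np.asarray(points_3d, dtype=float)
    uvw = points_3d @ K.T              # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Eight corners of a hypothetical unit box centered 5 units in front of the camera
corners = np.array([[sx * 0.5, sy * 0.5, 5.0 + sz * 0.5]
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])

points_2d = project_points(corners, K)  # one 2D point per 3D corner
```

Because of the divide by depth, corners farther from the camera land closer to the principal point, which is the kind of geometric constraint a detector could exploit when relating 2D points to a 3D box.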
Rethinking Intermediate Representation for VLM-based Robot Manipulation
Tang, Weiliang, Gao, Jialin, Pan, Jia-Hui, Wang, Gang, Li, Li Erran, Liu, Yunhui, Ding, Mingyu, Heng, Pheng-Ann, Fu, Chi-Wing
A Vision-Language Model (VLM) is an important component for enabling robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, with the shortest inference time among all state-of-the-art parallel works. Also, we formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects.