Object-Oriented Architecture
Affordance-Centric Policy Learning: Sample Efficient and Generalisable Robot Policy Learning using Affordance-Centric Task Frames
Rana, Krishan, Abou-Chakra, Jad, Garg, Sourav, Lee, Robert, Reid, Ian, Suenderhauf, Niko
Affordances are central to robotic manipulation, where most tasks can be simplified to interactions with task-specific regions on objects. By focusing on these key regions, we can abstract away task-irrelevant information, simplifying the learning process and enhancing generalisation. In this paper, we propose an affordance-centric policy-learning approach that centres and appropriately orients a task frame on these affordance regions, allowing us to achieve both intra-category invariance (where policies can generalise across different instances within the same object category) and spatial invariance (which enables consistent performance regardless of object placement in the environment). We propose a method to leverage existing generalist large vision models to extract and track these affordance frames, and demonstrate that our approach can learn manipulation tasks using behaviour cloning from as few as 10 demonstrations, with equivalent generalisation to an image-based policy trained on 305 demonstrations. We provide video demonstrations on our project site: https://affordance-policy.github.io.
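The spatial invariance described in this abstract can be illustrated with a minimal sketch. It assumes planar poses given as (x, y, heading) and uses hypothetical helper names (`to_task_frame`, `apply_rigid_motion`) and made-up pose values; it is not the authors' implementation, which extracts and tracks the frames with large vision models. The point is only that a gripper pose expressed relative to a frame attached to the object does not change when the object and gripper are moved together.

```python
import math

def to_task_frame(pose, frame):
    """Express a world-frame 2D pose (x, y, theta) relative to a task frame."""
    px, py, pth = pose
    fx, fy, fth = frame
    dx, dy = px - fx, py - fy
    c, s = math.cos(fth), math.sin(fth)
    # Rotate the world-frame offset by the inverse of the frame's rotation.
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    # Wrap the relative heading into (-pi, pi].
    th_rel = math.atan2(math.sin(pth - fth), math.cos(pth - fth))
    return (x_rel, y_rel, th_rel)

def apply_rigid_motion(pose, motion):
    """Move a world-frame pose by a rigid motion (tx, ty, rotation)."""
    x, y, th = pose
    tx, ty, rot = motion
    c, s = math.cos(rot), math.sin(rot)
    return (c * x - s * y + tx, s * x + c * y + ty, th + rot)

# Hypothetical values: a gripper pose and an affordance frame on an object.
gripper = (0.5, 0.2, 0.3)
affordance_frame = (0.4, 0.1, 1.0)

# Place the same scene somewhere else in the workspace.
placement = (1.5, -0.7, 0.8)
moved_gripper = apply_rigid_motion(gripper, placement)
moved_frame = apply_rigid_motion(affordance_frame, placement)

before = to_task_frame(gripper, affordance_frame)
after = to_task_frame(moved_gripper, moved_frame)
print(before)
print(after)  # matches `before` up to floating-point error
```

A policy that consumes `before`/`after` style inputs sees the same observation for both placements, which is one way to read the abstract's claim of consistent performance regardless of object placement.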
Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation
The task of Visual Object Navigation (VON) involves an agent's ability to locate a particular object within a given scene. To successfully accomplish the VON task, two essential conditions must be fulfilled: 1) the user knows the name of the desired object; and 2) the user-specified object is actually present within the scene. To meet these conditions, a simulator can incorporate predefined object names and positions into the metadata of the scene. However, in real-world scenarios, it is often challenging to ensure that these conditions are always met. Humans in an unfamiliar environment may not know which objects are present in the scene, or they may mistakenly specify an object that is not actually present.
VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking
Qian, Zekun, Han, Ruize, Hou, Junhui, Song, Linqi, Feng, Wei
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This problem combines the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly approaching the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results show that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.
Zero-Shot Semantic Segmentation
Semantic segmentation models are limited in their ability to scale to large numbers of object classes. In this paper, we introduce the new task of zero-shot semantic segmentation: learning pixel-wise classifiers for never-seen object categories with zero training examples. To this end, we present a novel architecture, ZS3Net, combining a deep visual segmentation model with an approach to generate visual representations from semantic word embeddings. In this way, ZS3Net addresses pixel classification tasks where both seen and unseen categories are faced at test time (so-called generalized zero-shot classification). Performance is further improved by a self-training step that relies on automatic pseudo-labeling of pixels from unseen classes.
Reviews: Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning
Post rebuttal: I now understand the middle ground in which this paper is positioned, and how it differs from propositional OO representations, where you don't necessarily care which instance of an object type you're dealing with, which significantly reduces the dimensionality of learning transition dynamics. But this is still similar to other work on graph neural networks for model learning in fully relational representations, like Relation Networks by Santoro et al. and Interaction Networks by Battaglia et al., which in the worst case learn T * n * (n-1) relations for n objects and T types of relations. However, this paper does do a nice job of formalizing from the OO-MDP and Propositional MDP setting, as opposed to the two papers I mentioned, which do not and instead focus on the physical dynamics case. I am willing to increase my score based on this, but still do not think it is novel enough to be accepted. This is very similar to relational MDPs, but they learn transition dynamics in this relational attribute space rather than real state space.
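The worst-case count of T * n * (n-1) relations mentioned in this review can be checked with a short sketch. It assumes that relations are directed (an ordered pair of distinct objects per relation type); the function name `count_pairwise_relations` is hypothetical and does not come from the reviewed paper or the cited works.

```python
from itertools import permutations

def count_pairwise_relations(num_objects, num_relation_types):
    """Count directed relations: one per relation type per ordered object pair."""
    # permutations(..., 2) yields every ordered pair of distinct objects,
    # so there are n * (n - 1) of them.
    ordered_pairs = list(permutations(range(num_objects), 2))
    return num_relation_types * len(ordered_pairs)

# The count grows quadratically in the number of objects.
for n in (2, 5, 10):
    print(n, count_pairwise_relations(n, num_relation_types=3))
```

For example, 10 objects with 3 relation types already give 3 * 10 * 9 = 270 relations, which is the scaling concern behind the reviewer's comparison.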
Reviews: Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
An approach for joint estimation of 3D Layout, 3D Object Detection, Camera Pose Estimation and 'Holistic Scene Understanding' (as defined in Song et al. (2015)) is proposed. More specifically, deep nets, functional mappings (e.g., projections from 3D to 2D points) and loss functions are combined to obtain a holistic interpretation of a scene illustrated in a single RGB image. The proposed approach is shown to outperform 3DGP (Choi et al. (2013)) and IM2CAD (Izadinia et al. (2017)) on the SUN RGB-D dataset. Review Summary: The paper is well written and presents an intuitive approach which is illustrated to work well when compared to two baselines. For some of the tasks, e.g., 3D Layout estimation, stronger baselines exist and as a reviewer/reader I can't assess how the proposed approach compares.
Reviews: Object-Oriented Dynamics Predictor
This paper addresses the problem of action-conditional video prediction via a deep neural network whose architecture specifically aims to represent object positions, relationships, and interactions. The learned models are shown empirically to generalize to novel object configurations and to be robust to minor changes in object appearance. Technical Quality: As far as I can tell the paper is technically sound. The experiments are well-designed to support the main claims. I especially appreciated the attempts to study whether the network is truly capturing object-based knowledge as a human might expect (rather than simply being a really fancy pixel-to-pixel model).
Reviews: Learning Hierarchical Semantic Image Manipulation through Structured Representations
In this paper, a new method for image manipulation is proposed. The proposed method incorporates a hierarchical framework and provides both interactive and automatic semantic object-level image manipulation. In the interactive manipulation setting, the user can select a bounding box where image editing for adding and removing objects will be applied. The proposed network architecture consists of a foreground output stream, which produces predictions of the binary object mask, and a background output stream, which produces per-pixel label maps. As a result, the proposed image manipulation method generates the output image by filling in the pixel-level textures guided by the semantic layout.
Reviews: Modelling and unsupervised learning of symmetric deformable object categories
Summary: This work proposes an approach to model symmetries in deformable object categories in an unsupervised manner. This approach has been demonstrated to work for objects with bilateral symmetry (identifying symmetries in human faces using the CelebA dataset, in cats using a cat-head dataset, and in cars using a synthetic car dataset), and finally for rotational symmetry on a protein structure. Pros: Overview of the problem and associated challenges. The proposed approach seems a natural way to establish dense correspondences for non-rigid objects given two views of the same object category (see, for example, Figure 3). In my opinion, correspondence for non-rigid/deformable objects is a far more important problem than symmetry (with a potential impact on numerous problems including non-rigid 3D reconstruction, wide-baseline disparity estimation, human analysis, etc.).
SharpSLAM: 3D Object-Oriented Visual SLAM with Deblurring for Agile Drones
Davletshin, Denis, Zhura, Iana, Cheremnykh, Vladislav, Rybiyanov, Mikhail, Fedoseev, Aleksey, Tsetserukou, Dzmitry
The paper focuses on an algorithm for improving the quality of 3D reconstruction and segmentation in DSP-SLAM by enhancing the RGB image quality. Our SharpSLAM algorithm aims to decrease the influence of highly dynamic motion on visual object-oriented SLAM through image deblurring, improving all aspects of object-oriented SLAM, including localization, mapping, and object reconstruction. The experimental results revealed a noticeable improvement in object detection quality, with the F-score increasing from 82.9% to 86.2% due to the higher number of features and corresponding map points. The RMSE of the signed distance function also decreased from 17.2 to 15.4 cm. Furthermore, our solution enhanced object positioning, with the IoU increasing from 74.5% to 75.7%. The SharpSLAM algorithm has the potential to substantially improve the quality of 3D reconstruction and segmentation in DSP-SLAM and to impact a wide range of fields, including robotics, autonomous vehicles, and augmented reality.
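For readers unfamiliar with the IoU figure quoted in this abstract, the metric can be sketched as follows. This is a generic illustration that assumes axis-aligned 3D boxes given as (min corner, max corner); the abstract does not state how SharpSLAM computes IoU, and the function name `box_iou_3d` and the example boxes are hypothetical.

```python
def box_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes.

    Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)).
    """
    intersection = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(box_a[0], box_a[1], box_b[0], box_b[1]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            # The boxes are disjoint along this axis, so they do not intersect.
            return 0.0
        intersection *= overlap

    def volume(box):
        v = 1.0
        for lo, hi in zip(box[0], box[1]):
            v *= hi - lo
        return v

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union

# Hypothetical example: an estimated box shifted by half its width along x.
ground_truth = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
estimate = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))
print(box_iou_3d(ground_truth, estimate))
```

An IoU of 1.0 means the estimated and reference boxes coincide, and 0.0 means they do not overlap, so the reported change from 74.5% to 75.7% corresponds to slightly tighter agreement in object placement.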