Spatial Reasoning
Neural Representation of Multi-Dimensional Stimuli
Spatial information comes in two forms: direct spatial information (for example, retinal position) and indirect temporal contiguity information, since objects encountered sequentially are in general spatially close. The acquisition of spatial information by a neural network is investigated here. Given a spatial layout of several objects, networks are trained on a prediction task. Networks using temporal sequences with no direct spa(cid:173) tial information are found to develop internal representations that show distances correlated with distances in the external layout. The influence of spatial information is analyzed by providing direct spatial information to the system during training that is either consistent with the layout or inconsistent with it.
Effects of Spatial and Temporal Contiguity on the Acquisition of Spatial Information
Spatial information comes in two forms: direct spatial information (for example, retinal position) and indirect temporal contiguity information, since objects encountered sequentially are in general spatially close. The acquisition of spatial information by a neural network is investigated here. Given a spatial layout of several objects, networks are trained on a prediction task. Networks using temporal sequences with no direct spa(cid:173) tial information are found to develop internal representations that show distances correlated with distances in the external layout. The influence of spatial information is analyzed by providing direct spatial information to the system during training that is either consistent with the layout or inconsistent with it.
A New Model of Spatial Representation in Multimodal Brain Areas
Most models of spatial representations in the cortex assume cells with limited receptive fields that are defined in a particular egocen(cid:173) tric frame of reference. However, cells outside of primary sensory cortex are either gain modulated by postural input or partially shifting. We show that solving classical spatial tasks, like sen(cid:173) sory prediction, multi-sensory integration, sensory-motor transfor(cid:173) mation and motor control requires more complicated intermediate representations that are not invariant in one frame of reference. We present an iterative basis function map that performs these spatial tasks optimally with gain modulated and partially shifting units, and tests it against neurophysiological and neuropsycholog(cid:173) ical data. In order to perform an action directed toward an object, it is necessary to have a representation of its spatial location.
Efficient Bregman Range Search
We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe the first algorithm for range search for an arbitrary Bregman divergence.
Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
There has been a recent push in extraction of 3D spatial layout of scenes. In this paper, we argue for a parametric representation of objects in 3D, which allows us to incorporate volumetric constraints of the physical world. We show that augmenting current structured prediction techniques with volumetric reasoning significantly improves the performance of the state-of-the-art.
Persistent Homology for Learning Densities with Bounded Support
We present a novel method for learning densities with bounded support which enables us to incorporate hard' topological constraints. In particular, we show how emerging techniques from computational algebraic topology and the notion of Persistent Homology can be combined with kernel based methods from Machine Learning for the purpose of density estimation. The proposed formalism facilitates learning of models with bounded support in a principled way, and -- by incorporating Persistent Homology techniques in our approach -- we are able to encode algebraic-topological constraints which are not addressed in current state-of the art probabilistic models. We study the behaviour of our method on two synthetic examples for various sample sizes and exemplify the benefits of the proposed approach on a real-world data-set by learning a motion model for a racecar. We show how to learn a model which respects the underlying topological structure of the racetrack, constraining the trajectories of the car.
Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning
Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but typically cannot handle complex instructions involving spatial relations among multiple objects. which require both reasoning object-level spatial relations and learning precise pixel-level manipulation affordances. We take an initial step to this challenge with a decoupled two-stage solution. In the first stage, we propose an object-centric semantic-spatial reasoner to select which objects are relevant for the language instructed task. The segmentation of selected objects are then fused as additional input to the affordance learning stage. Simply incorporating the inductive bias of relevant objects to a vision-language affordance learning agent can effectively boost its performance in a custom testbed designed for object manipulation with spatial-related language instructions.
3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification
Zhang, Jiazhao, Dai, Liu, Meng, Fanpeng, Fan, Qingnan, Chen, Xuelin, Xu, Kai, Wang, He
Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively unpractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-polices, namely corner-guided exploration policy and category-aware identification policy, simultaneously perform by utilizing online fused 3D points as observation. Through extensive experiments, we show that this framework can dramatically improve the performance in ObjectNav through learning from 3D scene representation. Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets, while requiring (up to 30x) less computational cost for training.
ProContEXT: Exploring Progressive Context Transformer for Tracking
Lan, Jin-Peng, Cheng, Zhi-Qi, He, Jun-Yan, Li, Chenyang, Luo, Bin, Bao, Xu, Xiang, Wangmeng, Geng, Yifeng, Xie, Xuansong
Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamped the tracking framework with Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating the multi-scale static and dynamic templates to progressively perform accurately tracking. It explores the complementary between spatial and temporal context, raising a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revised the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.
Making Vision Transformers Efficient from A Token Sparsification View
Chang, Shuning, Wang, Pichao, Lin, Ming, Wang, Fan, Zhang, David Junhao, Jin, Rong, Shou, Mike Zheng
The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which is powerless for previous token sparsification methods. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone. Code is available at http://github.com/changsn/STViT-R