Goto

Collaborating Authors

 Object-Oriented Architecture


An Extensible Multimodal Multi-task Object Dataset with Materials

arXiv.org Artificial Intelligence

We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon's product-category taxonomy. Objects are annotated with one or more materials from this taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale. Perhaps the biggest problem faced by machine learning practitioners today is that of producing labeled datasets for their specific needs. Manually labeling large amounts of data is time-consuming and costly (Deng et al., 2009; Lin et al., 2014; Kuznetsova et al., 2020). Furthermore, it is often not possible to communicate how numerous ambiguous corner cases should be handled (e.g., is a hole puncher "sharp"?) to the human annotators we typically rely on to produce these labels. Could we solve this problem with the aid of machine learning? We hypothesized that we could accurately add new properties to every instance in a semi-automated fashion if given a rich dataset with substantial information about every instance. Consequently, we developed EMMa, a large, object-centric, multimodal, and multi-task dataset. We show that EMMa can be easily extended to contain any number of new object labels using a Smart Labeling technique we developed for large multi-task and multimodal datasets. Multi-task datasets contain labels for more than one attribute for each instance, whereas multimodal datasets contain data from more than one modality, such as images, text, audio, and tabular data. Derived from Amazon product listings, EMMa contains images, text, and a number of useful attributes, such as materials, mass, price, product category, and product ratings. Each attribute can be used as either a model input or a model output.


Compositional 3D Human-Object Neural Animation

arXiv.org Artificial Intelligence

Human-object interactions (HOIs) are crucial for human-centric scene understanding applications such as human-centric visual generation, AR/VR, and robotics. Since existing methods mainly explore capturing HOIs, rendering HOI remains less investigated. In this paper, we address this challenge in HOI animation from a compositional perspective, i.e., animating novel HOIs including novel interaction, novel human and/or novel object driven by a novel pose sequence. Specifically, we adopt neural human-object deformation to model and render HOI dynamics based on implicit neural representations. To enable the interaction pose transferring among different persons and objects, we then devise a new compositional conditional neural radiance field (or CC-NeRF), which decomposes the interdependence between human and object using latent codes to enable compositionally animation control of novel HOIs. Experiments show that the proposed method can generalize well to various novel HOI animation settings. Our project page is https://zhihou7.github.io/CHONA/


TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation

arXiv.org Artificial Intelligence

How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.


Reason from Context with Self-supervised Learning

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) learns to capture discriminative visual features useful for knowledge transfers. To better accommodate the object-centric nature of current downstream tasks such as object recognition and detection, various methods have been proposed to suppress contextual biases or disentangle objects from contexts. Nevertheless, these methods may prove inadequate in situations where object identity needs to be reasoned from associated context, such as recognizing or inferring tiny or obscured objects. As an initial effort in the SSL literature, we investigate whether and how contextual associations can be enhanced for visual reasoning within SSL regimes, by (a) proposing a new Self-supervised method with external memories for Context Reasoning (SeCo), and (b) introducing two new downstream tasks, lift-the-flap and object priming, addressing the problems of "what" and "where" in context reasoning. In both tasks, SeCo outperformed all state-of-the-art (SOTA) SSL methods by a significant margin. Our network analysis revealed that the proposed external memory in SeCo learns to store prior contextual knowledge, facilitating target identity inference in the lift-the-flap task. Moreover, we conducted psychophysics experiments and introduced a Human benchmark in Object Priming dataset (HOP). Our results demonstrate that SeCo exhibits human-like behaviors.


Head-tail Loss: A simple function for Oriented Object Detection and Anchor-free models

arXiv.org Artificial Intelligence

This paper presents a new loss function for the prediction of oriented bounding boxes, named head-tail-loss. The loss function consists in minimizing the distance between the prediction and the annotation of two key points that are representing the annotation of the object. The first point is the center point and the second is the head of the object. However, for the second point, the minimum distance between the prediction and either the head or tail of the groundtruth is used. On this way, either prediction is valid (with the head pointing to the tail or the tail pointing to the head). At the end the importance is to detect the direction of the object but not its heading. The new loss function has been evaluated on the DOTA and HRSC2016 datasets and has shown potential for elongated objects such as ships and also for other types of objects with different shapes.


An Object-Oriented Framework for the Simulation of Neural Nets

Neural Information Processing Systems

The field of software simulators for neural networks has been ex(cid:173) panding very rapidly in the last years but their importance is still being underestimated. They must provide increasing levels of as(cid:173) sistance for the design, simulation and analysis of neural networks. With our object-oriented framework (SESAME) we intend to show that very high degrees of transparency, manageability and flexibil(cid:173) ity for complex experiments can be obtained. SESAME's basic de(cid:173) sign philosophy is inspired by the natural way in which researchers explain their computational models. Experiments are performed with networks of building blocks, which can be extended very eas(cid:173) ily.


Neural Basis of Object-Centered Representations

Neural Information Processing Systems

We present a neural model that can perform eye movements to a particular side of an object regardless of the position and orienta(cid:173) tion of the object in space, a generalization of a task which has been recently used by Olson and Gettner [4] to investigate the neu(cid:173) ral structure of object-centered representations. Our model uses an intermediate representation in which units have oculocentric recep(cid:173) tive fields- just like collicular neurons- whose gain is modulated by the side of the object to which the movement is directed, as well as the orientation of the object. We show that these gain modulations are consistent with Olson and Gettner's single cell recordings in the supplementary eye field. This demonstrates that it is possible to perform an object-centered task without a representation involv(cid:173) ing an object-centered map, viz., without neurons whose receptive fields are defined in object-centered coordinates. We also show that the same approach can account for object-centered neglect, a situ(cid:173) ation in which patients with a right parietal lesion neglect the left side of objects regardless of the orientation of the objects.


Contextual Modulation of Target Saliency

Neural Information Processing Systems

The most popular algorithms for object detection require the use of exhaustive spatial and scale search procedures. In such approaches, an object is defined by means of local features. We present results with real images showing that the proposed scheme is able to accurately predict likely object classes, locations and sizes.


Efficient Unsupervised Learning for Localization and Detection in Object Categories

Neural Information Processing Systems

We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model repre- sents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The com- plexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be or- ders of magnitude faster than previous approaches while incorporating many more features. Our model has been carefully tested on standard datasets; we compare with a number of recent template models.


A Computational Model of Eye Movements during Object Class Detection

Neural Information Processing Systems

We present a computational model of human eye movements in an ob- ject class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using Ad- aBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acqui- sition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex non- target objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.