Plotting

 Li, Juncheng


Sim-MEES: Modular End-Effector System Grasping Dataset for Mobile Manipulators in Cluttered Environments

arXiv.org Artificial Intelligence

In this paper, we present Sim-MEES: a large-scale synthetic dataset that contains 1,550 objects with varying difficulty levels and physics properties, as well as 11 million grasp labels for mobile manipulators to plan grasps using different gripper modalities in cluttered environments. Our dataset generation process combines analytic models and dynamic simulations of the entire cluttered environment to provide accurate grasp labels. We provide a detailed study of our proposed labeling process for both parallel jaw grippers and suction cup grippers, comparing them with state-of-the-art methods to demonstrate how Sim-MEES can provide precise grasp labels in cluttered environments.


DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention

arXiv.org Artificial Intelligence

Many studies have been conducted to improve the efficiency of Transformer from quadric to linear. Among them, the low-rank-based methods aim to learn the projection matrices to compress the sequence length. However, the projection matrices are fixed once they have been learned, which compress sequence length with dedicated coefficients for tokens in the same position. Adopting such input-invariant projections ignores the fact that the most informative part of a sequence varies from sequence to sequence, thus failing to preserve the most useful information that lies in varied positions. In addition, previous efficient Transformers only focus on the influence of sequence length while neglecting the effect of hidden state dimension. To address the aforementioned problems, we present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity by jointly optimizing the sequence length and hidden state dimension while maintaining state-of-the-art performance. Specifically, we first theoretically demonstrate that the sequence length can be compressed non-destructively from a novel perspective of information theory, with compression matrices dynamically determined by the input sequence. Furthermore, we show that the hidden state dimension can be approximated by extending the Johnson-Lindenstrauss lemma, optimizing the attention in bilinear form. Theoretical analysis shows that DBA is proficient in capturing high-order relations in cross-attention problems. Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance compared with various strong baselines while maintaining less memory consumption with higher speed.


MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning

arXiv.org Artificial Intelligence

Text-based image captioning (TextCap) requires simultaneous comprehension of visual content and reading the text of images to generate a natural language description. Although a task can teach machines to understand the complex human environment further given that text is omnipresent in our daily surroundings, it poses additional challenges in normal captioning. A text-based image intuitively contains abundant and complex multimodal relational content, that is, image details can be described diversely from multiview rather than a single caption. Certainly, we can introduce additional paired training data to show the diversity of images' descriptions, this process is labor-intensive and time-consuming for TextCap pair annotations with extra texts. Based on the insight mentioned above, we investigate how to generate diverse captions that focus on different image parts using an unpaired training paradigm. We propose the Multimodal relAtional Graph adversarIal inferenCe (MAGIC) framework for diverse and unpaired TextCap. This framework can adaptively construct multiple multimodal relational graphs of images and model complex relationships among graphs to represent descriptive diversity. Moreover, a cascaded generative adversarial network is developed from modeled graphs to infer the unpaired caption generation in image-sentence feature alignment and linguistic coherence levels. We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image. Experimental results show that MAGIC can generate very promising outcomes without using any image-caption training pairs.


Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations

arXiv.org Artificial Intelligence

In the problem of learning disentangled representations, one of the promising methods is to factorize aggregated posterior by penalizing the total correlation of sampled latent variables. However, this well-motivated strategy has a blind spot: there is a disparity between the sampled latent representation and its corresponding mean representation. In this paper, we provide a theoretical explanation that low total correlation of sampled representation cannot guarantee low total correlation of the mean representation. Indeed, we prove that for the multivariate normal distributions, the mean representation with arbitrarily high total correlation can have a corresponding sampled representation with bounded total correlation. We also propose a method to eliminate this disparity. Experiments show that our model can learn a mean representation with much lower total correlation, hence a factorized mean representation. Moreover, we offer a detailed explanation of the limitations of factorizing aggregated posterior -- factor disintegration. Our work indicates a potential direction for future research of disentangled learning.


Walking with MIND: Mental Imagery eNhanceD Embodied QA

arXiv.org Artificial Intelligence

The EmbodiedQA is a task of training an embodied agent by intelligently navigating in a simulated environment and gathering visual information to answer questions. Existing approaches fail to explicitly model the mental imagery function of the agent, while the mental imagery is crucial to embodied cognition, and has a close relation to many high-level meta-skills such as generalization and interpretation. In this paper, we propose a novel Mental Imagery eNhanceD (MIND) module for the embodied agent, as well as a relevant deep reinforcement framework for training. The MIND module can not only model the dynamics of the environment (e.g. 'what might happen if the agent passes through a door') but also help the agent to create a better understanding of the environment (e.g. 'The refrigerator is usually in the kitchen'). Such knowledge makes the agent a faster and better learner in locating a feasible policy with only a few trails. Furthermore, the MIND module can generate mental images that are treated as short-term subgoals by our proposed deep reinforcement framework. These mental images facilitate policy learning since short-term subgoals are easy to achieve and reusable. This yields better planning efficiency than other algorithms that learn a policy directly from primitive actions. Finally, the mental images visualize the agent's intentions in a way that human can understand, and this endows our agent's actions with more interpretability. The experimental results and further analysis prove that the agent with the MIND module is superior to its counterparts not only in EQA performance but in many other aspects such as route planning, behavioral interpretation, and the ability to generalize from a few examples.


Adversarial camera stickers: A physical camera-based attack on deep learning systems

arXiv.org Machine Learning

Recent work has thoroughly documented the susceptibility of deep learning systems to adversarial examples, but most such instances directly manipulate the digital input to a classifier. Although a smaller line of work considers physical adversarial attacks, in all cases these involve manipulating the object of interest, e.g., putting a physical sticker on a object to misclassify it, or manufacturing an object specifically intended to be misclassified. In this work, we consider an alternative question: is it possible to fool deep classifiers, over all perceived objects of a certain type, by physically manipulating the camera itself? We show that this is indeed possible, that by placing a carefully crafted and mainly-translucent sticker over the lens of a camera, one can create universal perturbations of the observed images that are inconspicuous, yet reliably misclassify target objects as a different (targeted) class. To accomplish this, we propose an iterative procedure for both updating the attack perturbation (to make it adversarial for a given classifier), and the threat model itself (to ensure it is physically realizable). For example, we show that we can achieve physically-realizable attacks that fool ImageNet classifiers in a targeted fashion 49.6% of the time. This presents a new class of physically-realizable threat models to consider in the context of adversarially robust machine learning. Our demo video can be viewed at: https://youtu.be/wUVmL33Fx54