AITopics | Stone, Austin

Collaborating Authors

Stone, Austin

Learning Visual Composition through Improved Semantic Guidance

Stone, Austin, Soltau, Hagen, Geirhos, Robert, Yi, Xi, Xia, Ye, Cao, Bingyi, Chen, Kaifeng, Ogale, Abhijit, Shlens, Jonathon

arXiv.org Artificial Intelligence, Dec-19-2024

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning, where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation by developing bespoke learned architectures that directly target the shortcomings in compositional learning. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by substantially improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and surpasses all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
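For readers unfamiliar with the "standard contrastive learning" objective the abstract refers to, the following is a minimal sketch of a symmetric image-text contrastive loss of the kind used by CLIP-style models. It is not the authors' code: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of each list is assumed to be a matching image-caption pair.
    The temperature value is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and caption j, scaled.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: each image should select its own caption (the diagonal).
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should select its own image.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings (hypothetical): matched pairs versus swapped captions.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(aligned, aligned)
loss_mismatched = contrastive_loss(aligned, swapped)
```

The loss is small when each image is closest to its own caption and large when captions are swapped. Because the objective only compares whole-caption embeddings against whole-image embeddings, the quality of the captions directly shapes what the model can learn, which is the lever the abstract describes.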

caption, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2412.15396

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

Open-World Object Manipulation using Pre-trained Vision-Language Models

Stone, Austin, Xiao, Ted, Lu, Yao, Gopalakrishnan, Keerthana, Lee, Kuang-Huei, Vuong, Quan, Wohlhart, Paul, Kirmani, Sean, Zitkovich, Brianna, Xia, Fei, Finn, Chelsea, Hausman, Karol

arXiv.org Artificial Intelligence, Oct-25-2023

For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary, e.g. "can you get me the pink stuffed whale?" to their sensory observations and actions. This brings up a notably difficult challenge for robots: while robot learning approaches allow robots to learn many different behaviors from first-hand experience, it is impractical for robots to have first-hand experiences that span all of this semantic information. We would like a robot's policy to be able to perceive and pick up the pink stuffed whale, even if it has never seen any data interacting with a stuffed whale before. Fortunately, static data on the internet has vast semantic information, and this information is captured in pre-trained vision-language models. In this paper, we study whether we can interface robot policies with these pre-trained models, with the aim of allowing robots to complete instructions involving object categories that the robot has never seen first-hand. We develop a simple approach, which we call Manipulation of Open-World Objects (MOO), which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image, and conditions the robot policy on the current image, the instruction, and the extracted object information. In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments. In addition, we show how MOO generalizes to other, non-language-based input modalities to specify the object of interest such as finger pointing, and how it can be further extended to enable open-world navigation and manipulation. The project's website and evaluation videos can be found at https://robot-moo.github.io/

artificial intelligence, open-world object manipulation, pre-trained vision-language model

arXiv.org Artificial Intelligence

2303.00905

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

RT-1: Robotics Transformer for Real-World Control at Scale

Brohan, Anthony, Brown, Noah, Carbajal, Justice, Chebotar, Yevgen, Dabis, Joseph, Finn, Chelsea, Gopalakrishnan, Keerthana, Hausman, Karol, Herzog, Alex, Hsu, Jasmine, Ibarz, Julian, Ichter, Brian, Irpan, Alex, Jackson, Tomas, Jesmonth, Sally, Joshi, Nikhil J, Julian, Ryan, Kalashnikov, Dmitry, Kuang, Yuheng, Leal, Isabel, Lee, Kuang-Huei, Levine, Sergey, Lu, Yao, Malla, Utsav, Manjunath, Deeksha, Mordatch, Igor, Nachum, Ofir, Parada, Carolina, Peralta, Jodilyn, Perez, Emily, Pertsch, Karl, Quiambao, Jornell, Rao, Kanishka, Ryoo, Michael, Salazar, Grecia, Sanketi, Pannag, Sayed, Kevin, Singh, Jaspiar, Sontakke, Sumedh, Stone, Austin, Tan, Clayton, Tran, Huong, Vanhoucke, Vincent, Vega, Steve, Vuong, Quan, Xia, Fei, Xiao, Ted, Xu, Peng, Xu, Sichun, Yu, Tianhe, Zitkovich, Brianna

arXiv.org Artificial Intelligence, Aug-11-2023

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2212.06817

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Token Turing Machines

Ryoo, Michael S., Gopalakrishnan, Keerthana, Kahatapitiya, Kumara, Xiao, Ted, Rao, Kanishka, Stone, Austin, Lu, Yao, Ibarz, Julian, Arnab, Anurag

arXiv.org Artificial Intelligence, Apr-13-2023

[...] for handling longer sequence lengths themselves are often not sufficient since we do not want to run our entire transformer model for each time step when a new observation (e.g., a new frame) is provided. This necessitates developing models with explicit memories, enabling a model to fuse relevant past history with current observation to make a prediction at current time step. Another desideratum for such models, to scale to long sequence lengths, is that the computational cost at each time step should be constant, regardless of the length of the previous history. In this paper, we propose Token Turing Machines (TTMs), a sequential, auto-regressive model with external memory and constant computational time complexity at each step. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual [...]
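The bounded per-step cost described above can be illustrated with a small sketch. It assumes, for illustration only, that reading and writing are done by compressing a variable number of tokens into a fixed number via softmax-weighted averaging; the query vectors, sizes, and the `process` stand-in (which replaces the Transformer processing unit) are hypothetical and not taken from the paper's implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def summarize(tokens, queries):
    """Compress any number of tokens into len(queries) tokens.

    Each output token is a softmax-weighted average of the inputs, with
    weights from a dot product against one (hypothetical) query vector.
    """
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(t, q)) for t in tokens])
        out.append([sum(w * t[d] for w, t in zip(weights, tokens))
                    for d in range(len(q))])
    return out

def process(tokens):
    """Stand-in for the Transformer processing unit (an assumption)."""
    return [[math.tanh(x) for x in t] for t in tokens]

def step(memory, observation, read_q, write_q):
    """One step: read from memory + input, process, write a new memory."""
    read = summarize(memory + observation, read_q)
    output = process(read)
    new_memory = summarize(memory + output + observation, write_q)
    return new_memory, output

random.seed(0)
DIM, MEM, READ, OBS = 4, 6, 3, 5  # illustrative sizes

def rand_tokens(n):
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]

memory = rand_tokens(MEM)
read_q, write_q = rand_tokens(READ), rand_tokens(MEM)
memory_sizes = []
for _ in range(10):
    memory, output = step(memory, rand_tokens(OBS), read_q, write_q)
    memory_sizes.append(len(memory))
```

Each step touches only `MEM + READ + OBS` tokens regardless of how many observations came before, so the work per step stays constant while the memory carries a summary of the history.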

artificial intelligence, machine learning, transformer, (17 more...)

arXiv.org Artificial Intelligence

2211.09119

Genre:

Workflow (0.88)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Scaling Robot Learning with Semantically Imagined Experience

Yu, Tianhe, Xiao, Ted, Stone, Austin, Tompson, Jonathan, Brohan, Anthony, Wang, Su, Singh, Jaspiar, Tan, Clayton, M, Dee, Peralta, Jodilyn, Ichter, Brian, Hausman, Karol, Xia, Fei

arXiv.org Artificial Intelligence, Feb-22-2023

Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which are challenging to scale. To mitigate this issue, we propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing to obtain meaningful data for robot learning without requiring additional robot data. We term our method Robot Learning with Semantically Imagined Experience (ROSIE). Specifically, we make use of state-of-the-art text-to-image diffusion models and perform aggressive data augmentation on top of our existing robotic manipulation datasets via inpainting various unseen objects for manipulation, backgrounds, and distractors with text guidance. Through extensive real-world experiments, we show that manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors. In addition, we find that we can improve the robustness and generalization of high-level robot learning tasks such as success detection through training with the diffusion-based data augmentation. The project's website and videos can be found at diffusion-rosie.github.io

artificial intelligence, augmentation, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2302.1155

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Conditional Object-Centric Learning from Video

Kipf, Thomas, Elsayed, Gamaleldin F., Mahendran, Aravindh, Stone, Austin, Sabour, Sara, Heigold, Georg, Jonschkowski, Rico, Dosovitskiy, Alexey, Greff, Klaus

arXiv.org Machine Learning, Nov-24-2021

Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for any supervision. However, such fully-unsupervised methods still fail to scale to diverse realistic data, despite the use of increasingly complex inductive biases such as priors for the size of objects or the 3D geometry of the scene. In this paper, we instead take a weakly-supervised approach and focus on how 1) using the temporal dynamics of video data in the form of optical flow and 2) conditioning the model on simple object location cues can be used to enable segmenting and tracking objects in significantly more realistic synthetic data. We introduce a sequential extension to Slot Attention which we train to predict optical flow for realistic looking synthetic scenes and show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation. These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences. We also find that such initial-state-conditioning can be used during inference as a flexible interface to query the model for specific objects or parts of objects, which could pave the way for a range of weakly-supervised approaches and allow more effective interaction with trained models.

artificial intelligence, machine learning, savi, (17 more...)

arXiv.org Machine Learning

2111.12594

Country: Asia > Middle East (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

The Distracting Control Suite -- A Challenging Benchmark for Reinforcement Learning from Pixels

Stone, Austin, Ramirez, Oscar, Konolige, Kurt, Jonschkowski, Rico

arXiv.org Artificial Intelligence, Jan-7-2021

Robots have to face challenging perceptual settings, including changes in viewpoint, lighting, and background. Current simulated reinforcement learning (RL) benchmarks such as DM Control provide visual input without such complexity, which limits the transfer of well-performing methods to the real world. In this paper, we extend DM Control with three kinds of visual distractions (variations in background, color, and camera pose) to produce a new challenging benchmark for vision-based control, and we analyze state-of-the-art RL algorithms in these settings. Our experiments show that current RL methods for vision-based control perform poorly under distractions, and that their performance decreases with increasing distraction complexity, showing that new methods are needed to cope with the visual complexities of the real world. We also find that combinations of multiple distraction types are more difficult than a mere combination of their individual effects.
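To make the idea of visual distractions concrete, here is a rough sketch of two of the three kinds named above (background and color) applied to an already-rendered frame. Operating on rendered frames with a foreground mask is an assumption made for illustration; the benchmark itself extends DM Control, and the `distract` function, its parameters, and the toy data are hypothetical. Camera-pose variation is omitted because it requires re-rendering in the simulator.

```python
import random

def distract(frame, is_foreground, background, color_scale):
    """Return a distracted copy of an RGB frame.

    frame, background: H x W lists of [r, g, b] floats in [0, 1].
    is_foreground: H x W lists of booleans marking agent pixels.
    color_scale: per-channel multipliers applied to the foreground only.
    """
    out = []
    for y, row in enumerate(frame):
        new_row = []
        for x, pixel in enumerate(row):
            if is_foreground[y][x]:
                # Color distraction: rescale the channels, clipped to 1.0.
                new_row.append([min(1.0, c * s)
                                for c, s in zip(pixel, color_scale)])
            else:
                # Background distraction: swap in the replacement pixel.
                new_row.append(list(background[y][x]))
        out.append(new_row)
    return out

random.seed(0)
H, W = 2, 3
frame = [[[0.5, 0.5, 0.5] for _ in range(W)] for _ in range(H)]
mask = [[x == 1 for x in range(W)] for _ in range(H)]  # middle column is the agent
noise = [[[random.random() for _ in range(3)] for _ in range(W)]
         for _ in range(H)]
distracted = distract(frame, mask, noise, color_scale=(1.2, 1.0, 0.8))
```

Applying both changes in one pass mirrors the abstract's point that distraction types can be combined; the original frame is left unmodified so that clean and distracted observations can be compared.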

artificial intelligence, distraction, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2101.02722

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Towards Object Detection from Motion

Jonschkowski, Rico, Stone, Austin

arXiv.org Machine Learning, Sep-17-2019

We present a novel approach to weakly supervised object detection. Instead of annotated images, our method only requires two short videos to learn to detect a new object: 1) a video of a moving object and 2) one or more "negative" videos of the scene without the object. The key idea of our algorithm is to train the object detector to produce physically plausible object motion when applied to the first video and to not detect anything in the second video. With this approach, our method learns to locate objects without any object location annotations. Once the model is trained, it performs object detection on single images. We evaluate our method in three robotics settings that afford learning objects from motion: observing moving objects, watching demonstrations of object manipulation, and physically interacting with objects (see a video summary at https://youtu.be/BH0Hv3zZG_4).

artificial intelligence, neural network, video, (15 more...)

arXiv.org Machine Learning

1909.1295

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
