Hendrix, Rose
The One RING: a Robotic Indoor Navigation Generalist
Eftekhar, Ainaz, Weihs, Luca, Hendrix, Rose, Caglar, Ege, Salvador, Jordi, Herrasti, Alvaro, Han, Winson, VanderBilt, Eli, Kembhavi, Aniruddha, Farhadi, Ali, Krishna, Ranjay, Ehsani, Kiana, Zeng, Kuo-Hao
Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific; a policy learned with one robot's configuration typically does not generalize gracefully to another. Even small changes in body size or camera viewpoint may cause failures. With the recent surge in custom hardware development, it becomes necessary to learn a single policy that transfers across embodiments, eliminating the need to (re)train for each specific robot. In this paper, we introduce RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic policy trained solely in simulation with diverse, randomly initialized embodiments at scale. Specifically, we augment the AI2-THOR simulator with the ability to instantiate robot embodiments with controllable configurations, varying across body size, rotation pivot point, and camera configuration. In the visual object-goal navigation task, RING performs robustly on real, unseen robot platforms (Stretch RE-1, LoCoBot, Unitree's Go1), with average success rates of 72.1% across 5 embodiments in simulation and 78.9% across 4 robot platforms in the real world. (project website: https://one-ring-policy.allen.ai/)
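A minimal sketch of the kind of per-episode embodiment randomization the abstract describes (sampling body size, rotation pivot, and camera configuration before each training episode). This is not the authors' code; all field names, ranges, and the number of cameras are illustrative assumptions.

```python
# Sketch only: randomly sample an embodiment configuration per episode,
# as in RING-style embodiment-agnostic training. Field names are assumptions.
import random
from dataclasses import dataclass, field

@dataclass
class CameraConfig:
    height_m: float        # camera height above the floor
    pitch_deg: float       # downward tilt
    fov_deg: float         # horizontal field of view

@dataclass
class EmbodimentConfig:
    body_width_m: float
    body_length_m: float
    pivot_offset_m: float  # rotation pivot relative to body center
    cameras: list = field(default_factory=list)

def sample_random_embodiment() -> EmbodimentConfig:
    """Draw one random embodiment; training sees many such samples."""
    cameras = [
        CameraConfig(
            height_m=random.uniform(0.3, 1.5),
            pitch_deg=random.uniform(-30.0, 10.0),
            fov_deg=random.uniform(60.0, 100.0),
        )
        for _ in range(random.randint(1, 2))
    ]
    return EmbodimentConfig(
        body_width_m=random.uniform(0.2, 0.6),
        body_length_m=random.uniform(0.2, 0.7),
        pivot_offset_m=random.uniform(-0.1, 0.1),
        cameras=cameras,
    )
```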
SAT: Spatial Aptitude Training for Multimodal Language Models
Ray, Arijit, Duan, Jiafei, Tan, Reuben, Bashkirova, Dina, Hendrix, Rose, Ehsani, Kiana, Kembhavi, Aniruddha, Plummer, Bryan A., Krishna, Ranjay, Zeng, Kuo-Hao, Saenko, Kate
Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes, split across training and testing sets. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: 23% on CVBench, 8% on the harder BLINK benchmark, and 18% on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data and code are available at http://arijitray1993.github.io/SAT/ .
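To make the dataset description concrete, here is a hypothetical example of what one dynamic question-answer item might look like. The schema, field names, and filenames below are assumptions for illustration, not the released SAT format.

```python
# Illustrative sketch of a dynamic spatial QA item (schema is an assumption).
example_item = {
    "scene_id": "synthetic_scene_00042",         # hypothetical identifier
    "images": ["frame_before.png", "frame_after.png"],
    "task_type": "egocentric_action_recognition",
    "question": "Did the camera move left, move right, or stay still "
                "between the two frames?",
    "choices": ["moved left", "moved right", "stayed still"],
    "answer": "moved right",
}
```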
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Deitke, Matt, Clark, Christopher, Lee, Sangho, Tripathi, Rohun, Yang, Yue, Park, Jae Sung, Salehi, Mohammadreza, Muennighoff, Niklas, Lo, Kyle, Soldaini, Luca, Lu, Jiasen, Anderson, Taira, Bransom, Erin, Ehsani, Kiana, Ngo, Huong, Chen, YenSung, Patel, Ajay, Yatskar, Mark, Callison-Burch, Chris, Head, Andrew, Hendrix, Rose, Bastani, Favyen, VanderBilt, Eli, Lambert, Nathan, Chou, Yvonne, Chheda, Arnavi, Sparks, Jenna, Skjonsberg, Sam, Schmitz, Michael, Sarnat, Aaron, Bischoff, Byron, Walsh, Pete, Newell, Chris, Wolters, Piper, Gupta, Tanmay, Zeng, Kuo-Hao, Borchardt, Jon, Groeneveld, Dirk, Nam, Crystal, Lebrecht, Sophie, Wittlif, Caitlin, Schoenick, Carissa, Michel, Oscar, Krishna, Ranjay, Weihs, Luca, Smith, Noah A., Hajishirzi, Hannaneh, Girshick, Ross, Farhadi, Ali, Kembhavi, Aniruddha
Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open-weight and open-data models, but also outperforms larger proprietary models including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, ranking second only to GPT-4o on both academic benchmarks and a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.
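As a purely hypothetical illustration of the 2D pointing data mentioned above, a single annotation might pair a natural-language query with image coordinates. The abstract does not specify PixMo's format, so the fields and coordinate convention below are assumptions.

```python
# Hypothetical sketch of a 2D pointing record (not the actual PixMo schema).
pointing_example = {
    "image": "kitchen_scene.jpg",        # illustrative filename
    "query": "point to the red mug",
    "points": [{"x": 0.62, "y": 0.41}],  # assumed normalized image coordinates
}
```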
FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning
Hu, Jiaheng, Hendrix, Rose, Farhadi, Ali, Kembhavi, Aniruddha, Martin-Martin, Roberto, Stone, Peter, Zeng, Kuo-Hao, Ehsani, Kiana
In recent years, the robotics field has made several efforts toward building generalist robot policies through large-scale multi-task behavior cloning. However, direct deployment of these policies has led to unsatisfactory performance: the policies struggle with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale Reinforcement Learning fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance both on previously demonstrated and on entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79.5% in unseen environments, with absolute improvements of +23.6% in simulation and +30.7% on real robots over prior SoTA methods. Using only sparse rewards, our approach generalizes to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Videos can be found on the project website at https://robot-flare.github.io/
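The core recipe described above (start from a pre-trained policy, fine-tune with on-policy RL using only a sparse task-completion reward, and stabilize gradients) can be sketched roughly as below. This is a simplified REINFORCE-style stand-in under stated assumptions, not FLaRe's actual algorithm; the policy and environment interfaces are placeholders.

```python
# Minimal sketch (assumed interfaces): sparse-reward RL fine-tuning of a
# pre-trained policy with gradient clipping for stabilization.
import torch

def finetune_sparse_reward(policy, env, optimizer, episodes=1000, max_steps=200):
    for _ in range(episodes):
        obs = env.reset()
        log_probs, success = [], False
        for _ in range(max_steps):
            dist = policy(obs)                      # assumed: returns a torch distribution
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, done, success = env.step(action)   # sparse: success only at episode end
            if done:
                break
        ret = 1.0 if success else 0.0               # sparse task-completion reward
        loss = -ret * torch.stack(log_probs).sum()  # REINFORCE-style objective
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)  # stabilization
        optimizer.step()
```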
PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
Zeng, Kuo-Hao, Zhang, Zichen, Ehsani, Kiana, Hendrix, Rose, Salvador, Jordi, Herrasti, Alvaro, Girshick, Ross, Kembhavi, Aniruddha, Weihs, Luca
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.
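A rough architectural sketch of the encoder-decoder split described above: per-frame features from a vision-transformer encoder feed a causal transformer that attends over the history and emits per-step action logits. This is an assumption-laden illustration in PyTorch, not the released PoliFormer code; dimensions, layer counts, and action space are made up.

```python
# Sketch (assumed dimensions and layer counts) of a ViT-feature + causal
# transformer navigation policy producing per-step action logits.
import torch
import torch.nn as nn

class CausalNavPolicy(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=512, num_actions=6,
                 num_layers=3, num_heads=8, max_len=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)    # project ViT features
        self.pos = nn.Embedding(max_len, hidden_dim)   # learned positional embeddings
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, vit_features):                   # (B, T, feat_dim)
        B, T, _ = vit_features.shape
        device = vit_features.device
        x = self.proj(vit_features) + self.pos(torch.arange(T, device=device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(device)
        x = self.decoder(x, mask=mask)                 # causal self-attention over history
        return self.action_head(x)                     # (B, T, num_actions)

# Usage: logits = CausalNavPolicy()(torch.randn(2, 16, 768))
```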
Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
Ehsani, Kiana, Gupta, Tanmay, Hendrix, Rose, Salvador, Jordi, Weihs, Luca, Zeng, Kuo-Hao, Singh, Kunal Pratap, Kim, Yejin, Han, Winson, Herrasti, Alvaro, Krishna, Ranjay, Schwenk, Dustin, VanderBilt, Eli, Kembhavi, Aniruddha
Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely expensive. In this work, we show that imitating shortest-path planners in simulation produces agents that, given a language instruction, can proficiently navigate, explore, and manipulate objects both in simulation and in the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end, transformer-based SPOC architecture, powerful visual encoders paired with extensive image augmentation, and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200,000 procedurally generated houses containing 40,000 unique 3D assets. Our models, data, training code, and newly proposed 10-task benchmarking suite CHORES will be open-sourced.
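The training signal described above is straightforward behavior cloning on shortest-path expert trajectories. A minimal sketch of one update step, assuming a policy that maps RGB frame sequences to action logits (interfaces and shapes are assumptions, not the authors' code):

```python
# Sketch only: behavior cloning on shortest-path expert actions via
# per-step cross-entropy. The policy interface is assumed.
import torch
import torch.nn.functional as F

def bc_step(policy, optimizer, frames, expert_actions):
    """frames: (B, T, C, H, W) RGB; expert_actions: (B, T) expert action indices."""
    logits = policy(frames)                           # (B, T, num_actions)
    loss = F.cross_entropy(logits.flatten(0, 1), expert_actions.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```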
Phone2Proc: Bringing Robust Robots Into Our Chaotic World
Deitke, Matt, Hendrix, Rose, Weihs, Luca, Farhadi, Ali, Ehsani, Kiana, Kembhavi, Aniruddha
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Using just a simple RGB camera, agents trained with Phone2Proc improve sim-to-real ObjectNav success from 34.7% to 70.7% across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
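The conditional generation idea above can be sketched as: keep the scanned structure (walls, large furniture) fixed while resampling everything else per generated scene. The field names, asset names, and sampling ranges below are assumptions for illustration, not the Phone2Proc implementation.

```python
# Sketch only (assumed schema): generate a scene variant conditioned on a
# phone scan while randomizing lighting, clutter, textures, and small objects.
import random

def generate_scene_variant(scan):
    """`scan` holds the wall layout and large-object placements from the phone scan."""
    return {
        "walls": scan["walls"],                        # fixed: from the scan
        "large_objects": scan["large_objects"],        # fixed: from the scan
        "lighting": {"intensity": random.uniform(0.4, 1.2),
                     "color_temp_k": random.uniform(2700, 6500)},
        "clutter_density": random.uniform(0.0, 1.0),   # sampled per scene
        "surface_textures": random.choice(["wood", "tile", "carpet"]),
        "small_objects": [{"asset": random.choice(["mug", "book", "plant"]),
                           "placement": "random"}
                          for _ in range(random.randint(3, 15))],
    }
```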