batra
Joint M-Best-Diverse Labelings as a Parametric Submodular Minimization
Alexander Kirillov, Alexander Shekhovtsov, Carsten Rother, Bogdan Savchynskyy
We consider the problem of jointly inferring the M-best diverse labelings for a binary (high-order) submodular energy of a graphical model. Recently, it was shown that this problem can be solved to a global optimum for many practically interesting diversity measures. It was noted that the resulting labelings are so-called nested. This nestedness property also holds for labelings of a class of parametric submodular minimization problems, where different values of the global parameter γ give rise to different solutions. A popular example of parametric submodular minimization is the monotonic parametric max-flow problem, which is also widely used for computing multiple labelings.
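The nestedness property mentioned above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' method: it assumes a small chain model with integer unary costs, a non-negative Potts-style pairwise weight (which keeps the binary energy submodular), and a parameter γ added to every unary term. The names `energy`, `minimize`, and `is_nested` are hypothetical, and brute-force enumeration stands in for max-flow.

```python
from itertools import product

def energy(y, unary, edges, weight, gamma):
    """Binary energy: (unary_i + gamma) * y_i plus weight * |y_i - y_j| per edge."""
    unary_part = sum((u + gamma) * yi for u, yi in zip(unary, y))
    pairwise_part = sum(weight * abs(y[i] - y[j]) for i, j in edges)
    return unary_part + pairwise_part

def minimize(unary, edges, weight, gamma):
    """Brute-force minimizer; ties are broken in favor of fewer 1-labels."""
    candidates = product((0, 1), repeat=len(unary))
    return min(candidates, key=lambda y: (energy(y, unary, edges, weight, gamma), sum(y)))

def is_nested(labelings):
    """True if each labeling's set of 1-labels contains that of the next one."""
    return all(
        all(b <= a for a, b in zip(prev, nxt))
        for prev, nxt in zip(labelings, labelings[1:])
    )

# Illustrative (made-up) integer costs on a 5-node chain.
unary = [-20, -5, 3, -12, 8]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
weight = 4
gammas = [-10, 0, 6, 15, 30]  # increasing values of the global parameter

solutions = [minimize(unary, edges, weight, g) for g in gammas]
for g, y in zip(gammas, solutions):
    print(g, y)
print("nested:", is_nested(solutions))
```

As γ grows, labeling a node 1 becomes more expensive, so the set of nodes labeled 1 can only shrink. Integer costs are used so that tie-breaking is exact.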
Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
Medhini Narasimhan, Svetlana Lazebnik, Alexander Schwing
Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation.
HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment
Yin, Zecheng, Zhao, Hao, Li, Zhen
Objective-oriented navigation (ObjNav) enables a robot to navigate to a target object directly and autonomously in an unknown environment. Effective perception is critical for autonomous robots navigating unknown environments. While egocentric observations from RGB-D sensors provide abundant local information, real-time top-down maps offer valuable global context for ObjNav. Nevertheless, the majority of existing studies focus on a single source and seldom integrate these two complementary perceptual modalities, despite the fact that humans naturally attend to both. With the rapid advancement of Vision-Language Models (VLMs), we propose Hybrid Perception Navigation (HyPerNav), which leverages VLMs' strong reasoning and vision-language understanding capabilities to jointly perceive local and global information and thereby enhance the effectiveness and intelligence of navigation in unknown environments. In both large-scale simulation evaluation and real-world validation, our method achieves state-of-the-art performance against popular baselines. Benefiting from the hybrid perception approach, our method captures richer cues and finds objects more effectively by simultaneously leveraging information from egocentric observations and the top-down map. Our ablation study further shows that each of the two perception sources contributes to navigation performance. The code and datasets are publicly available. Navigating to a target objective given in human language is a key ability for fully autonomous robots.
M-Best-Diverse Labelings for Submodular Energies and Beyond
Alexander Kirillov, Dmytro Shlezinger, Dmitry P. Vetrov, Carsten Rother, Bogdan Savchynskyy
We consider the problem of finding M best diverse solutions of energy minimization problems for graphical models. Contrary to the sequential method of Batra et al., which greedily finds one solution after another, we infer all M solutions jointly. It was shown recently that such jointly inferred labelings not only have smaller total energy but also qualitatively outperform the sequentially obtained ones. The only obstacle to using this new technique is the complexity of the corresponding inference problem, since it is a considerably slower algorithm than the method of Batra et al. In this work we show that the joint inference of M best diverse solutions can be formulated as a submodular energy minimization if the original MAP-inference problem is submodular; hence fast inference techniques can be used. In addition to the theoretical results, we provide practical algorithms that outperform the current state of the art and can be used in both the submodular and non-submodular cases.
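The difference between sequential and joint inference can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the algorithm from the paper: it assumes Hamming distance as the diversity measure, a trade-off weight `lam`, and a tiny chain model small enough for brute-force enumeration. All function names and cost values are hypothetical.

```python
from itertools import combinations, product

def energy(y, unary, edges, weight):
    """Binary energy with unary costs and a Potts-style pairwise term."""
    return (sum(u * yi for u, yi in zip(unary, y))
            + sum(weight * abs(y[i] - y[j]) for i, j in edges))

def hamming(a, b):
    """Number of positions where two labelings differ (assumed diversity measure)."""
    return sum(ai != bi for ai, bi in zip(a, b))

def joint_objective(labelings, unary, edges, weight, lam):
    """Total energy minus lam times the sum of pairwise diversities."""
    total = sum(energy(y, unary, edges, weight) for y in labelings)
    diversity = sum(hamming(a, b) for a, b in combinations(labelings, 2))
    return total - lam * diversity

def sequential_diverse(unary, edges, weight, lam, m):
    """Greedy scheme: each new labeling is rewarded for differing from earlier ones."""
    space = list(product((0, 1), repeat=len(unary)))
    found = []
    for _ in range(m):
        best = min(space, key=lambda y: energy(y, unary, edges, weight)
                   - lam * sum(hamming(y, p) for p in found))
        found.append(best)
    return found

def joint_diverse(unary, edges, weight, lam, m):
    """Exhaustive search over all m-tuples of labelings."""
    space = list(product((0, 1), repeat=len(unary)))
    return min(product(space, repeat=m),
               key=lambda ys: joint_objective(ys, unary, edges, weight, lam))

# Illustrative (made-up) costs on a 4-node chain.
unary, edges, weight, lam, m = [-3, 1, -1, 2], [(0, 1), (1, 2), (2, 3)], 1, 1, 3

seq = sequential_diverse(unary, edges, weight, lam, m)
jnt = joint_diverse(unary, edges, weight, lam, m)
seq_value = joint_objective(seq, unary, edges, weight, lam)
jnt_value = joint_objective(jnt, unary, edges, weight, lam)
print("sequential:", seq, seq_value)
print("joint:     ", jnt, jnt_value)
```

By construction, the joint optimum can never have a worse joint objective than the greedy result, since the greedy tuple is one of the candidates the joint search considers. The exhaustive search is exponential in both the number of nodes and M, which is the cost that a submodular reformulation would avoid.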
Aim My Robot: Precision Local Navigation to Any Object
Meng, Xiangyun, Yang, Xuning, Jung, Sanghun, Ramos, Fabio, Jujjavarapu, Srid Sadhan, Paul, Sanjoy, Fox, Dieter
Existing navigation systems mostly consider "success" when the robot reaches within 1m radius to a goal. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning. [...] the goal reached when the robot is within 1m radius to the object goal [8], [11], [12]. This lax definition of success hinders their applicability to the growing need for mobile robots to navigate to objects with precisely [...] But this usually requires specific information such as 3D models [13], and the object being initially visible. This limits its applicability when the object 3D model is not available or the object is initially out [...]
Navigation with VLM framework: Go to Any Language
Yin, Zecheng, Cheng, Chonghao, Li, Zhen
Navigating towards fully open language goals and exploring open scenes in a manner akin to human exploration have always posed significant challenges. Recently, Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data. While many works have focused on leveraging VLMs for navigation in open scenes and with open vocabularies, these efforts often fall short of fully utilizing the potential of VLMs or require substantial computational resources. We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes, emulating human exploration behaviors without any prior training. The agent uses the VLM as its cognitive core to perceive environmental information based on any language goal, and the VLM continually provides exploration guidance during navigation until the agent reaches the target location or area. Our framework not only achieves state-of-the-art performance in Success Rate (SR) and Success weighted by Path Length (SPL) in traditional specific-goal settings but also extends the navigation capabilities to any open-set language goal. We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator. With the power of VLMs, navigation has entered a new era.
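For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how SR and SPL are commonly computed. It assumes the widely used definition of SPL, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The `Episode` class, its field names, and the sample values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool         # whether the agent reached the goal
    shortest_path: float  # length of the shortest path from start to goal
    agent_path: float     # length of the path the agent actually took

def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute zero; successful ones are
            # discounted when the agent's path exceeds the shortest path.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)

# Made-up episodes: an optimal success, a success with a 2x detour, a failure.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),
    Episode(success=True, shortest_path=10.0, agent_path=20.0),
    Episode(success=False, shortest_path=10.0, agent_path=5.0),
]
print("SR: ", success_rate(episodes))
print("SPL:", spl(episodes))
```

SPL is never larger than SR, because each successful episode contributes at most 1 and each failure contributes 0.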
PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
Zeng, Kuo-Hao, Zhang, Zichen, Ehsani, Kiana, Hendrix, Rose, Salvador, Jordi, Herrasti, Alvaro, Girshick, Ross, Kembhavi, Aniruddha, Weihs, Luca
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.