AITopics

Country: Europe (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)

Medhini Narasimhan, Svetlana Lazebnik, Alexander Schwing

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

Neural Information Processing Systems Feb-14-2026, 11:19:29 GMT

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel 'fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation.

machine learning, natural language, question answering, (21 more...)

Country:

North America > United States > Illinois (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.36)

Simon Kohl, Bernardino Romera-Paredes, Clemens Meyer, Jeffrey De Fauw, Joseph R. Ledsam, Klaus Maier-Hein, S. M. Ali Eslami, Danilo Jimenez Rezende, Olaf Ronneberger

A Probabilistic U-Net for Segmentation of Ambiguous Images

Neural Information Processing Systems Feb-12-2026, 17:46:12 GMT

Many real-world vision problems suffer from inherent ambiguities. In clinical applications, for example, it might not be clear from a CT scan alone which particular region is cancer tissue. Therefore a group of graders typically produces a set of diverse but plausible segmentations.

ambiguity, artificial intelligence, machine learning, (17 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.95)
Health & Medicine > Therapeutic Area (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Jingxiang Lin, Unnat Jain, Alexander Schwing

TAB-VCR: Tags and Attributes based VCR Baselines

Neural Information Processing Systems Feb-11-2026, 16:12:01 GMT

W evaluated prior the single released.

artificial intelligence, machine learning, natural language, (17 more...)

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > Canada > Alberta (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.41)

arXiv.org Artificial Intelligence Oct-29-2025

HyPerNav: Hybrid Perception for Object-Oriented Navigation in Unknown Environment

Yin, Zecheng, Zhao, Hao, Li, Zhen

Objective-oriented navigation (ObjNav) enables a robot to navigate to a target object directly and autonomously in an unknown environment. Effective perception is critical for autonomous robots navigating unknown environments. While egocentric observations from RGB-D sensors provide abundant local information, real-time top-down maps offer valuable global context for ObjNav. Nevertheless, the majority of existing studies focus on a single source and seldom integrate these two complementary perceptual modalities, even though humans naturally attend to both. With the rapid advancement of Vision-Language Models (VLMs), we propose Hybrid Perception Navigation (HyPerNav), which leverages VLMs' strong reasoning and vision-language understanding capabilities to jointly perceive local and global information and so enhance the effectiveness and intelligence of navigation in unknown environments. In both large-scale simulation evaluation and real-world validation, our method achieved state-of-the-art performance against popular baselines. By simultaneously leveraging information from egocentric observations and the top-down map, the hybrid perception approach captures richer cues and finds objects more effectively. Our ablation study further showed that each of the two perception sources contributes to navigation performance. The code and datasets are publicly available. Navigating to a target objective given in human language is a key ability for fully autonomous robots.

natural language, navigation, object-oriented architecture, (19 more...)

2510.22917

Country:

Asia > China (0.29)
Europe > Austria (0.28)

Genre: Research Report (0.51)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.83)

Jingxiang Lin, Unnat Jain, Alexander Schwing

TAB-VCR: Tags and Attributes based VCR Baselines

Neural Information Processing Systems Oct-2-2025, 08:37:34 GMT

Reasoning is an important ability that we learn from a very early age.

detection, machine learning, natural language, (20 more...)

Country: North America (0.28)

Industry:

Leisure & Entertainment (0.67)
Education (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Alexander Kirillov, Dmytro Shlezinger, Dmitry P. Vetrov, Carsten Rother, Bogdan Savchynskyy

M-Best-Diverse Labelings for Submodular Energies and Beyond

Neural Information Processing Systems Oct-2-2025, 04:42:32 GMT

We consider the problem of finding M best diverse solutions of energy minimization problems for graphical models. Contrary to the sequential method of Batra et al., which greedily finds one solution after another, we infer all M solutions jointly. It was shown recently that such jointly inferred labelings not only have smaller total energy but also qualitatively outperform the sequentially obtained ones. The only obstacle to using this new technique is the complexity of the corresponding inference problem, since it is a considerably slower algorithm than the method of Batra et al. In this work we show that the joint inference of M best diverse solutions can be formulated as a submodular energy minimization if the original MAP-inference problem is submodular, hence fast inference techniques can be used. In addition to the theoretical results, we provide practical algorithms that outperform the current state of the art and can be used in both the submodular and non-submodular cases.
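The contrast between sequential and joint inference can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it uses brute-force enumeration on a three-variable binary chain, and it assumes a diversity-augmented objective of the form "sum of energies minus lam times the sum of pairwise Hamming distances". The unary costs, the pairwise weight, lam, and all function names are hypothetical choices made only for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of positions where two labelings differ (assumed diversity measure)."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy model: 3 binary variables on a chain with Potts pairwise terms.
UNARY = [(0.0, 1.0), (0.0, 0.2), (0.5, 0.0)]
PAIRWISE_WEIGHT = 0.3

def energy(labels):
    """Unary costs plus a penalty for each pair of neighbours that disagree."""
    unary = sum(UNARY[i][l] for i, l in enumerate(labels))
    pairwise = sum(PAIRWISE_WEIGHT for a, b in zip(labels, labels[1:]) if a != b)
    return unary + pairwise

def objective(solutions, lam):
    """Total energy of all solutions minus lam times their pairwise diversity."""
    total = sum(energy(s) for s in solutions)
    diversity = sum(
        hamming(a, b)
        for i, a in enumerate(solutions)
        for b in solutions[i + 1:]
    )
    return total - lam * diversity

def greedy_diverse(space, m, lam):
    """Sequential scheme: pick one solution at a time, rewarding distance
    from the solutions already chosen."""
    chosen = []
    for _ in range(m):
        best = min(
            space,
            key=lambda y: energy(y) - lam * sum(hamming(y, s) for s in chosen),
        )
        chosen.append(best)
    return chosen

def joint_diverse(space, m, lam):
    """Joint scheme: search over all m-tuples at once (brute force, toy sizes only)."""
    return list(min(product(space, repeat=m), key=lambda sols: objective(sols, lam)))

SPACE = list(product((0, 1), repeat=len(UNARY)))
LAM = 0.4
greedy = greedy_diverse(SPACE, 3, LAM)
joint = joint_diverse(SPACE, 3, LAM)
print("greedy objective:", objective(greedy, LAM))
print("joint objective: ", objective(joint, LAM))
```

Because the joint search minimizes over every m-tuple, including the one the greedy scheme returns, its objective value can never be worse than the greedy one. The enumeration is exponential in both the number of variables and M, which is the cost that the submodular reformulation described above is meant to avoid.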

artificial intelligence, diversity measure, machine learning, (15 more...)

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > Germany > Saxony > Dresden (0.04)
Asia > Russia (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

arXiv.org Artificial Intelligence Dec-27-2024

Aim My Robot: Precision Local Navigation to Any Object

Meng, Xiangyun, Yang, Xuning, Jung, Sanghun, Ramos, Fabio, Jujjavarapu, Srid Sadhan, Paul, Sanjoy, Fox, Dieter

Existing navigation systems mostly consider "success" when the robot reaches within 1m radius to a goal. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning. ... the goal reached when the robot is within 1m radius to the object goal [8], [11], [12]. This lax definition of success hinders their applicability to the growing need for mobile robots to navigate to objects with precisely ... But this usually requires specific information such as 3D models [13], and the object being initially visible. This limits its applicability when the object 3D model is not available or the object is initially out ...

large language model, machine learning, natural language, (19 more...)

2411.1477

Country: North America > United States > Washington > King County > Seattle (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.46)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.34)

Yin, Zecheng, Cheng, Chonghao, Lizhen

Navigation with VLM framework: Go to Any Language

arXiv.org Artificial Intelligence Sep-17-2024

Navigating towards fully open language goals and exploring open scenes in a manner akin to human exploration have always posed significant challenges. Recently, Vision Large Language Models (VLMs) have demonstrated remarkable capabilities in reasoning with both language and visual data. While many works have focused on leveraging VLMs for navigation in open scenes and with open vocabularies, these efforts often fall short of fully utilizing the potential of VLMs or require substantial computational resources. We introduce Navigation with VLM (NavVLM), a framework that harnesses equipment-level VLMs to enable agents to navigate towards any language goal, specific or non-specific, in open scenes, emulating human exploration behaviors without any prior training. The agent leverages the VLM as its cognitive core to perceive environmental information based on any language goal, and the VLM constantly provides exploration guidance during navigation until the agent reaches the target location or area. Our framework not only achieves state-of-the-art performance in Success Rate (SR) and Success weighted by Path Length (SPL) in traditional specific-goal settings but also extends navigation capabilities to any open-set language goal. We evaluate NavVLM in richly detailed environments from the Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson datasets within the Habitat simulator. With the power of VLMs, navigation has entered a new era.
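The two metrics named above, SR and SPL, are simple to compute from per-episode logs. The sketch below assumes the commonly used definition of SPL, in which each successful episode contributes the ratio of the shortest-path length to the larger of the shortest-path length and the path length the agent actually travelled. The function names and the three sample episodes are hypothetical and are not taken from the NavVLM evaluation.

```python
from typing import Sequence

def success_rate(successes: Sequence[int]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not successes:
        raise ValueError("at least one episode is required")
    return sum(successes) / len(successes)

def spl(
    successes: Sequence[int],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]        -- 1 if episode i succeeded, else 0
    shortest_lengths[i] -- shortest-path length from start to goal
    path_lengths[i]     -- length of the path the agent actually took
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # If the agent starts on the goal, both lengths are 0; count it as optimal.
        efficiency = 1.0 if denom == 0 else shortest / denom
        total += s * efficiency
    return total / n

# Hypothetical episodes: one optimal success, one success with a detour, one failure.
demo_successes = [1, 1, 0]
demo_shortest = [5.0, 4.0, 3.0]
demo_taken = [5.0, 8.0, 6.0]
print(success_rate(demo_successes), spl(demo_successes, demo_shortest, demo_taken))
```

SPL never exceeds SR, because each successful episode contributes at most 1 and each failed episode contributes 0. A large gap between the two indicates that the agent succeeds but takes long detours.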

large language model, machine learning, natural language, (19 more...)

2410.02787

Country:

Asia > China > Guangdong Province > Shenzhen (0.06)
Asia > China > Hong Kong (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Jun-28-2024

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Zeng, Kuo-Hao, Zhang, Zichen, Ehsani, Kiana, Hendrix, Rose, Salvador, Jordi, Herrasti, Alvaro, Girshick, Ross, Kembhavi, Aniruddha, Weihs, Luca

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.

agent, benchmark, ormer, (13 more...)