FindView: Precise Target View Localization Task for Look Around Agents

Ishikawa, Haruya, Aoki, Yoshimitsu

arXiv.org Artificial Intelligence 

Research on embodied agents aims to create agents that use visual sensors to solve complex tasks or aid humans by learning to perceive, communicate, and act in their environment. Having humans in the loop makes this goal very difficult, since the dynamics of the environment change and human interactions can lead to unexpected events. For better collaboration with humans, agents must be able to localize any point in space in a way that reflects the characteristics of human perception of 3D space (Cirik et al. [2020]). Since the agents' visual sensors are commonly RGB sensors with a partial field of view (FoV), these agents need to be trained to perceive how humans see from such views. Communicating with these agents will almost always require them to navigate to a common reference FoV in the scene so that the human can give instructions within a shared context. A challenge arises because the point of interest could be any point in the scene, and many points in the scene do not correspond to easily named objects. So far, many embodied agents studied in the literature either use partial FoVs or operate directly on panoramic images, which are hard for human observers to interpret. We believe that embodied agents should be able to look around and localize the various views that human observers might be looking at. We approach this problem by introducing a new task, the FindView task, to evaluate and benchmark such agents (Figure 1).
