We describe an approach to predict open-vocabulary 3D semantic voxel occupancy map from input 2D images with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries.
Our dataset significantly surpasses existing labeled step datasets in terms of scale, number of tasks, and richness of natural language step descriptions.
Visual and auditory perception is crucial for observing the world. When we hear a sound, our brain will extract semantic information and locate the sounding source.