Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Neural Information Processing Systems 

First, a dynamic vision module enables a variable and learnable number of box proposals.