


Supplementary Material IEBins: Iterative Elastic Bins for Monocular Depth Estimation

Neural Information Processing Systems

Table 2 shows a performance trend similar to that on the NYU-Depth-v2 dataset as the number of bins increases. We report results on keyframes (selected by ORB-SLAM2) and on all frames of sequences 01-10. The ATE (m) metric is used.
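As a rough illustration of the metric mentioned above, absolute trajectory error (ATE) is commonly reported as the root-mean-square of per-frame position differences between an estimated and a ground-truth trajectory. The sketch below assumes the two trajectories are already time-synchronized and aligned to a common frame; the function name `ate_rmse` and the sample points are hypothetical and are not taken from the paper.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE (in the input unit, e.g. meters) of per-frame position error.

    Assumption: both trajectories are equal-length sequences of (x, y, z)
    positions that have already been aligned to the same coordinate frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between matching poses.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy trajectories, for illustration only.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.0, 0.2)]
print(ate_rmse(est, gt))
```

In practice the alignment step (for example a similarity transform fitted over the whole trajectory) matters as much as the error computation itself; it is omitted here to keep the sketch short.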



Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Zihao Wang

Neural Information Processing Systems

These additional behavior tokens are added to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc.
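To make the idea of extending a vocabulary with behavior tokens concrete, here is a minimal toy sketch. It assumes a plain dictionary vocabulary mapping token strings to integer ids; the token names such as `<behavior_0>` and the helper `extend_vocabulary` are hypothetical and do not reflect the authors' actual tokenizer or model code.

```python
def extend_vocabulary(vocab, new_tokens):
    """Return a copy of `vocab` with `new_tokens` appended at fresh ids.

    Assumption: `vocab` maps token strings to integer ids. Existing ids are
    left untouched so pretrained embeddings keep their meaning.
    """
    extended = dict(vocab)
    next_id = max(extended.values(), default=-1) + 1
    for token in new_tokens:
        if token not in extended:  # skip tokens that already exist
            extended[token] = next_id
            next_id += 1
    return extended


# Hypothetical base vocabulary and behavior tokens, for illustration only.
base_vocab = {"<pad>": 0, "hello": 1, "world": 2}
behavior_tokens = ["<behavior_0>", "<behavior_1>"]
vocab = extend_vocabulary(base_vocab, behavior_tokens)
print(vocab)
```

In a real model the embedding table would also need new rows for the added ids; that step is left out of this sketch.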





Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin, Muchao Ye

Neural Information Processing Systems

Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly study adversarial robustness under the white-box setting, which is unrealistic. In this paper, we investigate a new yet practical task: crafting image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks.