LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Neural Information Processing Systems 

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format).