A Single Transformer for Scalable Vision-Language Modeling
