Microsoft's AI learns to answer questions about scenes from image-text pairs
Machines struggle to make sense of scenes and language without detailed accompanying annotations, but labeling is generally time-consuming and expensive, and even the best labels convey an understanding of scenes, not of language. To address the problem, Microsoft researchers built an AI system that trains on image-text pairs, mimicking the way humans improve their understanding of the world. They say their single-model encoder-decoder Vision-Language Pre-training (VLP) model, which can both generate image descriptions and answer natural-language questions about scenes, lays the groundwork for future frameworks that could reach human parity. A model pretrained on three million image-text pairs is available as open source on GitHub.
Oct-10-2019, 16:01:29 GMT