Vision and language pretraining in the absence of caption annotations
Consider for a moment what it takes to visually identify and describe something to another person. Now imagine that the other person can't see the object or image, so every detail matters. How do you decide which information is important and which isn't? You'd need to know exactly what everything is, where it is, and what it's doing in relation to other objects, and to note attributes like color and whether objects sit in the foreground or background. This exercise shows that translating images into words is a complex task (one humans perform so often and so innately that it can seem automatic) requiring a wide range of knowledge about many unique things. To translate this skill into artificial intelligence (AI), we need to carefully consider and adapt models to the deep relationships between words and objects, the expected and unexpected ways they interrelate, and how context, such as an object's environment and pose, shapes the subtleties of associating and understanding new objects within categories.
Oct-15-2020, 05:25:57 GMT