Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
–Neural Information Processing Systems
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. We systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either.
Oct-9-2024, 08:35:58 GMT