A multimodal developmental benchmark for language learning

Neural Information Processing Systems 

How (dis)similar are the learning trajectories of vision-language models and children? Recent modeling work has attempted to understand the gap between models' and humans' data efficiency by constructing models trained on less data, especially multimodal naturalistic data. However, such models are often evaluated on adult-level benchmarks, with limited breadth in the language abilities tested, and without direct comparison to behavioral data.