
Infant




Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others

Neural Information Processing Systems

To achieve human-like common sense about everyday life, machine learning systems must understand and reason about the goals, preferences, and actions of other agents in the environment. By the end of their first year of life, human infants intuitively achieve such common sense, and these cognitive achievements lay the foundation for humans' rich and complex understanding of the mental states of others. Can machines achieve generalizable, commonsense reasoning about other agents like human infants?
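
As a rough illustration of how such a benchmark can be scored, the sketch below assumes a violation-of-expectations-style evaluation in which a model assigns a surprisal score to expected and unexpected outcomes; the `surprisal` interface and trial structure are hypothetical stand-ins, not BIB's actual API.

```python
# Minimal sketch of violation-of-expectations (VOE) scoring for an
# agent-reasoning benchmark such as BIB. Assumes a model exposing a
# surprisal(context_frames, test_frames) -> float method; the method name
# and the "expected"/"unexpected" trial keys are illustrative assumptions.

def voe_accuracy(model, trials):
    """Fraction of trials where the unexpected outcome is more surprising."""
    correct = 0
    for trial in trials:
        s_expected = model.surprisal(trial["context"], trial["expected"])
        s_unexpected = model.surprisal(trial["context"], trial["unexpected"])
        # A model with infant-like expectations should find the
        # goal-inconsistent (unexpected) outcome more surprising.
        correct += int(s_unexpected > s_expected)
    return correct / len(trials)
```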


Baby chimpanzees like to free fall through trees

Popular Science

Chimp infants are three times more likely to take risks than adults. Given the many similarities between humans and chimpanzees, one might assume that both species engage in risky behavior within the same age range. However, according to a recently published study, it turns out that in chimps, it's the infants you have to watch out for. After studying videos of 119 wild chimpanzees, researchers found that chimpanzees' risky behavior peaks in infancy and then lessens as they get older.


Opinion: Learning Intuitive Physics May Require More than Visual Data

Su, Ellen, Legris, Solim, Gureckis, Todd M., Ren, Mengye

arXiv.org Artificial Intelligence

Humans expertly navigate the world by building rich internal models founded on an intuitive understanding of physics. Meanwhile, despite training on vast quantities of internet video data, state-of-the-art deep learning models still fall short of human-level performance on intuitive physics benchmarks. This work investigates whether data distribution, rather than volume, is the key to learning these principles. We pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, a developmentally realistic, egocentric video dataset partially capturing three children's everyday visual experiences. We find that training on this dataset, which represents 0.01% of the data volume used to train SOTA models, does not lead to significant performance improvements on the IntPhys2 benchmark. Our results suggest that merely training on a developmentally realistic dataset is insufficient for current architectures to learn representations that support intuitive physics. We conclude that varying visual data volume and distribution alone may not be sufficient for building systems with artificial intuitive physics.
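
For readers unfamiliar with the architecture, the sketch below shows the general shape of a JEPA-style pretraining step, in which a predictor regresses the latent representations of masked space-time patches produced by an EMA target encoder. The module interfaces, masking scheme, and L1 objective are simplifying assumptions, not the exact V-JEPA training recipe.

```python
import copy
import torch
import torch.nn as nn

# Sketch of a JEPA-style step: predict the latents of masked space-time
# patches from the visible ones. `encoder` and `predictor` stand in for
# V-JEPA's ViT backbone and narrow predictor; both are assumed to map
# (B, N, D) token sequences to (B, N, D) outputs.

class JEPAStep(nn.Module):
    def __init__(self, encoder: nn.Module, predictor: nn.Module, ema: float = 0.999):
        super().__init__()
        self.encoder = encoder                 # online context encoder
        self.target = copy.deepcopy(encoder)   # EMA target encoder (frozen)
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = predictor
        self.ema = ema

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) tokenized space-time patches; mask: (B, N) bool,
        # True where a patch is hidden from the context encoder.
        ctx = self.encoder(patches.masked_fill(mask.unsqueeze(-1), 0.0))
        with torch.no_grad():
            tgt = self.target(patches)         # full-view target latents
        pred = self.predictor(ctx)             # predict latents at all positions
        # Regression loss only on the masked positions.
        return (pred - tgt).abs()[mask].mean()

    @torch.no_grad()
    def update_target(self):
        # Exponential moving average update of the target encoder.
        for p_t, p_o in zip(self.target.parameters(), self.encoder.parameters()):
            p_t.mul_(self.ema).add_(p_o, alpha=1.0 - self.ema)
```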


Assessing the alignment between infants' visual and linguistic experience using multimodal language models

Tan, Alvin Wei Ming, Yang, Jane, Sepuri, Tarun, Aw, Khai Loong, Sparks, Robert Z., Yin, Zi, Marchman, Virginia A., Frank, Michael C., Long, Bria

arXiv.org Artificial Intelligence

Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
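
As a rough sketch of what CLIP-based alignment scoring can look like, the snippet below computes a cosine similarity between a single frame and the utterance spoken around it using a public CLIP checkpoint. The checkpoint choice, frame/utterance pairing, and any threshold for calling a moment "aligned" are assumptions, not the paper's validated pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score vision-language alignment for one frame/utterance pair with an
# off-the-shelf CLIP model. Checkpoint and pairing logic are illustrative.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def alignment_score(frame: Image.Image, utterance: str) -> float:
    """Cosine similarity between a video frame and a co-occurring utterance."""
    inputs = processor(text=[utterance], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Example usage (hypothetical file name):
# score = alignment_score(Image.open("frame_000123.jpg"), "look at the ball")
```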



Chimpanzees' brutal battle for territory leads to a baby boom

Popular Science

A rival chimp can die in less than 15 minutes during these deadly territorial fights. New research led by UCLA and the University of Michigan has shown that chimp communities that kill their neighbors to gain territory also gain reproductive advantages. Uganda's Ngogo chimpanzees are well known for their "chimpanzee warfare." Primatologists have observed their brutal, lethal fights, sometimes involving 10 or more chimpanzees, for decades, deciphering what leads to such violence.


Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard

Zarifis, Stelios, Chalkiadakis, Ioannis, Chardouveli, Artemis, Moutzouri, Vasiliki, Sotirchos, Aggelos, Papadimitriou, Katerina, Filntisis, Panagiotis, Efthymiou, Niki, Maragos, Petros, Pastra, Katerina

arXiv.org Artificial Intelligence

Inspired by infant development, we propose a Reinforcement Learning (RL) framework for autonomous self-exploration in a robotic agent, Baby Sophia, using the BabyBench simulation environment. The agent learns self-touch and hand regard behaviors through intrinsic rewards that mimic an infant's curiosity-driven exploration of its own body. For self-touch, high-dimensional tactile inputs are transformed into compact, meaningful representations, enabling efficient learning. The agent then discovers new tactile contacts through intrinsic rewards and curriculum learning that encourage broad body coverage, balance, and generalization. For hand regard, visual features of the hands, such as skin color and shape, are learned through motor babbling. Intrinsic rewards then encourage the agent to perform novel hand motions and to follow its hands with its gaze. A curriculum that progresses from single-hand to dual-hand training allows the agent to achieve complex visual-motor coordination. The results of this work demonstrate that purely curiosity-based signals, with no external supervision, can drive coordinated multimodal learning, mirroring an infant's progression from random motor babbling to purposeful behaviors.
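
The snippet below sketches one way a coverage-style intrinsic reward for self-touch could be computed, with rarely touched body regions yielding larger bonuses. The binarized tactile vector, count-based novelty term, and decay constant are illustrative assumptions rather than the BabyBench implementation.

```python
import numpy as np

# Coverage-style intrinsic reward for self-touch: touching rarely visited
# body regions yields larger bonuses, encouraging broad body coverage.
# Region indexing and the decay constant are illustrative assumptions.

class SelfTouchReward:
    def __init__(self, num_regions: int, decay: float = 0.99):
        self.visit_counts = np.zeros(num_regions)  # how often each region was touched
        self.decay = decay                          # slow forgetting keeps exploration alive

    def __call__(self, tactile: np.ndarray) -> float:
        # tactile: binary vector, 1 where a body region registers contact this step.
        contacts = tactile > 0
        # Count-based novelty: rarely touched regions give larger rewards.
        novelty = 1.0 / np.sqrt(1.0 + self.visit_counts[contacts])
        self.visit_counts *= self.decay
        self.visit_counts[contacts] += 1.0
        return float(novelty.sum())
```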


BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Wang, Shengao, Chandra, Arjun, Liu, Aoming, Saligrama, Venkatesh, Gong, Boqing

arXiv.org Artificial Intelligence

Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned: they are too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, more diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or on general-purpose data matched in size to SAYCam. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
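
To make the idea of "child-directed transformations" concrete, the sketch below shortens an existing caption and wraps it in a child-directed carrier phrase. The templates and truncation rule are hypothetical stand-ins for whatever transformations the BabyVLM pipeline actually applies.

```python
import random

# Toy child-directed transformation over existing captions. The carrier
# phrases and naive truncation are illustrative assumptions, not the
# paper's actual data-construction procedure.

TEMPLATES = [
    "Look, {phrase}!",
    "Do you see {phrase}?",
    "That's {phrase}.",
    "Wow, {phrase}!",
]

def child_directed(caption: str, max_words: int = 6) -> str:
    """Shorten a caption and wrap it in a child-directed carrier phrase."""
    phrase = " ".join(caption.lower().strip().rstrip(".").split()[:max_words])
    return random.choice(TEMPLATES).format(phrase=phrase)

# Example (output varies with the sampled template):
# child_directed("A brown dog catching a frisbee in the park")
# -> "Look, a brown dog catching a frisbee!"
```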