long-form recording
BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings
Charlot, Théo; Kunze, Tarek; Poli, Maxime; Cristia, Alejandrina; Dupoux, Emmanuel; Lavechin, Marvin
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly on these recordings because of acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation model trained on 13,000 hours of multilingual child-centered long-form recordings spanning over 40 languages. We evaluate BabyHuBERT on speaker segmentation, that is, identifying when target children speak versus female adults, male adults, or other children, a fundamental preprocessing step for analyzing naturalistic language experiences. BabyHuBERT achieves F1-scores from 52.1% to 74.4% across six diverse datasets, consistently outperforming W2V2-LL4300 (trained on English long-forms) and standard HuBERT (trained on clean adult speech). Notable improvements include 13.2 absolute F1 points over HuBERT on the Vanuatu corpus and 15.9 points on the Solomon Islands corpus, demonstrating effectiveness on underrepresented languages. By sharing code and models, BabyHuBERT serves as a foundation model for child speech research, enabling fine-tuning on diverse downstream tasks.
- Oceania > Solomon Islands (0.25)
- Oceania > Vanuatu (0.24)
- South America > Bolivia (0.05)
Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
Kunze, Tarek; Métais, Marianne; Titeux, Hadrien; Elbert, Lucas; Coffey, Joseph; Dupoux, Emmanuel; Cristia, Alejandrina; Lavechin, Marvin
Recordings gathered with child-worn devices promised to revolutionize both fundamental and applied speech sciences by allowing the effortless capture of children's naturalistic speech environment and language production. This promise hinges on speech technologies that can transform the sheer mounds of data thus collected into usable information. This paper demonstrates several obstacles blocking progress by summarizing three years' worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search yield only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
- North America > United States > Massachusetts (0.04)
- Europe > United Kingdom (0.04)
- Europe > France (0.04)
- Africa > Senegal (0.04)