Word Discovery in Visually Grounded, Self-Supervised Speech Models

Open in new window