Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
