Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
arXiv.org Artificial Intelligence
This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. It focuses on two main research questions: (1) is keyword localisation possible with VGS models, and (2) can keyword localisation be done cross-lingually in a real low-resource setting? Four localisation methods are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset containing spoken captions in the Yoruba language is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword localisation in low-resource settings.
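A common way to turn a keyword *detector* into a *localiser* is to take the model's per-frame keyword scores (before they are pooled into a single utterance-level detection score) and report the time of the highest-scoring frame. The sketch below illustrates that idea only; the function name, the score array, and the frame rate are hypothetical assumptions, not the paper's actual interface or any of its four proposed methods.

```python
import numpy as np

def localise_keyword(frame_scores, frame_rate_hz=100.0):
    """Locate a keyword from per-frame detection scores.

    frame_scores: sequence of scores, one per speech frame, for a
        single keyword (illustrative; a real VGS model would produce
        these from its output layer before temporal pooling).
    frame_rate_hz: frames per second of the score sequence (assumed).

    Returns (time_seconds, score) of the highest-scoring frame.
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    best = int(np.argmax(frame_scores))          # index of the peak frame
    return best / frame_rate_hz, float(frame_scores[best])

# Toy usage: scores peak at frame index 3, i.e. 0.03 s at 100 frames/s.
t, s = localise_keyword([0.1, 0.2, 0.4, 0.9, 0.3], frame_rate_hz=100.0)
```

The localisation is then judged correct if the predicted time falls within the keyword's ground-truth interval (obtained, for an evaluation set, from forced alignments).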
Feb-1-2023
- Country:
- Africa
- Asia > China (0.04)
- North America > United States
- New York (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (1.00)
- Leisure & Entertainment > Sports (1.00)
- Media (0.67)
- Technology:
- Information Technology
- Artificial Intelligence
- Cognitive Science (0.92)
- Machine Learning
- Learning Graphical Models (0.67)
- Neural Networks > Deep Learning (1.00)
- Statistical Learning (1.00)
- Natural Language
- Information Retrieval (1.00)
- Machine Translation (0.67)
- Representation & Reasoning (1.00)
- Speech > Speech Recognition (1.00)
- Vision (1.00)
- Communications
- Networks (0.67)
- Social Media (1.00)
- Information Management (0.67)
- Sensing and Signal Processing > Image Processing (1.00)