
Collaborating Authors

 Luque, Jordi


Word Sense Disambiguation in Native Spanish: A Comprehensive Lexical Evaluation Resource

arXiv.org Artificial Intelligence

Human language, while aimed at conveying meaning, inherently carries ambiguity. It poses challenges for speech and language processing, but also serves crucial communicative functions. Resolving ambiguity efficiently is both a desired and a necessary capability. The lexical meaning of a word in context can be determined automatically by Word Sense Disambiguation (WSD) algorithms that rely on external knowledge, which is often limited and biased toward English. When adapting content to other languages, automated translations are frequently inaccurate, and a high degree of expert human validation is necessary to ensure both accuracy and understanding. The current study addresses these limitations by introducing a new resource for Spanish WSD. It includes a sense inventory and a lexical dataset sourced from the Diccionario de la Lengua Española, which is maintained by the Real Academia Española. We also review current resources for Spanish and report metrics on them obtained with a state-of-the-art system.
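A minimal sketch of how a sense inventory and an annotated lexical dataset of this kind might be used to score a most-frequent-sense baseline. The data format, sense labels, and examples below are illustrative assumptions, not the resource's actual schema.

```python
# Minimal sketch of evaluating a WSD baseline against a sense inventory.
# The inventory/dataset formats below are hypothetical, not the paper's schema.

from collections import Counter

# Hypothetical sense inventory: lemma -> list of sense identifiers.
sense_inventory = {
    "banco": ["banco.1 (institución financiera)", "banco.2 (asiento)"],
    "gato": ["gato.1 (animal)", "gato.2 (herramienta)"],
}

# Hypothetical annotated examples: (lemma, context, gold sense).
examples = [
    ("banco", "me senté en el banco del parque", "banco.2 (asiento)"),
    ("banco", "fui al banco a sacar dinero", "banco.1 (institución financiera)"),
    ("gato", "el gato duerme en el sofá", "gato.1 (animal)"),
]

def most_frequent_sense(lemma, train_counts):
    """Back-off baseline: pick the sense seen most often in training,
    else the first sense listed in the inventory."""
    counts = train_counts.get(lemma)
    if counts:
        return counts.most_common(1)[0][0]
    return sense_inventory[lemma][0]

# "Train" counts computed from the same toy examples, just to keep this runnable.
train_counts = {}
for lemma, _, sense in examples:
    train_counts.setdefault(lemma, Counter())[sense] += 1

correct = sum(
    most_frequent_sense(lemma, train_counts) == gold
    for lemma, _, gold in examples
)
print(f"accuracy: {correct / len(examples):.2f}")
```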


Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles

arXiv.org Artificial Intelligence

Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data with temporal alignments and using a two-stage, multi-resolution detection scheme. It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side, which is an ensemble of heterogeneous architectures that refines the detection. This scheme allows the optimization of two operating points. To protect privacy, audio features are sent to the cloud instead of raw audio. The study investigated different parametric configurations for feature extraction in order to select one for on-device detection and another for the verification model. Furthermore, thirteen different audio classifiers were compared in terms of performance and inference time. The proposed ensemble outperforms our strongest classifier in every noise condition.
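A minimal sketch of the two-stage idea: a cheap on-device score gates a heavier server-side ensemble, and only features (never raw audio) leave the device. The models, thresholds, and feature extractor are placeholders assumed for illustration, not the paper's actual system.

```python
# Two-stage wake-up word cascade sketch: on-device detector gates a server-side
# ensemble. All scorers are toy stand-ins for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def extract_features(audio):
    """Stand-in for on-device feature extraction (e.g., log-mel frames)."""
    return audio.reshape(-1, 40)  # pretend 40-dim frames

def on_device_score(features):
    """Lightweight first-stage detector (placeholder)."""
    return float(1 / (1 + np.exp(-features.mean())))

def server_ensemble_score(features, n_models=3):
    """Second-stage verification: average the scores of heterogeneous models
    (here just perturbed copies of the same placeholder scorer)."""
    scores = [1 / (1 + np.exp(-(features.mean() + rng.normal(0, 0.1))))
              for _ in range(n_models)]
    return float(np.mean(scores))

ON_DEVICE_THRESHOLD = 0.5   # first operating point: high recall, low latency
SERVER_THRESHOLD = 0.7      # second operating point: high precision

audio = rng.normal(size=(100, 40)).ravel()
feats = extract_features(audio)

if on_device_score(feats) >= ON_DEVICE_THRESHOLD:
    # Only features (never raw audio) are sent to the cloud.
    accepted = server_ensemble_score(feats) >= SERVER_THRESHOLD
    print("wake-up confirmed" if accepted else "rejected by verification stage")
else:
    print("rejected on device")
```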


Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning

arXiv.org Artificial Intelligence

Many real-time applications (e.g., Augmented/Virtual Reality, cognitive assistance) rely on Deep Neural Networks (DNNs) to process inference tasks. Edge computing is considered a key infrastructure to deploy such applications, as moving computation close to the data sources enables us to meet stringent latency and throughput requirements. However, the constrained nature of edge networks poses several additional challenges to the management of inference workloads: edge clusters cannot provide unlimited processing power to DNN models, and a trade-off between network and processing time must often be considered when it comes to end-to-end delay requirements. In this paper, we focus on the problem of scheduling inference queries on DNN models in edge networks at short timescales (i.e., a few milliseconds). By means of simulations, we analyze several policies under the realistic network settings and workloads of a large ISP, highlighting the need for a dynamic scheduling policy that can adapt to network conditions and workloads. We therefore design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions. Our results show that ASET effectively provides the best performance compared to static policies when scheduling over a distributed pool of edge resources.
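A minimal sketch of RL-based query scheduling over edge clusters, assuming a toy epsilon-greedy Q-learning agent that picks a cluster per query and is rewarded when the end-to-end delay meets a deadline. This is a stand-in for the general idea, not ASET itself; all delays, budgets, and the state encoding are invented for illustration.

```python
# Toy RL scheduler: choose an edge cluster per inference query so that
# network delay + queueing + processing time stays within a deadline.

import random

N_CLUSTERS = 3
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.9
DEADLINE_MS = 50.0

# Hypothetical per-cluster network delay and per-query processing time (ms).
net_delay = [5.0, 15.0, 30.0]
proc_time = [40.0, 25.0, 10.0]
queue = [0.0, 0.0, 0.0]          # outstanding work per cluster (ms)

def state():
    # Coarse state: which cluster currently has the shortest queue.
    return min(range(N_CLUSTERS), key=lambda c: queue[c])

Q = [[0.0] * N_CLUSTERS for _ in range(N_CLUSTERS)]

for step in range(1000):
    s = state()
    a = random.randrange(N_CLUSTERS) if random.random() < EPSILON \
        else max(range(N_CLUSTERS), key=lambda c: Q[s][c])

    delay = net_delay[a] + queue[a] + proc_time[a]
    reward = 1.0 if delay <= DEADLINE_MS else -1.0

    # Queue dynamics: add this query's work, then drain a fixed budget per step.
    queue[a] += proc_time[a]
    queue[:] = [max(0.0, q - 20.0) for q in queue]

    s2 = state()
    Q[s][a] += ALPHA * (reward + GAMMA * max(Q[s2]) - Q[s][a])

print("learned Q-values per (state, cluster):", Q)
```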


Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

arXiv.org Artificial Intelligence

High-quality data labeling from specific domains is costly and time-consuming for humans. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV. The process is repeated, reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, like TV captions, are filtered by confidence score and are ready for further ASR adaptation. The obtained results, on both the Spanish RTVE2022 and CommonVoice databases, underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation, and semi-supervised training of end-to-end ASR.
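A minimal sketch of the iterative loop described above: anchor on the end of the last confidently aligned utterance, then either shrink the candidate text or widen the audio window until the confidence clears a threshold. The ctc_align() helper below is a hypothetical placeholder, not the paper's implementation, and the thresholds are illustrative.

```python
# Iterative pseudo-forced alignment sketch with confidence-based anchoring.

CONF_THRESHOLD = 0.8
MAX_RETRIES = 5

def ctc_align(posteriors, text, start_frame, window_frames):
    """Placeholder: align `text` against frame-wise character posteriors in
    [start_frame, start_frame + window_frames) and return (end_frame, confidence)."""
    end_frame = min(start_frame + window_frames, len(posteriors))
    return end_frame, 0.9  # dummy values just to keep the sketch runnable

def align_captions(posteriors, captions, window_frames=500):
    anchor = 0                      # temporal anchor = end of last good alignment
    aligned = []
    for caption in captions:
        text, window = caption, window_frames
        for _ in range(MAX_RETRIES):
            end, conf = ctc_align(posteriors, text, anchor, window)
            if conf >= CONF_THRESHOLD:
                aligned.append((anchor, end, text, conf))
                anchor = end        # next utterance starts where this one ended
                break
            # Low confidence: try less text, or look at a wider audio window.
            if len(text.split()) > 1:
                text = " ".join(text.split()[:-1])
            else:
                window *= 2
    # Low-confidence segments are simply dropped, so no human revision is needed.
    return aligned

posteriors = [[0.0] * 30 for _ in range(2000)]       # dummy frame-wise posteriors
print(align_captions(posteriors, ["hola buenos dias", "el tiempo para hoy"]))
```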


Data Augmentation for Low-Resource Quechua ASR Improvement

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) is a key element in new services that help users interact with an automated system. Deep learning methods have made it possible to deploy systems with word error rates below 5% for English ASR. However, these methods are only available for languages with hundreds or thousands of hours of audio and their corresponding transcriptions. For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated to speed up the availability of resources that can improve the performance of their ASR systems. In this paper we describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages. We carry out experiments developing an ASR for Quechua using the wav2letter++ model. Our approach reduced the WER of the base model by 8.73%. The resulting ASR model obtained 22.75% WER and was trained with 99 hours of original resources and 99 hours of synthetic data obtained with a combination of text augmentation and synthetic speech generation.
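A minimal sketch of the augmentation pipeline: augment transcripts, synthesize speech for them, and mix the synthetic pairs with the original corpus into one training manifest. All helpers, file names, and the toy transcripts below are hypothetical placeholders, not the paper's pipeline.

```python
# Data augmentation sketch: original corpus + (text augmentation -> TTS stub)
# combined into a single training manifest.

import random

def augment_text(sentence):
    """Placeholder text augmentation: here, a trivial word shuffle."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def synthesize_speech(sentence, out_path):
    """Placeholder for a TTS system producing an audio file for `sentence`."""
    with open(out_path, "w") as f:       # stand-in for writing a wav file
        f.write(sentence)
    return out_path

# Illustrative (audio, transcript) pairs standing in for the original corpus.
original_corpus = [
    ("orig_000.wav", "allinllachu kachkanki"),
    ("orig_001.wav", "paykuna wasipi kanku"),
]

manifest = list(original_corpus)
for i, (_, transcript) in enumerate(original_corpus):
    new_text = augment_text(transcript)
    wav = synthesize_speech(new_text, f"synt_{i:03d}.wav")
    manifest.append((wav, new_text))

# The combined manifest (original + synthetic) would then be used for training,
# e.g. with wav2letter++.
for wav, text in manifest:
    print(wav, "|", text)
```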


Convolutional Speech Recognition with Pitch and Voice Quality Features

arXiv.org Artificial Intelligence

The effects of adding pitch and voice quality features such as jitter and shimmer to a state-of-the-art CNN model for Automatic Speech Recognition are studied in this work. Pitch features have previously been used to improve classical HMM and DNN baselines, while jitter and shimmer parameters have proven useful for tasks like speaker or emotion recognition. To the best of our knowledge, this is the first work combining such pitch and voice quality features with modern convolutional architectures, showing improvements of up to 2% absolute WER on the publicly available Spanish Common Voice dataset. In particular, our work combines these features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs). Such models have been shown to yield small word error rates while being very suitable for parallel processing in online streaming recognition use cases. We have added pitch and voice quality functionality to Facebook's wav2letter speech recognition framework, and we provide the code and recipes to the community to carry out further experiments. Besides, to the best of our knowledge, our Spanish Common Voice recipe is the first public Spanish recipe for wav2letter.
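A minimal sketch of the feature combination in the spirit of the paper: compute MFSCs (log mel spectrogram) and a pitch track, then stack them frame-wise. Jitter and shimmer are omitted here (they typically come from a voice-analysis tool such as Praat), and all parameter values are illustrative, not the recipe's actual settings.

```python
# Stack a pitch track on top of log mel features, frame by frame.

import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)          # synthetic 220 Hz tone as input

# MFSCs: log mel spectrogram (40 bands).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel)                # shape: (40, n_frames)

# Pitch track with probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, frame_length=2048, hop_length=512)
f0 = np.nan_to_num(f0)                            # zero out unvoiced frames

# Append pitch (and voicing probability) as extra rows of the feature matrix.
n = min(log_mel.shape[1], f0.shape[0])
features = np.vstack([log_mel[:, :n], f0[None, :n], voiced_prob[None, :n]])
print(features.shape)                             # (42, n_frames)
```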


Input complexity and out-of-distribution detection with likelihood-based generative models

arXiv.org Machine Learning

On the other side of the spectrum, we observe that the data set with the lowest log-likelihood is Noise, a data set of uniform random images, followed by TrafficSign and TinyImageNet, both featuring colorful images with nontrivial backgrounds. Such an ordering is perhaps clearer when looking at the average log-likelihood of each data set (Appendix D). If we think about the visual complexity of the images in those data sets, it would seem that log-likelihoods tend to grow as images become simpler and carry less information or content. To further confirm the previous observation, we design a controlled experiment where we can set different decreasing levels of image complexity. We train a generative model with some data set, as before, but now compute likelihoods of progressively simpler inputs. Such inputs are obtained by average-pooling the uniform random Noise images by factors of 1, 2, 4, 8, 16, and 32, and re-scaling the images back to their original size by nearest-neighbor up-sampling. Intuitively, a noise image with a pooling size of 1 (no pooling) has the highest complexity, while a noise image with a pooling of 32 (a constant-color image) has the lowest complexity.
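A minimal sketch of the controlled complexity experiment described above: average-pool uniform noise images by increasing factors and upsample back with nearest neighbor, so a factor of 1 keeps full complexity and 32 yields a constant-color image. The image size and the stand-in complexity measure are illustrative assumptions; the paper scores the pooled images with a trained likelihood-based generative model instead.

```python
# Generate uniform noise, average-pool by factors 1..32, and upsample back
# with nearest neighbor to produce progressively simpler inputs.

import numpy as np

rng = np.random.default_rng(0)
size = 32                                   # e.g. CIFAR-sized images

noise = rng.uniform(0.0, 1.0, size=(size, size, 3))

def pool_and_upsample(img, k):
    """Average-pool by factor k, then nearest-neighbor upsample to original size."""
    h, w, c = img.shape
    pooled = img.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))
    return pooled.repeat(k, axis=0).repeat(k, axis=1)

for k in (1, 2, 4, 8, 16, 32):
    simplified = pool_and_upsample(noise, k)
    # A trained generative model would compute log p(x) here; we just report
    # the per-image standard deviation as a crude proxy for remaining complexity.
    print(f"pooling {k:2d}: std = {simplified.std():.4f}")
```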


How Much Does Audio Matter to Recognize Egocentric Object Interactions?

arXiv.org Artificial Intelligence

Sounds are an important source of information about our daily interactions with objects. For instance, a significant number of people can discern the temperature of water being poured just by using their sense of hearing. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on the parts of the problem (noun, verb, and action classification). Our model achieves a competitive result in terms of verb classification (34.26% accuracy) on a standard benchmark with respect to vision-based state-of-the-art systems, using a comparatively lighter architecture.