Akshay Deepak Department of Computer Science National Institute of Technology Patna, India Email: firstname.lastname@example.org Abstract --Convolutional neural networks (CNN) are widely used for speech emotion recognition (SER). In such cases, the short time fourier transform (STFT) spectrogram is the most popular choice for representing speech, which is fed as input to the CNN. However, the uncertainty principles of the short-time Fourier transform prevent it from capturing time and frequency resolutions simultaneously. On the other hand, the recently proposed single frequency filtering (SFF) spectrogram promises to be a better alternative because it captures both time and frequency resolutions simultaneously. In this work, we explore the SFF spectrogram as an alternative representation of speech for SER. We have modified the SFF spectrogram by taking the average of the amplitudes of all the samples between two successive glottal closure instants (GCI) locations. The duration between two successive GCI locations gives the pitch, motivating us to name the modified SFF spectrogram as pitch-synchronous SFF spectrogram. The GCI locations were detected using zero frequency filtering approach. The proposed pitch-synchronous SFF spectrogram produced accuracy values of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset. These correspond to an improvement of 7.35% (unweighted) and 4.3% (weighted) over state-of-the-art result on the STFT sepctrogram using CNN. Specially, the proposed method recognized 22.7% of the happy emotion samples correctly, whereas this number was 0% for state-of-the-art results. These results also promise a much wider use of the proposed pitch-synchronous SFF spectrogram for other speech-based applications. I NTRODUCTION S peech emotion recognition (SER) refers to the classification/recognition of the person's emotional state using the speech signal. SER has a lot of applications in real life.
Over the past decade or so, advances in machine learning have paved the way for the development of increasingly advanced speech recognition tools. By analyzing audio files of human speech, these tools can learn to identify words and phrases in different languages, converting them into a machine-readable format. While several machine learning-based models have achieved promising results on speech recognition tasks, they do not always perform well in all languages. For instance, when a language has a vocabulary with many similar-sounding words, the performance of speech recognition systems can decline considerably. Researchers at Mahatma Gandhi Mission's College of Engineering & Technology and Jaypee Institute of Information Technology, in India, have developed a speech recognition system to tackle this problem.
Language is a valuable source of clinical information in Alzheimer's Disease, as it declines concurrently with neurodegeneration. Consequently, speech and language data have been extensively studied in connection with its diagnosis. This paper summarises current findings on the use of artificial intelligence, speech and language processing to predict cognitive decline in the context of Alzheimer's Disease, detailing current research procedures, highlighting their limitations and suggesting strategies to address them. We conducted a systematic review of original research between 2000 and 2019, registered in PROSPERO (reference CRD42018116606). An interdisciplinary search covered six databases on engineering (ACM and IEEE), psychology (PsycINFO), medicine (PubMed and Embase) and Web of Science. Bibliographies of relevant papers were screened until December 2019. From 3,654 search results 51 articles were selected against the eligibility criteria. Four tables summarise their findings: study details (aim, population, interventions, comparisons, methods and outcomes), data details (size, type, modalities, annotation, balance, availability and language of study), methodology (pre-processing, feature generation, machine learning, evaluation and results) and clinical applicability (research implications, clinical potential, risk of bias and strengths/limitations). While promising results are reported across nearly all 51 studies, very few have been implemented in clinical research or practice. We concluded that the main limitations of the field are poor standardisation, limited comparability of results, and a degree of disconnect between study aims and clinical applications. Attempts to close these gaps should support translation of future research into clinical practice.
When the available data of a target speaker is insufficient to train a high quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers and train a multi-speaker TTS model instead. Many studies have shown that neural multi-speaker TTS model trained with a small amount data from multiple speakers combined can generate synthetic speech with better quality and stability than a speaker-dependent one. However when the amount of data from each speaker is highly unbalanced, the best approach to make use of the excessive data remains unknown. Our experiments showed that simply combining all available data from every speaker to train a multi-speaker model produces better than or at least similar performance to its speaker-dependent counterpart. Moreover by using an ensemble multi-speaker model, in which each subsystem is trained on a subset of available data, we can further improve the quality of the synthetic speech especially for underrepresented speakers whose training data is limited.
Bengaluru-based food delivery major Swiggy is looking to incorporate artificial intelligence (AI) -driven speech recognition models for its call centre process. The company has partnered with BPO and social enterprise IndiVillage to power the platform's broader AI and machine learning (ML) charter. This engagement will also include voice annotation work that provides training data for Swiggy's ML algorithms. Swiggy, in its press statement, explained that there is a need to efficiently extract information from the call data when a call centre service executives move from one call to another. This will help the executive understand the'voice of the customers' to enable a deeper understanding of the issues faced by customers and accordingly solve for the same.