
Collaborating Authors

 Abdelaziz, Ahmed Hussen


Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

arXiv.org Artificial Intelligence

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, the multimodal LLM with FLoRA achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while tuning only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with a 20% lower EER and a 56% lower false accept rate. The proposed approach scales well across model sizes from 16M to 3B parameters.
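A minimal sketch of the core idea, assuming a PyTorch-style implementation: a frozen pre-trained linear layer is augmented with one low-rank adapter per modality, and adapter dropout randomly skips a whole modality during training. The class, parameter, and modality names are illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch of Fusion Low Rank Adaptation (FLoRA) with adapter dropout.
import torch
import torch.nn as nn


class FusionLoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus one low-rank adapter per modality."""

    def __init__(self, base: nn.Linear, modalities, rank=8, adapter_dropout=0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.adapter_dropout = adapter_dropout
        self.down = nn.ModuleDict(
            {m: nn.Linear(base.in_features, rank, bias=False) for m in modalities}
        )
        self.up = nn.ModuleDict(
            {m: nn.Linear(rank, base.out_features, bias=False) for m in modalities}
        )
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)  # adapters start as a no-op

    def forward(self, x, active_modalities=None):
        if active_modalities is None:
            active_modalities = list(self.down.keys())
        y = self.base(x)
        for m in active_modalities:
            # Adapter dropout: occasionally skip a whole modality during training
            # so the model stays robust when that modality is missing at test time.
            if self.training and torch.rand(()) < self.adapter_dropout:
                continue
            y = y + self.up[m](self.down[m](x))
        return y


layer = FusionLoRALinear(nn.Linear(512, 512), ["audio", "video"])
out = layer(torch.randn(4, 512), active_modalities=["audio"])  # video missing
```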


Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

arXiv.org Artificial Intelligence

Voice activity detection (VAD) is a critical component in applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, effective personalized VAD systems have become essential. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive evaluation approach that incorporates performance metrics such as frame-level and utterance-level error rates, detection latency, and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications across a comprehensive set of metrics.
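As an illustration of two of the metrics mentioned above, the sketch below computes a frame-level error rate and a detection latency from binary frame labels. The function names and the 10 ms frame hop are assumptions for the example, not the paper's evaluation code.

```python
# Sketch of frame-level error rate and detection latency for a (P)VAD system.
import numpy as np


def frame_error_rate(pred: np.ndarray, ref: np.ndarray) -> float:
    """Fraction of frames whose predicted label differs from the reference."""
    return float(np.mean(pred != ref))


def detection_latency(pred: np.ndarray, ref: np.ndarray, hop_ms: float = 10.0):
    """Delay (ms) between the first reference speech frame and the first
    predicted speech frame; returns None if either stream has no speech."""
    ref_on = np.flatnonzero(ref == 1)
    pred_on = np.flatnonzero(pred == 1)
    if ref_on.size == 0 or pred_on.size == 0:
        return None
    return (int(pred_on[0]) - int(ref_on[0])) * hop_ms


ref = np.array([0, 0, 1, 1, 1, 1, 0, 0])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])
print(frame_error_rate(pred, ref))   # 0.25
print(detection_latency(pred, ref))  # 10.0 ms
```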


ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

arXiv.org Artificial Intelligence

This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to the recent SKA-TDNN, and the modularized architecture design allows variants to be developed easily. We also aspire to bridge the developed models with other domains, enabling the broader research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner, and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is integration with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
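For context on the headline number, the sketch below shows how an equal error rate such as the 0.39% on Vox1-O is commonly computed from speaker-verification trial scores. This is generic evaluation logic, not ESPnet-SPK's own implementation.

```python
# Sketch of equal error rate (EER) computation from verification trial scores.
import numpy as np


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for same-speaker trials, 0 for different-speaker trials."""
    order = np.argsort(scores)[::-1]       # sweep thresholds from high to low
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    fnr = 1.0 - tp / labels.sum()          # miss rate at each threshold
    fpr = fp / (1 - labels).sum()          # false-alarm rate at each threshold
    idx = np.argmin(np.abs(fnr - fpr))     # point where the two rates cross
    return float((fnr[idx] + fpr[idx]) / 2)


scores = np.array([0.92, 0.85, 0.40, 0.30, 0.88, 0.35])
labels = np.array([1, 1, 0, 0, 1, 0])
print(equal_error_rate(scores, labels))  # 0.0 for this perfectly separable toy set
```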


Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

arXiv.org Artificial Intelligence

Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant and side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text, and/or automatic speech recognition (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes that make DDSD systems more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a fixed operating point via non-linear intermediate fusion, while our modality dropout techniques improve the performance of these models by 7.4% in terms of FA when modalities are missing at inference time.
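A hedged sketch of non-linear intermediate fusion with modality dropout, assuming per-modality embeddings and a PyTorch-style model; the embedding dimensions, dropout probability, and class name are illustrative, not the paper's architecture.

```python
# Sketch: concatenate per-modality embeddings, zero out a modality at random
# during training (modality dropout), then classify device-directedness.
import torch
import torch.nn as nn


class IntermediateFusionDDSD(nn.Module):
    def __init__(self, dims, hidden=128, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.classifier = nn.Sequential(
            nn.Linear(sum(dims.values()), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: device-directed vs. other speech
        )

    def forward(self, embeddings):
        parts = []
        for emb in embeddings.values():
            # Modality dropout: occasionally zero an entire modality during
            # training so the model learns to cope with it missing at test time.
            if self.training and torch.rand(()) < self.p_drop:
                emb = torch.zeros_like(emb)
            parts.append(emb)
        return self.classifier(torch.cat(parts, dim=-1))


model = IntermediateFusionDDSD({"acoustic": 256, "text": 256, "prosody": 64})
logit = model({"acoustic": torch.randn(4, 256),
               "text": torch.randn(4, 256),
               "prosody": torch.randn(4, 64)})
```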


Modality Dropout for Improved Performance-driven Talking Faces

arXiv.org Machine Learning

We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real time on resource-limited hardware (e.g., a smartphone), is user-agnostic, and does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences, compared with only 18% for video-driven animation. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, while preference for video-only decreases to 8%.
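The batch-level modality dropout described above can be sketched as follows; the drop probabilities, feature dimensions, and helper name are assumptions for illustration, not the paper's training code.

```python
# Sketch: mark each training batch as audio-only, video-only, or audiovisual,
# zeroing the unused modality so the model learns to handle all three cases.
import random
import numpy as np


def make_training_batch(audio_feats, video_feats,
                        p_audio_only=0.25, p_video_only=0.25):
    """Return (audio, video, condition) with one modality possibly dropped."""
    r = random.random()
    if r < p_audio_only:
        condition = "audio_only"
        video_feats = np.zeros_like(video_feats)
    elif r < p_audio_only + p_video_only:
        condition = "video_only"
        audio_feats = np.zeros_like(audio_feats)
    else:
        condition = "audiovisual"
    return audio_feats, video_feats, condition


audio, video, cond = make_training_batch(np.random.randn(32, 40),
                                         np.random.randn(32, 136))
```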


Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

arXiv.org Machine Learning

Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). A limitation, however, is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem. We train the AM on ten thousand hours of audio-only data and then adapt it to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. Using a subjective assessment test, we compare the performance of the AM-initialized DNN to one with a random initialization. The results show that viewers significantly prefer animations generated by the AM-initialized DNN over those generated by the randomly initialized model. We conclude that visual speech synthesis can benefit significantly from the powerful representation of speech in ASR acoustic models.
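A minimal sketch of the adaptation idea, assuming a feed-forward acoustic model: keep the pre-trained hidden layers, replace the ASR output layer with a regression head over animation controls, and fine-tune on the audio-visual data. The layer sizes, senone count, and number of animation controls below are assumptions, not the paper's configuration.

```python
# Sketch: reuse a pre-trained ASR acoustic model as the initialization for a
# visual speech synthesis regressor.
import torch.nn as nn

# Stand-in for the acoustic model trained on ~10k hours of audio-only data
# (here randomly initialized; in practice its weights would be loaded).
acoustic_model = nn.Sequential(
    nn.Linear(80, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 9000),               # senone posteriors for ASR
)

# Keep every layer except the ASR output, then add a regression head that
# predicts the face model's animation controls (dimension is an assumption).
n_animation_controls = 50
visual_speech_net = nn.Sequential(
    *list(acoustic_model.children())[:-1],
    nn.Linear(1024, n_animation_controls),
)

# The adapted network would then be fine-tuned on the ~90 h of synchronized
# audio-visual speech with a regression loss against reference controls.
criterion = nn.MSELoss()
```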