
Collaborating Authors: Aldarmaki, Hanan


JEEM: Vision-Language Understanding in Four Arabic Dialects

arXiv.org Artificial Intelligence

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.


Infant Cry Detection Using Causal Temporal Representation

arXiv.org Artificial Intelligence

Identifying relevant audio features in domestic environments is challenging due to diverse background sounds and the limited availability of high-quality annotated data for specific cases like baby cries. We address this issue through manual annotation and data augmentation techniques, improving baby cry analysis models by reducing noise during cry interval extraction. In addition, as the acquisition of annotated data is both costly and challenging, we propose a viable alternative using unsupervised methods to detect infant cry segment boundaries by approximating the underlying data-generating process.

Caring for newborns, especially for first-time parents, is a significant challenge. One of the main difficulties is understanding the meaning of infant cries. In response, numerous studies have emerged to address this problem. Early research showed that trained adult listeners could differentiate between types of cries. For example, [1] first identified four types of cries (pain, hunger, birth, and pleasure) by training nurses to recognize them. However, at best, the accuracy of trained nurses is only up to 33.09%. Beyond recognizing infants' daily needs, disease prediction is another critical task in infant cry analysis.
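For context on the boundary-detection task described above, the sketch below shows a generic unsupervised baseline that segments candidate cry intervals by thresholding short-time energy. This is not the causal temporal method proposed in the paper; the frame size, hop, and threshold are arbitrary assumptions chosen purely for illustration.

```python
# Generic energy-threshold segmentation baseline for cry intervals.
# NOT the paper's causal temporal method; all parameters are arbitrary assumptions.
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy_db = np.array([
        10 * np.log10(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-10)
        for i in range(n_frames)
    ])
    active = energy_db > threshold_db                         # frames likely containing a cry
    boundaries = np.flatnonzero(np.diff(active.astype(int)))  # indices where activity flips
    return [b * hop / sr for b in boundaries]                 # boundary times in seconds

sr = 16000
audio = np.random.randn(sr * 5) * 0.01                        # placeholder signal (5 s of low-level noise)
print(segment_by_energy(audio, sr))
```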


SparQLe: Speech Queries to Text Translation Through LLMs

arXiv.org Artificial Intelligence

With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. The proposed approach employs a modality adapter to align extracted speech features with instruction-tuned LLMs using English-language data. Our experiments demonstrate that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for various speech understanding applications.
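As a rough illustration of the kind of modality adapter described above, the sketch below projects frame-level self-supervised speech features into an LLM's embedding space and prepends them to the embedded instruction tokens. The module sizes, feature dimensions, and two-layer design are assumptions, not the SparQLe architecture.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Maps self-supervised speech features to the LLM embedding space.
    Sizes are illustrative assumptions, not the paper's configuration."""
    def __init__(self, speech_dim=768, llm_dim=4096, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(speech_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, speech_feats):             # (batch, frames, speech_dim)
        return self.net(speech_feats)            # (batch, frames, llm_dim)

# Hypothetical usage: prepend adapted speech features to embedded instruction tokens
adapter = ModalityAdapter()
speech_feats = torch.randn(2, 120, 768)          # e.g. HuBERT / wav2vec 2.0 outputs
instr_embeds = torch.randn(2, 32, 4096)          # instruction tokens already embedded by the LLM
llm_inputs = torch.cat([adapter(speech_feats), instr_embeds], dim=1)
# llm_inputs would then be passed to the (frozen) instruction-tuned LLM as inputs_embeds
```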


Dialectal Coverage And Generalization in Arabic Speech Recognition

arXiv.org Artificial Intelligence

Developing robust automatic speech recognition (ASR) systems for Arabic, a language characterized by its rich dialectal diversity and often considered a low-resource language in speech technology, demands effective strategies to manage its complexity. This study explores three critical factors influencing ASR performance: the role of dialectal coverage in pre-training, the effectiveness of dialect-specific fine-tuning compared to a multi-dialectal approach, and the ability to generalize to unseen dialects. Through extensive experiments across different dialect combinations, our findings offer key insights towards advancing the development of ASR systems for pluricentric languages like Arabic.


STTATTS: Unified Speech-To-Text And Text-To-Speech Model

arXiv.org Artificial Intelligence

Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs ($\sim$50\% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to a shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research.
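The parameter-sharing idea can be pictured with a toy sketch: a single shared Transformer backbone with lightweight task-specific input and output projections, trained with a summed multi-task loss. This is only an illustration of joint ASR/TTS training with shared weights; the dimensions, mel-spectrogram targets, and equal loss weighting are assumptions rather than the STTATTS configuration.

```python
import torch
import torch.nn as nn

class SharedSpeechTextModel(nn.Module):
    """Toy joint ASR/TTS model: a shared Transformer backbone with small
    task-specific heads. Dimensions and losses are illustrative assumptions."""
    def __init__(self, vocab=5000, n_mels=80, d_model=256):
        super().__init__()
        self.backbone = nn.Transformer(d_model=d_model, batch_first=True)  # shared parameters
        self.speech_in = nn.Linear(n_mels, d_model)    # speech input projection
        self.text_in = nn.Embedding(vocab, d_model)    # text input projection
        self.text_out = nn.Linear(d_model, vocab)      # ASR head
        self.mel_out = nn.Linear(d_model, n_mels)      # TTS head

    def forward(self, task, src, tgt):
        if task == "asr":                              # speech -> text
            h = self.backbone(self.speech_in(src), self.text_in(tgt))
            return self.text_out(h)
        else:                                          # "tts": text -> speech
            h = self.backbone(self.text_in(src), self.speech_in(tgt))
            return self.mel_out(h)

model = SharedSpeechTextModel()
mels = torch.randn(2, 200, 80)                          # spectrogram frames
tokens = torch.randint(0, 5000, (2, 30))                # text tokens
asr_logits = model("asr", mels, tokens)                 # (2, 30, vocab)
tts_mels = model("tts", tokens, mels)                   # (2, 200, 80)
loss = nn.CrossEntropyLoss()(asr_logits.transpose(1, 2), tokens) \
     + nn.L1Loss()(tts_mels, mels)                      # equal-weight multi-task loss (assumption)
```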


RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

arXiv.org Artificial Intelligence

Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently, and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models by incorporating relative information from the outset, where each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing the overall performance. The experiments conducted on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.
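The relative input strategy described above can be illustrated with a short sketch: each microphone channel's spectrogram is stacked with a reference channel before being passed to a per-channel encoder. The tensor shapes, the choice of the first channel as reference, and the toy convolutional encoder are assumptions for illustration, not the RelUNet implementation.

```python
import torch
import torch.nn as nn

def stack_with_reference(x, ref_idx=0):
    """x: (batch, channels, freq, time) multichannel spectrograms.
    Returns (batch, channels, 2, freq, time): each channel paired with the reference."""
    ref = x[:, ref_idx:ref_idx + 1].expand_as(x)           # broadcast the reference channel
    return torch.stack([x, ref], dim=2)

# Toy per-channel encoder that now sees two input planes (channel + reference)
encoder = nn.Conv2d(in_channels=2, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(4, 6, 257, 100)                            # 6-channel STFT magnitudes (assumed shape)
paired = stack_with_reference(x)                           # (4, 6, 2, 257, 100)
feats = encoder(paired.flatten(0, 1))                      # encode each channel pair independently
feats = feats.view(4, 6, 16, 257, 100)                     # back to (batch, channels, feat, freq, time)
```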


PALM: Few-Shot Prompt Learning for Audio Language Models

arXiv.org Artificial Intelligence

Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition, matching features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/
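To make the feature-space idea concrete, here is a hedged sketch in the spirit of the method: instead of learning soft tokens at the text-encoder input, a learnable offset is added directly to each class's frozen text-prompt embedding before computing audio-text similarity. The encoder interfaces, dimensions, and temperature are assumptions; the released code at the link above is the authoritative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSpacePrompt(nn.Module):
    """Learns a per-class offset on frozen text-prompt features (illustrative only)."""
    def __init__(self, class_text_feats):                  # (num_classes, dim), precomputed and frozen
        super().__init__()
        self.register_buffer("base", class_text_feats)
        self.offset = nn.Parameter(torch.zeros_like(class_text_feats))  # the only trained weights

    def forward(self, audio_feats, temperature=0.07):      # audio_feats: (batch, dim)
        text_feats = F.normalize(self.base + self.offset, dim=-1)
        audio_feats = F.normalize(audio_feats, dim=-1)
        return audio_feats @ text_feats.t() / temperature  # (batch, num_classes) logits

# Hypothetical few-shot training step
class_text_feats = torch.randn(10, 512)                    # e.g. from a frozen ALM text encoder
model = FeatureSpacePrompt(class_text_feats)
audio_feats = torch.randn(8, 512)                          # from the frozen audio encoder
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(audio_feats), labels)
loss.backward()                                            # only the offsets receive gradients
```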


Mixat: A Data Set of Bilingual Emirati-English Speech

arXiv.org Artificial Intelligence

This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular, to bilingual Emirati speakers who often mix and switch between their local dialect and English. The dataset consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which is in the form of conversations between the host and a guest. Therefore, the collection contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and present some of the features and statistics of the resulting dataset. In addition, we evaluate the performance of pre-trained Arabic and multilingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic variety and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.
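As an example of the kind of ASR evaluation described above (not the paper's exact protocol or data), the snippet below scores a pre-trained multilingual model on a few utterances using word error rate; the model choice, audio paths, and reference transcripts are placeholders.

```python
# Minimal ASR evaluation sketch using Whisper and word error rate.
# Model choice, audio paths, and references are placeholders, not Mixat's actual files.
from transformers import pipeline
from jiwer import wer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

samples = [
    {"audio": "clip_001.wav", "reference": "..."},   # code-switched Emirati-English utterance
    {"audio": "clip_002.wav", "reference": "..."},
]

hypotheses = [asr(s["audio"])["text"] for s in samples]
references = [s["reference"] for s in samples]
print("WER:", wer(references, hypotheses))
```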


Spoken Word2Vec: A Perspective And Some Techniques

arXiv.org Artificial Intelligence

Text word embeddings that encode distributional semantic features work by modeling contextual similarities of frequently occurring words. Acoustic word embeddings, on the other hand, typically encode low-level phonetic similarities. Semantic embeddings for spoken words have been previously explored using similar algorithms to Word2Vec, but the resulting vectors still mainly encoded phonetic rather than semantic features. In this paper, we examine the assumptions and architectures used in previous works and show experimentally how Word2Vec algorithms fail to encode distributional semantics when the input units are acoustically correlated. In addition, previous works relied on the simplifying assumptions of perfect word segmentation and clustering by word type. Given these conditions, a trivial solution identical to text-based embeddings has been overlooked. We follow this simpler path using automatic word type clustering and examine the effects on the resulting embeddings, highlighting the true challenges in this task.
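The simpler path mentioned above can be sketched as a two-step pipeline: cluster acoustic word segments into pseudo word types, then run a standard skip-gram model over the resulting cluster-ID sequences. The clustering features, vocabulary size, and Word2Vec hyperparameters below are assumptions, not the paper's settings.

```python
# Sketch: discretize acoustic word segments into pseudo word types, then run
# skip-gram over the cluster-ID sequences. All sizes are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

rng = np.random.default_rng(0)
segment_embeddings = rng.normal(size=(5000, 128))        # one acoustic embedding per spoken word token
utterance_spans = [slice(i, i + 10) for i in range(0, 5000, 10)]  # which tokens form each utterance

# Step 1: automatic word-type clustering (stand-in for the paper's clustering step)
cluster_ids = KMeans(n_clusters=200, n_init=10, random_state=0).fit_predict(segment_embeddings)

# Step 2: skip-gram over cluster IDs, exactly as one would train over text tokens
corpus = [[str(cluster_ids[t]) for t in range(span.start, span.stop)] for span in utterance_spans]
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
print(w2v.wv.most_similar(str(cluster_ids[0])))
```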


Automatic Restoration of Diacritics for Speech Data Sets

arXiv.org Artificial Intelligence

Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing the parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model, fine-tuned on relatively small amounts of diacritized Arabic speech data, to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for a transformer-based diacritic restoration model. The proposed model consistently improves diacritic restoration performance compared to an equivalent text-only model, with at least a 5\% absolute reduction in diacritic error rate within the same domain and on two out-of-domain test sets. Our results underscore the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.
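A hedged sketch of the two-input setup described above: the undiacritized transcript and the ASR's rough diacritized hypothesis are fed to the restoration model as a single concatenated sequence separated by a marker token, and diacritic labels are predicted for the plain characters. The character vocabulary, separator, and model size are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualInputDiacritizer(nn.Module):
    """Toy transformer encoder tagger that conditions on both the plain transcript
    and an ASR-produced rough diacritized transcript (illustrative assumptions only)."""
    def __init__(self, vocab=120, n_diacritics=16, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_diacritics)       # one diacritic label per input character

    def forward(self, plain_ids, rough_ids, sep_id=1):
        sep = torch.full((plain_ids.size(0), 1), sep_id, dtype=torch.long)
        x = torch.cat([plain_ids, sep, rough_ids], dim=1)  # [plain] <SEP> [rough diacritized]
        h = self.encoder(self.embed(x))
        return self.head(h[:, : plain_ids.size(1)])        # predict labels for the plain characters only

model = DualInputDiacritizer()
plain_ids = torch.randint(2, 120, (4, 50))                 # character IDs of the undiacritized transcript
rough_ids = torch.randint(2, 120, (4, 60))                 # character IDs of the ASR's rough diacritized output
logits = model(plain_ids, rough_ids)                       # (4, 50, n_diacritics)
```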