BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
Basher, Mohammad Jahid Ibna, Kowsher, Md, Islam, Md Saiful, Nandi, Rabindra Nath, Prottasha, Nusrat Jahan, Menon, Mehadi Hasan, Muntasir, Tareq Al, Chowdhury, Shammur Absar, Alam, Firoj, Yousefi, Niloofar, Garibay, Ozlem Ozmen
This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla speaker-adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis using minimal training data. Building upon the XTTS architecture, our approach integrates Bangla into a multilingual TTS pipeline, with modifications to account for the phonetic and linguistic characteristics of the language. We pre-train BnTTS on a 3.85k-hour Bangla speech dataset with corresponding text labels and evaluate performance in both zero-shot and few-shot settings on our proposed test dataset. Empirical evaluations in few-shot settings show that BnTTS significantly improves the naturalness, intelligibility, and speaker fidelity of synthesized Bangla speech. Compared to state-of-the-art Bangla TTS systems, BnTTS exhibits superior performance in Subjective Mean Opinion Score (SMOS), Naturalness, and Clarity metrics.
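Since BnTTS builds on XTTS, its speaker-adaptation usage pattern can be pictured with the open-source Coqui TTS API; the sketch below is illustrative only, and the "bn" language code is an assumption (the stock XTTS v2 checkpoint does not ship Bangla support, and the BnTTS release details are not given here).

```python
# A sketch of XTTS-style speaker conditioning, the mechanism BnTTS
# builds on, using the open-source Coqui TTS API. The checkpoint and
# the "bn" language code are assumptions: stock XTTS v2 has no Bangla.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")  # base XTTS model
tts.tts_to_file(
    text="...",                     # Bangla input text goes here
    speaker_wav="speaker_ref.wav",  # a few seconds of reference speech
    language="bn",                  # assumed language code for Bangla
    file_path="out.wav",
)
```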
GenAI Content Detection Task 2: AI vs. Human -- Academic Essay Authenticity Challenge
Chowdhury, Shammur Absar, Almerekhi, Hind, Kutlu, Mucahid, Keles, Kaan Efe, Ahmad, Fatema, Mohiuddin, Tasnim, Mikros, George, Alam, Firoj
This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.
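As the overview notes, most submissions fine-tuned transformer-based classifiers. A minimal sketch of that dominant recipe follows; the xlm-roberta-base backbone, label convention, and placeholder data are illustrative assumptions, not any participating team's actual system.

```python
# Minimal sketch of the dominant approach reported in the shared task:
# fine-tuning a transformer for binary machine-vs-human classification.
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)
from datasets import Dataset

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = human, 1 = machine

train = Dataset.from_dict({"text": ["..."], "label": [0]})  # placeholder data
train = train.map(lambda b: tok(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="essay-detector", num_train_epochs=3),
    train_dataset=train,
    tokenizer=tok,  # enables padded batching during training
)
trainer.train()
```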
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs
Hasan, Md. Arid, Hasanain, Maram, Ahmad, Fatema, Laskar, Sahinur Rahman, Upadhyay, Sunaya, Sukhadia, Vrunda N, Kutlu, Mucahid, Chowdhury, Shammur Absar, Alam, Firoj
Natural Question Answering (QA) datasets play a crucial role in developing and evaluating the capabilities of large language models (LLMs), ensuring their effective usage in real-world applications. Despite the numerous QA datasets that have been developed, there is a notable lack of region-specific datasets generated by native users in their own languages. This gap hinders the effective benchmarking of LLMs for regional and cultural specificities. In this study, we propose a scalable framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. Moreover, to demonstrate the efficacy of the proposed framework, we designed a multilingual natural QA dataset, MultiNativQA, consisting of ~72K QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers covering 18 topics. We benchmark the MultiNativQA dataset with open- and closed-source LLMs. We have made both the NativQA framework and the MultiNativQA dataset publicly available to the community (https://nativqa.gitlab.io).
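A benchmarking pass over MultiNativQA-style data might look like the sketch below; the file name, JSON fields, and model choice are assumptions made for illustration (the actual released format is documented at https://nativqa.gitlab.io).

```python
# Illustrative loop for benchmarking an LLM on MultiNativQA-style
# question-answer pairs. File name and JSON fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("multinativqa_bn.json") as f:  # hypothetical per-language split
    qa_pairs = json.load(f)

for item in qa_pairs:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": item["question"]}],
    )
    prediction = resp.choices[0].message.content
    # Compare `prediction` against item["answer"] with the metric of choice.
```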
Children's Speech Recognition through Discrete Token Enhancement
Sukhadia, Vrunda N., Chowdhury, Shammur Absar
Children's speech recognition is considered a low-resource task, mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution to the privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input, without significantly degrading ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models' generalization capabilities on unseen-domain and unseen-nativity datasets. Results reveal that the discrete-token ASR for children achieves nearly equivalent performance with an approximately 83% reduction in parameters.
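A common single-view recipe for producing such discrete labels quantizes self-supervised features with k-means; the sketch below illustrates the idea with HuBERT. The layer choice and codebook size are assumptions for illustration, not the paper's reported settings.

```python
# Sketch of single-view discretization: quantize HuBERT hidden states
# with k-means to obtain discrete tokens that drop speaker-sensitive
# detail while keeping linguistic/acoustic content. Values are toy choices.
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoFeatureExtractor, HubertModel

fe = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

wav = torch.randn(10 * 16000)  # stand-in for 10 s of 16 kHz child speech
inputs = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = hubert(**inputs, output_hidden_states=True).hidden_states[6]

# In practice the codebook is fit on features from the whole corpus;
# 100 clusters is a toy size (systems often use 500-2000).
frames = feats.squeeze(0).numpy()
kmeans = MiniBatchKMeans(n_clusters=100).fit(frames)
tokens = kmeans.predict(frames)  # the discrete token sequence for the clip
```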
Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
Nandi, Rabindra Nath, Menon, Mehadi Hasan, Muntasir, Tareq Al, Sarker, Sagor, Muhtaseem, Quazi Sarwar, Islam, Md. Tariqul, Chowdhury, Shammur Absar, Alam, Firoj
One of the major challenges in developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hour labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data, among others. Our results demonstrate the efficacy of the model trained on pseudo-labeled data on the designed test set as well as on publicly available Bangla datasets. The experimental resources will be publicly available. (https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR)
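The core pseudo-labeling loop can be pictured as below: a seed ASR model transcribes unlabeled audio, and the transcripts become training labels. The seed checkpoint and the quality filter here are placeholders; the paper's actual pipeline (segmentation, filtering, domain coverage) is considerably more involved.

```python
# Sketch of the pseudo-labeling recipe: run a seed ASR model over
# unlabeled audio and keep its transcripts as training labels.
# The seed model and filtering rule are assumptions, not the paper's.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small")  # hypothetical seed model

unlabeled = ["clip_0001.wav", "clip_0002.wav"]  # unlabeled Bangla audio
pseudo_labeled = []
for path in unlabeled:
    text = asr(path)["text"]
    if text.strip():  # stand-in for a real confidence/quality filter
        pseudo_labeled.append({"audio": path, "text": text})
# `pseudo_labeled` then joins the supervised set for conformer training.
```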
Automatic Pronunciation Assessment -- A Review
Kheir, Yassine El, Ali, Ahmed, Chowdhury, Shammur Absar
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth of language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment at both the phonemic and prosodic levels. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
L1-aware Multilingual Mispronunciation Detection Framework
Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
The phonological discrepancies between a speaker's native (L1) and non-native (L2) languages serve as a major factor in mispronunciation. This paper introduces a novel multilingual mispronunciation detection and diagnosis (MDD) architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup to identify the L1 and L2 languages, and are infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets. The consistent gains in phoneme error rate (PER) and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
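A toy PyTorch rendering of the fusion-plus-CTC idea is sketched below; all dimensions, the LSTM encoder, and the concatenation-based fusion are invented stand-ins, not the paper's actual architecture.

```python
# Toy sketch of the L1-MultiMDD idea: frame-level encoder outputs are
# conditioned on an utterance-level L1-L2 embedding and trained with
# CTC against the reference phoneme sequence. Dimensions are invented.
import torch
import torch.nn as nn

class ToyMDD(nn.Module):
    def __init__(self, feat_dim=80, l1_dim=256, hidden=512, n_phonemes=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(hidden + l1_dim, hidden)  # infuse L1-L2 embedding
        self.head = nn.Linear(hidden, n_phonemes + 1)   # +1 for the CTC blank

    def forward(self, speech, l1_emb):
        h, _ = self.encoder(speech)                      # (B, T, hidden)
        l1 = l1_emb.unsqueeze(1).expand(-1, h.size(1), -1)
        h = torch.tanh(self.fuse(torch.cat([h, l1], dim=-1)))
        return self.head(h).log_softmax(-1)              # CTC log-probs

model = ToyMDD()
logp = model(torch.randn(2, 120, 80), torch.randn(2, 256))
loss = nn.CTCLoss(blank=100)(logp.transpose(0, 1),          # (T, B, C)
                             torch.randint(0, 100, (2, 30)), # phoneme targets
                             torch.full((2,), 120),          # input lengths
                             torch.full((2,), 30))           # target lengths
```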
The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
Research on pronunciation assessment systems focuses on utilizing phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within the non-verbal cues. In this study, we proposed a novel pronunciation assessment framework, IntraVerbalPA.
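The kind of frame-level non-verbal cues involved (e.g., energy and pitch) can be extracted as in the sketch below; the librosa settings are arbitrary, and the paper's exact feature set is not reproduced here.

```python
# Sketch of extracting simple non-verbal cues (energy, pitch) of the
# kind a framework like IntraVerbalPA could fuse with verbal features.
# Frame settings are arbitrary illustrations, not the paper's choices.
import librosa
import numpy as np

wav, sr = librosa.load("utterance.wav", sr=16000)
energy = librosa.feature.rms(y=wav)[0]                       # frame energy
f0, voiced, _ = librosa.pyin(wav, fmin=65, fmax=400, sr=sr)  # pitch contour
f0 = np.nan_to_num(f0)                                       # unvoiced -> 0

n = min(len(energy), len(f0))          # align the two frame sequences
cues = np.stack([energy[:n], f0[:n]])  # (2, frames) non-verbal cue matrix
```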
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
Dalvi, Fahim, Hasanain, Maram, Boughorbel, Sabri, Mousi, Basel, Abdaljalil, Samir, Nazar, Nizi, Abdelali, Ahmed, Chowdhury, Shammur Absar, Mubarak, Hamdy, Ali, Ahmed, Hawasly, Majd, Durrani, Nadir, Alam, Firoj
The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, customizing them for specific tasks and datasets remains complex for many users. In this study, we introduce the LLMeBench framework. Initially developed to evaluate Arabic NLP tasks using OpenAI's GPT and BLOOM models, it can be seamlessly customized for any NLP task and model, regardless of language. The framework also features zero- and few-shot learning settings. A new custom dataset can be added in less than 10 minutes, and users can use their own model API keys to evaluate the task at hand. The framework has already been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We plan to open-source the framework for the community (https://github.com/qcri/LLMeBench/). A video demonstrating the framework is available online (https://youtu.be/FkQn4UjYA0s).
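The sketch below is not LLMeBench's actual API; it only illustrates the zero- vs. few-shot prompting pattern such a framework wraps (dataset, prompt, model API, metric), using a generic OpenAI client with a placeholder task.

```python
# Generic illustration of zero- vs. few-shot evaluation, the pattern
# LLMeBench supports. Not the framework's API; see the repository for
# real task and dataset asset definitions.
from openai import OpenAI

client = OpenAI()  # users supply their own model API keys

def classify(text, few_shot_examples=()):
    messages = [{"role": "system",
                 "content": "Label the sentiment as pos or neg."}]
    for ex_text, ex_label in few_shot_examples:  # empty tuple = zero-shot
        messages.append({"role": "user", "content": ex_text})
        messages.append({"role": "assistant", "content": ex_label})
    messages.append({"role": "user", "content": text})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```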
MyVoice: Arabic Speech Resource Collaboration Platform
Elshahawy, Yousseif, Kheir, Yassine El, Chowdhury, Shammur Absar, Ali, Ahmed
We introduce MyVoice, a crowdsourcing platform designed to collect Arabic speech to enhance dialectal speech technologies. The platform offers an opportunity to design large dialectal speech datasets and make them publicly available. MyVoice allows contributors to select a city- or country-level fine-grained dialect and record the displayed utterances. Users can switch roles between contributor and annotator. The platform incorporates a quality assurance system that filters out low-quality and spurious recordings before sending them for validation. During the validation phase, contributors can assess the quality of recordings, annotate them, and provide feedback, which is then reviewed by administrators. Furthermore, the platform gives administrators the flexibility to add new data or tasks beyond dialectal speech and word collection, which are then displayed to contributors, enabling collaborative efforts in gathering diverse, large-scale Arabic speech data.