
Collaborating Authors: Kulkarni, Ajinkya


Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems

arXiv.org Artificial Intelligence

Ajinkya Kulkarni (IDIAP, Switzerland; MBZUAI, UAE), Atharva Kulkarni (Erisha Labs, India), Miguel Couceiro (Université de Lorraine, CNRS, LORIA, Nancy, France; INESC-ID, IST, Universidade de Lisboa, Portugal), Isabel Trancoso (INESC-ID, IST, Universidade de Lisboa, Portugal)

In this paper, we present a bias- and sustainability-focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performance. Despite their improved performance in controlled settings, there remains a critical gap in understanding their efficacy and equity in real-world scenarios. In addition, we examine the environmental impact of ASR systems, scrutinizing the effect of large acoustic models on carbon emissions and energy consumption. We also provide insights from our empirical analyses, offering a valuable contribution to the claims surrounding bias and sustainability in ASR systems. Index Terms: ASR, bias, carbon footprint, sustainability.
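
The measurement setup described above (per-group error rates plus energy and carbon accounting) could be sketched roughly as follows, assuming the openai-whisper, jiwer, and codecarbon packages; the audio paths, reference transcripts, and demographic labels are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch: per-group WER plus energy/CO2 tracking for a Whisper run.
# The audio paths, references, and group labels below are hypothetical.
import whisper
import jiwer
from codecarbon import EmissionsTracker

samples = [  # (audio_path, reference_transcript, demographic_group)
    ("audio/f_001.wav", "reference transcript one", "female"),
    ("audio/m_001.wav", "reference transcript two", "male"),
]

tracker = EmissionsTracker()          # logs energy use and estimated CO2-eq
tracker.start()
model = whisper.load_model("large-v2")

hyps, refs, groups = [], [], []
for path, ref, group in samples:
    result = model.transcribe(path)
    hyps.append(result["text"].lower())
    refs.append(ref.lower())
    groups.append(group)

emissions_kg = tracker.stop()         # total kg CO2-eq estimated for the run

for g in sorted(set(groups)):
    idx = [i for i, x in enumerate(groups) if x == g]
    wer = jiwer.wer([refs[i] for i in idx], [hyps[i] for i in idx])
    print(f"WER ({g}): {wer:.3f}")
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2-eq")
```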


Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous work has addressed this through speaking-rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV.
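
The evaluation step described above (scoring converted audio with an ASR model pre-trained on typical speech, without any fine-tuning) could be sketched as follows; the Wav2Vec2 checkpoint, file paths, and reference transcript are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: score converted utterances with an ASR model pre-trained on
# typical (healthy) read speech, without fine-tuning. File paths and the
# reference transcript are hypothetical placeholders.
import torch
import librosa
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe(path):
    wav, _ = librosa.load(path, sr=16000)                 # model expects 16 kHz audio
    inputs = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

reference = "THE QUICK BROWN FOX"                          # placeholder transcript
wer_before = jiwer.wer(reference, transcribe("dysarthric_utt.wav"))
wer_after = jiwer.wer(reference, transcribe("converted_utt.wav"))
print(f"WER before conversion: {wer_before:.3f}, after: {wer_after:.3f}")
```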


The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese

arXiv.org Artificial Intelligence

In the field of spoken language understanding, systems like Whisper and Massively Multilingual Speech (MMS) have shown state-of-the-art performance. This study is dedicated to a comprehensive exploration of the Whisper and MMS systems, with a focus on assessing biases in automatic speech recognition (ASR) inherent to casual conversational speech specific to the Portuguese language. Our investigation encompasses various categories, including gender, age, skin tone, and geo-location. Alongside traditional ASR evaluation metrics such as Word Error Rate (WER), we incorporate p-value-based statistical significance testing for gender bias analysis. Furthermore, we extensively examine the impact of data distribution and empirically show that oversampling techniques alleviate such stereotypical biases. This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems' performance in multilingual settings.
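
A rough sketch of the kind of per-group comparison and significance test described above is given below; the Mann-Whitney U test and the naive oversampling step are illustrative choices, not necessarily the exact procedure used in the paper, and the records are placeholder data.

```python
# Minimal sketch: compare per-utterance WER across gender groups, test the
# difference for significance, and oversample the under-represented group.
import random
import jiwer
from scipy.stats import mannwhitneyu

# Placeholder per-utterance records: (reference, hypothesis, gender).
records = [
    ("olá bom dia", "ola bom dia", "female"),
    ("tudo bem contigo", "tudo bem comigo", "male"),
    # ...
]

def per_utt_wer(recs):
    return [jiwer.wer(ref, hyp) for ref, hyp, _ in recs]

female = [r for r in records if r[2] == "female"]
male = [r for r in records if r[2] == "male"]

stat, p_value = mannwhitneyu(per_utt_wer(female), per_utt_wer(male))
print(f"p-value for gender WER difference: {p_value:.4f}")

# Naive oversampling of the under-represented group before fine-tuning:
minority, majority = (female, male) if len(female) < len(male) else (male, female)
balanced = majority + random.choices(minority, k=len(majority))
```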


Training Convolutional Neural Networks with the Forward-Forward algorithm

arXiv.org Artificial Intelligence

The recent successes in analyzing images with deep neural networks are almost exclusively achieved with Convolutional Neural Networks (CNNs). The training of these CNNs, and in fact of all deep neural network architectures, uses the backpropagation algorithm, where the output of the network is compared with the desired result and the difference is then used to tune the weights of the network towards the desired outcome. In a 2022 preprint, Geoffrey Hinton suggested an alternative training method that passes the desired results together with the images at the input of the network. This so-called Forward-Forward (FF) algorithm has up to now only been used in fully connected networks. In this paper, we show how the FF paradigm can be extended to CNNs. Our FF-trained CNN, featuring a novel spatially-extended labeling technique, achieves a classification accuracy of 99.16% on the MNIST hand-written digits dataset. We show how different hyperparameters affect the performance of the proposed algorithm and compare the results with a CNN trained with the standard backpropagation approach. Furthermore, we use Class Activation Maps to investigate which types of features are learned by the FF algorithm.
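
A rough sketch of a single locally trained convolutional FF layer is shown below. It follows Hinton's original recipe (goodness as the mean squared activation, positive versus negative samples with labels embedded in the input) rather than the paper's spatially-extended labeling scheme, and the hyperparameters are illustrative assumptions.

```python
# Sketch of one Forward-Forward convolutional layer trained locally, without
# backpropagation across layers. Label embedding into x_pos / x_neg is left to
# the caller; hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFConvLayer(nn.Module):
    def __init__(self, in_ch, out_ch, lr=0.03, threshold=2.0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.opt = torch.optim.Adam(self.conv.parameters(), lr=lr)
        self.threshold = threshold

    def forward(self, x):
        return F.relu(self.conv(x))

    def goodness(self, x):
        # Goodness of a sample: mean of squared activations of this layer.
        return self.forward(x).pow(2).mean(dim=(1, 2, 3))

    def train_step(self, x_pos, x_neg):
        # Positive samples (correct label embedded in the input) should get high
        # goodness; negative samples (wrong label) should get low goodness.
        g_pos, g_neg = self.goodness(x_pos), self.goodness(x_neg)
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach outputs so the next layer is trained independently.
        with torch.no_grad():
            return self.forward(x_pos), self.forward(x_neg)

# Usage: stack several FFConvLayer objects and call train_step layer by layer,
# feeding each layer the detached outputs of the previous one.
```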


ArTST: Arabic Text and Speech Transformer

arXiv.org Artificial Intelligence

We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. The model architecture follows the unified-modal framework SpeechT5, which was recently released for English, and is focused on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data, and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification. In our experiments comparing ArTST with SpeechT5, as well as with previously reported results on these tasks, ArTST performs on par with or exceeds the current state of the art in all three tasks. Moreover, we find that our pre-training is conducive to generalization, which is particularly evident in the low-resource TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models are released for research use.
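
As a rough sketch of speech-to-text inference in the SpeechT5 framework that ArTST builds on, the snippet below uses Hugging Face's SpeechT5 classes with the English "microsoft/speecht5_asr" checkpoint as a stand-in; an ArTST ASR checkpoint would be loaded the same way once its model identifier is known, and the audio file is a hypothetical placeholder.

```python
# Minimal sketch of SpeechT5-style speech-to-text inference. The English
# checkpoint is a stand-in for an ArTST ASR checkpoint; "utterance.wav" is a
# placeholder audio file.
import librosa
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

wav, _ = librosa.load("utterance.wav", sr=16000)           # model expects 16 kHz audio
inputs = processor(audio=wav, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(**inputs, max_length=200)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```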


Adapting the adapters for code-switching in multilingual ASR

arXiv.org Artificial Intelligence

Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multilingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10% absolute reduction in CER across all test sets.
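
A rough sketch of frame-level mixing of two language adapters with a learned gate is shown below; it illustrates the general idea of letting a latent per-frame indicator route information between adapters, not the paper's exact formulation, and the dimensions are illustrative assumptions.

```python
# Sketch: mix two bottleneck language adapters with a learned per-frame gate.
# A rough illustration of frame-level code-switch gating, not the paper's exact method.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class GatedDualAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.adapter_a = Adapter(dim, bottleneck)   # e.g. the matrix language adapter
        self.adapter_b = Adapter(dim, bottleneck)   # e.g. the English adapter
        self.gate = nn.Linear(dim, 1)               # per-frame language indicator

    def forward(self, x):                           # x: (batch, frames, dim)
        p = torch.sigmoid(self.gate(x))             # soft latent binary sequence
        return p * self.adapter_a(x) + (1 - p) * self.adapter_b(x)
```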


Another Generic Setting for Entity Resolution: Basic Theory

arXiv.org Artificial Intelligence

Benjelloun et al. [BGSWW] considered the Entity Resolution (ER) problem as the generic process of matching and merging entity records judged to represent the same real-world object. They treated the functions for matching and merging entity records as black boxes and introduced four important properties that enable efficient generic ER algorithms. In this paper, we study the properties which match and merge functions share, model matching and merging black boxes for ER in a partial groupoid, based on the properties that match and merge functions satisfy, and show that a partial groupoid provides another generic setting for ER. The natural partial order on a partial groupoid is defined when the partial groupoid satisfies Idempotence and Catenary associativity. Given a partial order on a partial groupoid, the least upper bound and compatibility ($LU_{pg}$ and $CP_{pg}$) properties are equivalent to Idempotence, Commutativity, Associativity, and Representativity, and the partial order must be the natural one we defined when the domain of the partial operation is reflexive. The partiality of a partial groupoid can be reduced using connected components and clique covers of its domain graph, and a noncommutative partial groupoid can be mapped to a commutative one homomorphically if it has partial idempotent semigroup-like structures. In a finitely generated partial groupoid $(P,D,\circ)$ without any conditions required, the ER we are concerned with consists of the full elements in $P$. If $(P,D,\circ)$ satisfies Idempotence and Catenary associativity, then the ER consists of the maximal elements in $P$, which are full elements and form the ER defined in [BGSWW]. Furthermore, in that case, since there is a transitive binary order, we consider ER as "sorting, selecting, and querying the elements in a finitely generated partial groupoid".
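
For reference, the four properties of Benjelloun et al. alluded to above can be stated as follows for a match relation $\approx$ and a merge function $\langle\cdot,\cdot\rangle$; this is a paraphrase of the commonly cited ICAR formulation, and the notation may differ slightly from [BGSWW].

```latex
% ICAR properties for match (\approx) and merge (\langle\cdot,\cdot\rangle),
% paraphrased; notation may differ slightly from [BGSWW].
\begin{align*}
\textbf{Idempotence:}      &\quad \forall r:\ r \approx r \ \text{and}\ \langle r, r\rangle = r\\
\textbf{Commutativity:}    &\quad \forall r_1, r_2:\ r_1 \approx r_2 \iff r_2 \approx r_1;\
                                   \text{if } r_1 \approx r_2 \text{ then } \langle r_1, r_2\rangle = \langle r_2, r_1\rangle\\
\textbf{Associativity:}    &\quad \forall r_1, r_2, r_3:\ \text{if both sides exist, }
                                   \langle r_1, \langle r_2, r_3\rangle\rangle = \langle\langle r_1, r_2\rangle, r_3\rangle\\
\textbf{Representativity:} &\quad \text{if } r_3 = \langle r_1, r_2\rangle \ \text{and}\ r_1 \approx r_4,
                                   \ \text{then } r_3 \approx r_4
\end{align*}
```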


ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus

arXiv.org Artificial Intelligence

At present, text-to-speech (TTS) systems that are trained with high-quality transcribed speech data using end-to-end neural models can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained with relatively large single-speaker professionally recorded audio, typically extracted from audiobooks. Meanwhile, due to the scarcity of freely available speech corpora of this kind, a large gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training, as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small in size and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz. In this paper, we describe the process of corpus creation and provide details of corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at www.clartts.com for research purposes, along with demos of the baseline TTS systems.
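
The segmentation step in the corpus-creation process described above could look roughly like the sketch below, which splits a long audiobook recording on silence; the thresholds, duration limits, and filenames are illustrative assumptions, not the actual ClArTTS pipeline settings.

```python
# Minimal sketch of audiobook segmentation for corpus creation: split a long
# LibriVox recording on silence into utterance-sized clips. All parameters and
# filenames are hypothetical, not the ClArTTS pipeline's settings.
import os
import librosa
import soundfile as sf

wav, sr = librosa.load("librivox_chapter.mp3", sr=None)   # keep the native sample rate
intervals = librosa.effects.split(wav, top_db=40)          # non-silent regions
os.makedirs("clips", exist_ok=True)

for i, (start, end) in enumerate(intervals):
    clip = wav[start:end]
    if 1.0 <= len(clip) / sr <= 15.0:                      # keep utterance-sized segments
        sf.write(f"clips/clartts_{i:05d}.wav", clip, sr)
```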