Collaborating Authors

 ramakrishnan


ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift

Neural Information Processing Systems

The recent success of deep learning frameworks in applications such as image classification [9], speech recognition [20], and object detection [13] stems primarily from the availability of large amounts of labeled data.


Automatic Speech Recognition for Sanskrit with Transfer Learning

arXiv.org Artificial Intelligence

Sanskrit, one of humanity's most ancient languages, has a vast collection of books and manuscripts on diverse topics accumulated over millennia. However, its digital content (audio and text), which is vital for training AI systems, is profoundly limited. Furthermore, its intricate linguistics make it hard to develop robust NLP tools for wider accessibility. Given these constraints, we have developed an automatic speech recognition model for Sanskrit by applying transfer learning to OpenAI's Whisper model. After carefully optimising the hyper-parameters, we obtained promising results, with our transfer-learned model achieving a word error rate of 15.42% on the Vaksancayah dataset. An online demo of our model is publicly available so that its performance can be evaluated firsthand, paving the way for improved accessibility and technological support for Sanskrit learning in the modern era.
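For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; real evaluations
    usually normalise text (case, punctuation) first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a word error rate of 15.42% corresponds to roughly 15 word-level errors per 100 reference words.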


Automatic Speech Recognition for Hindi

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.
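The abstract does not say which voice activity detection method the interface uses. As an illustration only, the simplest family of detectors thresholds the short-term energy of each frame; the sketch below assumes 16 kHz mono samples scaled to [-1, 1], and the frame length, threshold, and function name `energy_vad` are hypothetical choices rather than values from the paper.

```python
import math


def energy_vad(samples, sample_rate=16_000, frame_ms=30, threshold_db=-35.0):
    """Flag each non-overlapping frame as speech (True) or non-speech (False).

    Assumptions: `samples` is a sequence of floats in [-1, 1];
    frame_ms and threshold_db are illustrative defaults.
    """
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    # Trailing samples that do not fill a whole frame are ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the RMS so that pure silence does not produce log10(0).
        level_db = 20 * math.log10(max(rms, 1e-10))
        flags.append(level_db > threshold_db)
    return flags


if __name__ == "__main__":
    silence = [0.0] * 16_000
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
    flags = energy_vad(silence + tone)
    print(sum(flags), "of", len(flags), "frames flagged as active")
```

Only frames flagged as active would need to be sent to the recognition engine, which is where the computation and bandwidth savings described above come from. Practical detectors usually add smoothing or a trained classifier on top of a rule like this.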


Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

arXiv.org Artificial Intelligence

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). More than 800 participants from 13 cities worldwide performed these activities in 131 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,422 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources will be open sourced to fuel new research in the community.


Learning-Augmented Model-Based Planning for Visual Exploration

arXiv.org Artificial Intelligence

We consider the problem of robotic exploration in previously unseen environments under a predefined time limit. We propose a novel exploration approach using learning-augmented model-based planning. We generate a set of subgoals associated with frontiers on the current map and derive a Bellman Equation for exploration with these subgoals. Visual sensing and advances in semantic mapping of indoor scenes are exploited for training a deep convolutional neural network to estimate properties associated with each frontier: the expected unobserved area beyond the frontier and the expected timesteps (discretized actions) required to explore it. The proposed model-based planner is guaranteed to explore the whole scene if time permits. We thoroughly evaluate our approach on a large-scale pseudo-realistic indoor dataset (Matterport3D) with the Habitat simulator. We compare our approach with classical and more recent RL-based exploration methods. Our approach surpasses the greedy strategies by 2.1% and the RL-based exploration methods by 8.4% in terms of coverage.
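The abstract does not reproduce the paper's Bellman Equation, so the following is only a simplified sketch of the general idea: given per-frontier estimates of expected unobserved area and expected timesteps, a Bellman-style recursion picks the next frontier that maximises total expected area within the remaining time budget. Travel cost between frontiers is ignored, and the names `plan`, `frontiers`, and the numbers in the example are hypothetical.

```python
from functools import lru_cache


def plan(frontiers, budget):
    """Return (best expected area, first frontier to visit).

    frontiers: dict mapping a frontier name to a pair
               (expected_area, expected_steps), standing in for the
               network's predictions described in the abstract.
    budget:    number of timesteps still available.
    Assumption: moving between frontiers costs nothing.
    """

    @lru_cache(maxsize=None)
    def value(remaining, time_left):
        best_area, best_first = 0.0, None
        for name in sorted(remaining):
            area, steps = frontiers[name]
            if steps <= time_left:
                # Bellman-style backup: immediate gain plus the best
                # achievable gain from the remaining frontiers and time.
                future_area, _ = value(remaining - {name}, time_left - steps)
                total = area + future_area
                if total > best_area:
                    best_area, best_first = total, name
        return best_area, best_first

    return value(frozenset(frontiers), budget)


if __name__ == "__main__":
    example = {"a": (10.0, 5), "b": (6.0, 3), "c": (5.0, 3)}
    print(plan(example, budget=6))
```

In this toy example a greedy planner would pick frontier "a" for its larger immediate area, while the recursion prefers "b" followed by "c", which together reveal more area within the same six timesteps.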


Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning

arXiv.org Artificial Intelligence

End-to-end text-to-speech (TTS) systems have been developed for European languages like English and Spanish with state-of-the-art speech quality, prosody, and naturalness. However, development of end-to-end TTS for Indian languages is lagging behind in terms of quality. The challenges involved in such a task are: 1) scarcity of quality training data; 2) low efficiency during training and inference; 3) slow convergence in the case of large vocabulary size. In this paper, we investigate fine-tuning the English-pretrained Tacotron2 model with limited Sanskrit data to synthesize natural-sounding speech in Sanskrit in low-resource settings. Our experiments show encouraging results, achieving an overall MOS of 3.38 from 37 evaluators with good knowledge of spoken Sanskrit. This is a strong result given that only 2.5 hours of speech data were used.


IT enters the era of intelligent automation

#artificialintelligence

Since the outset of the pandemic, organizations have been increasingly launching initiatives aimed at automating business processes, turning to technologies such as robotic process automation (RPA) in efforts to reduce costs, speed up tasks, and improve accuracy of core business operations. Some leading organizations, however, are not stopping there. Seeking to push their automation agendas forward, they are embracing a move toward broader "intelligent automation" (IA), a strategy that weaves capabilities such as artificial intelligence (AI) and machine learning (ML) into standard RPA to enhance its functionality. In addition to RPA, AI, and ML, intelligent automation strategies can also incorporate a mix of technologies such as natural language processing, chatbots, and others that complement each other, says Lakshmanan Chidambaram, president of Americas strategic verticals at global IT consulting firm Tech Mahindra. "These technologies together allow us to automate business processes to a larger extent, when compared to simple RPA automations," Chidambaram says.


Towards Improving Adversarial Training of NLP Models

arXiv.org Artificial Intelligence

Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models' performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP models, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results empirically show that it is possible to train robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model's robustness to the attack it was originally trained with and also defend the model against other types of word substitution attacks. Furthermore, we show that A2T can improve NLP models' standard accuracy, cross-domain generalization, and interpretability. Code is available at https://github.com/QData/Textattack-A2T .


AI's human protein database a 'great leap' for research - Tech Wire Asia

#artificialintelligence

Scientists last month unveiled the most exhaustive database yet of the proteins that form the building blocks of life, in a breakthrough that observers said would "fundamentally change biological research". Every cell in every living organism is triggered to perform its function by proteins that deliver constant instructions to maintain health and ward off infection. Unlike the genome -- the complete sequence of human genes that encode cellular life -- the human proteome is constantly changing in response to genetic instructions and environmental stimuli. Understanding how proteins operate -- the shape in which they end up, or "fold" into -- within cells has fascinated scientists for decades. But determining each protein's precise function through direct experimentation is painstaking.


AI's human protein database a 'great leap' for research

#artificialintelligence

Scientists on Thursday unveiled the most exhaustive database yet of the proteins that form the building blocks of life, in a breakthrough observers said would "fundamentally change biological research". Every cell in every living organism is triggered to perform its function by proteins that deliver constant instructions to maintain health and ward off infection. Unlike the genome -- the complete sequence of human genes that encode cellular life -- the human proteome is constantly changing in response to genetic instructions and environmental stimuli. Understanding how proteins operate -- the shape in which they end up, or "fold" into -- within cells has fascinated scientists for decades. But determining each protein's precise function through direct experimentation is painstaking.