AITopics | Jain, Yash

Collaborating Authors

Jain, Yash

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

RiTTA: Modeling Event Relations in Text-to-Audio Generation

He, Yuhang, Jain, Yash, Liu, Xubo, Markham, Andrew, Vineet, Vibhav

arXiv.org Artificial IntelligenceJan-4-2025

Despite significant advancements in Text-to-Audio (TTA) generation models achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. However, previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models ability to model audio events relation. Code is available at: https://github.com/yuhanghe01/RiTTA

machine learning, natural language, relation, (18 more...)

arXiv.org Artificial Intelligence

2412.15922

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech (0.68)
(2 more...)

Add feedback

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Jain, Yash, Chan, David, Dheram, Pranav, Khare, Aparna, Shonibare, Olabanji, Ravichandran, Venkatesh, Ghosh, Shalini

arXiv.org Artificial IntelligenceMar-28-2024

Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.

artificial intelligence, machine learning, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2403.19822

Country:

North America > United States > California (0.14)
Asia (0.14)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

PEEKABOO: Interactive Video Generation via Masked-Diffusion

Jain, Yash, Nasery, Anshul, Vineet, Vibhav, Behl, Harkirat

arXiv.org Artificial IntelligenceDec-12-2023

Recently there has been a lot of progress in text-to-video generation, with state-of-the-art models being capable of generating high quality, realistic videos. However, these models lack the capability for users to interactively control and generate videos, which can potentially unlock new areas of application. As a first step towards this goal, we tackle the problem of endowing diffusion-based video generation models with interactive spatio-temporal control over their output. To this end, we take inspiration from the recent advances in segmentation literature to propose a novel spatio-temporal masked attention module - Peekaboo. This module is a training-free, no-inference-overhead addition to off-the-shelf video generation models which enables spatio-temporal control. We also propose an evaluation benchmark for the interactive video generation task. Through extensive qualitative and quantitative evaluation, we establish that Peekaboo enables control video generation and even obtains a gain of upto 3.8x in mIoU over baseline models.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2312.07509

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off

Kuhar, Sachit, Jain, Yash, Tumanov, Alexey

arXiv.org Artificial IntelligenceDec-3-2023

Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key algorithmic techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose Signed Binarization, a unified co-design framework that synergistically integrates hardware-software systems, quantization functions, and representation learning techniques to address this trade-off. Our results demonstrate that Signed Binarization is more accurate than binarization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters, both of the same type, for a DNN block. Finally, our approach achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods for ResNet 18, presenting an alternative solution for deploying efficient models in resource-limited environments.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2312.01581

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

ColloSSL: Collaborative Self-Supervised Learning for Human Activity Recognition

Jain, Yash, Tang, Chi Ian, Min, Chulhong, Kawsar, Fahim, Mathur, Akhil

arXiv.org Artificial IntelligenceFeb-1-2022

A major bottleneck in training robust Human-Activity Recognition models (HAR) is the need for large-scale labeled sensor datasets. Because labeling large amounts of sensor data is an expensive task, unsupervised and semi-supervised learning techniques have emerged that can learn good features from the data without requiring any labels. In this paper, we extend this line of research and present a novel technique called Collaborative Self-Supervised Learning (ColloSSL) which leverages unlabeled data collected from multiple devices worn by a user to learn high-quality features of the data. A key insight that underpins the design of ColloSSL is that unlabeled sensor datasets simultaneously captured by multiple devices can be viewed as natural transformations of each other, and leveraged to generate a supervisory signal for representation learning. We present three technical innovations to extend conventional self-supervised learning algorithms to a multi-device setting: a Device Selection approach which selects positive and negative devices to enable contrastive learning, a Contrastive Sampling algorithm which samples positive and negative examples in a multi-device setting, and a loss function called Multi-view Contrastive Loss which extends standard contrastive loss to a multi-device setting. Our experimental results on three multi-device datasets show that ColloSSL outperforms both fully-supervised and semi-supervised learning techniques in majority of the experiment settings, resulting in an absolute increase of upto 7.9% in F_1 score compared to the best performing baselines. We also show that ColloSSL outperforms the fully-supervised methods in a low-data regime, by just using one-tenth of the available labeled data in the best case.

machine learning, teaching medhods, teaching method, (19 more...)

arXiv.org Artificial Intelligence

2202.00758

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > Wisconsin (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.93)
Research Report > Promising Solution (0.65)

Industry:

Information Technology (1.00)
Education (0.93)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

Integrating Transductive And Inductive Embeddings Improves Link Prediction Accuracy

Gupta, Chitrank, Jain, Yash, De, Abir, Chakrabarti, Soumen

arXiv.org Artificial IntelligenceAug-23-2021

In recent years, inductive graph embedding models, \emph{viz.}, graph neural networks (GNNs) have become increasingly accurate at link prediction (LP) in online social networks. The performance of such networks depends strongly on the input node features, which vary across networks and applications. Selecting appropriate node features remains application-dependent and generally an open question. Moreover, owing to privacy and ethical issues, use of personalized node features is often restricted. In fact, many publicly available data from online social network do not contain any node features (e.g., demography). In this work, we provide a comprehensive experimental analysis which shows that harnessing a transductive technique (e.g., Node2Vec) for obtaining initial node representations, after which an inductive node embedding technique takes over, leads to substantial improvements in link prediction accuracy. We demonstrate that, for a wide variety of GNN variants, node representation vectors obtained from Node2Vec serve as high quality input features to GNNs, thereby improving LP performance.

information management, neural network, node, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3459637.3482125

2108.10108

Country: North America > United States > California > Santa Clara County (0.14)

Genre: Research Report (0.50)

Industry: Information Technology > Services (0.70)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
(2 more...)

Add feedback