
 Zadeh, Amir


PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference

arXiv.org Artificial Intelligence

As large language models (LLMs) tackle increasingly complex tasks and longer documents, their computational and memory costs during inference become a major bottleneck. To address this, we propose PromptDistill, a novel, training-free method that improves inference efficiency while preserving generation quality. PromptDistill identifies and retains the most informative tokens by leveraging attention interactions in early layers, preserving their hidden states while reducing the computational burden in later layers. This allows the model to focus on essential contextual information without fully processing all tokens. Unlike previous methods such as H2O and SnapKV, which perform compression only after processing the entire input, or GemFilter, which selects a fixed portion of the initial prompt without considering contextual dependencies, PromptDistill dynamically allocates computational resources to the most relevant tokens while maintaining a global awareness of the input. Experiments using our method and baseline approaches with base models such as LLaMA 3.1 8B Instruct, Phi 3.5 Mini Instruct, and Qwen2 7B Instruct on benchmarks including LongBench, InfBench, and Needle in a Haystack demonstrate that PromptDistill significantly improves efficiency while having minimal impact on output quality compared to the original models. With a single-stage selection strategy, PromptDistill effectively balances performance and efficiency, outperforming prior methods like GemFilter, H2O, and SnapKV due to its superior ability to retain essential information. Specifically, compared to GemFilter, PromptDistill achieves an overall $1\%$ to $5\%$ performance improvement while also offering better time efficiency. Additionally, we explore multi-stage selection, which further improves efficiency while maintaining strong generation performance.
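
The abstract only sketches the selection mechanism at a high level, so the following is an illustrative toy sketch, not the authors' implementation: it scores each prompt token by the attention the final query token pays to it at an intermediate layer (averaged over heads, a plausible but assumed scoring rule) and keeps only the hidden states of the top-scoring tokens for later layers.

```python
# Illustrative sketch (not the PromptDistill code): query-based token selection
# at an intermediate layer. The layer choice, head-averaging, and top-k rule are
# assumptions made for illustration only.
import torch

def select_tokens(hidden, q_proj, k_proj, keep: int, num_heads: int):
    """Score each prompt token by the attention the last token pays to it,
    then keep the hidden states of the top-`keep` tokens."""
    bsz, seq_len, dim = hidden.shape
    head_dim = dim // num_heads
    q = q_proj(hidden[:, -1:]).view(bsz, 1, num_heads, head_dim).transpose(1, 2)
    k = k_proj(hidden).view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)
    # (bsz, heads, 1, seq_len) -> average over heads -> (bsz, seq_len)
    scores = (q @ k.transpose(-1, -2) / head_dim**0.5).softmax(-1).mean(1).squeeze(1)
    idx = scores.topk(keep, dim=-1).indices.sort(-1).values  # keep original token order
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, dim)), idx

# Toy usage: one 512-token sequence, hidden size 64, 4 heads, keep 128 tokens.
torch.manual_seed(0)
hidden = torch.randn(1, 512, 64)
q_proj, k_proj = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
kept_hidden, kept_idx = select_tokens(hidden, q_proj, k_proj, keep=128, num_heads=4)
print(kept_hidden.shape)  # torch.Size([1, 128, 64]); only these would feed later layers
```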


Hi5: 2D Hand Pose Estimation with Zero Human Annotation

arXiv.org Artificial Intelligence

We propose a new large synthetic hand pose estimation dataset, Hi5, and a novel, inexpensive method for collecting high-quality synthetic data that requires no human annotation or validation. Leveraging recent advancements in computer graphics, high-fidelity 3D hand models with diverse genders and skin colors, and dynamic environments and camera movements, our data synthesis pipeline allows precise control over data diversity and representation, ensuring robust and fair model training. Using a single consumer PC, we generate a dataset of 583,000 images with accurate pose annotations that closely represents real-world variability. Pose estimation models trained with Hi5 perform competitively on real-hand benchmarks while surpassing models trained with real data when tested on occlusions and perturbations. Our experiments show promising results for synthetic data as a viable solution to data representation problems in real datasets. Overall, this paper provides a promising new approach to synthetic data creation and annotation that can reduce costs and increase the diversity and quality of data for hand pose estimation.
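
As a purely hypothetical sketch of the kind of per-image randomization such a pipeline might perform, the snippet below samples scene parameters (hand asset, skin tone, camera, lighting, background); the parameter names and ranges are illustrative assumptions, not the Hi5 pipeline or its renderer API.

```python
# Hypothetical sketch of per-image randomization in a synthetic-data pipeline of
# the general kind the Hi5 abstract describes; all names and ranges are assumed.
import random
from dataclasses import dataclass

@dataclass
class RenderConfig:
    hand_model: str        # which 3D hand asset to pose
    skin_tone: int         # index into a palette of skin tones
    camera_azimuth: float
    camera_elevation: float
    light_intensity: float
    background: str

def sample_config(rng: random.Random) -> RenderConfig:
    """Sample one randomized scene; ground-truth joint positions come directly
    from the 3D model, so no human annotation is needed."""
    return RenderConfig(
        hand_model=rng.choice(["female_left", "female_right", "male_left", "male_right"]),
        skin_tone=rng.randrange(10),
        camera_azimuth=rng.uniform(0.0, 360.0),
        camera_elevation=rng.uniform(-30.0, 60.0),
        light_intensity=rng.uniform(0.3, 1.5),
        background=rng.choice(["indoor", "outdoor", "plain"]),
    )

rng = random.Random(0)
print(sample_config(rng))
```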


Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding

arXiv.org Artificial Intelligence

Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models. Parameter inefficiency can however arise when, during transfer learning, all the parameters of a large pre-trained model need to be updated for individual downstream tasks. As the number of parameters grows, fine-tuning is prone to overfitting and catastrophic forgetting. In addition, full fine-tuning can become prohibitively expensive when the model is used for many tasks. To mitigate this issue, parameter-efficient transfer learning algorithms, such as adapters and prefix tuning, have been proposed as a way to introduce a small number of trainable parameters that can be plugged into large pre-trained models such as BERT and HuBERT. In this paper, we introduce the Speech UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning across a variety of speech-processing tasks. Additionally, we introduce a new adapter, ConvAdapter, based on 1D convolution. We show that ConvAdapter outperforms standard adapters while showing comparable performance to prefix tuning and LoRA with only 0.94% of trainable parameters on some of the tasks in SURE. We further explore the effectiveness of parameter-efficient transfer learning for speech synthesis tasks such as Text-to-Speech (TTS).
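
A minimal sketch of an adapter built on 1D convolution, in the spirit of the ConvAdapter named in the abstract: a convolutional down-projection, a nonlinearity, a convolutional up-projection, and a residual connection, inserted into an otherwise frozen backbone. The kernel size, bottleneck width, and placement are assumptions; the actual ConvAdapter architecture may differ.

```python
# Sketch of a 1D-convolution adapter under assumed hyperparameters; only the
# adapter's small number of parameters would be trained, the backbone stays frozen.
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 32, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.down = nn.Conv1d(dim, bottleneck, kernel_size, padding=pad)
        self.act = nn.GELU()
        self.up = nn.Conv1d(bottleneck, dim, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features from a frozen encoder layer
        h = self.up(self.act(self.down(x.transpose(1, 2)))).transpose(1, 2)
        return x + h  # residual keeps the frozen model's behaviour reachable

x = torch.randn(2, 100, 768)
print(ConvAdapter(768)(x).shape)  # torch.Size([2, 100, 768])
```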


Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

arXiv.org Artificial Intelligence

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining three key principles of modality heterogeneity, connections, and interactions that have driven subsequent innovations, and propose a taxonomy of six core technical challenges: representation, alignment, reasoning, generation, transference, and quantification, covering both historical and recent trends. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research identified by our taxonomy.


Relay Variational Inference: A Method for Accelerated Encoderless VI

arXiv.org Machine Learning

Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without the suboptimalities caused by amortization via an encoder (e.g., in the presence of missing or uncertain data). However, in the absence of an encoder, such methods often converge slowly due to the gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representational power, and missing-data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g.
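
For context, here is a minimal sketch of the plain encoderless VI setup the abstract contrasts against: each datapoint owns free variational parameters (a mean and log-variance) updated by gradient descent alongside the decoder, with no encoder network. The abstract does not detail RVI's relay mechanism, so nothing below should be read as the proposed method itself.

```python
# Baseline encoderless VI on a toy Gaussian model; per-datapoint posterior
# parameters are free tensors optimized directly (this is what converges slowly
# and what RVI, per the abstract, is designed to accelerate).
import torch
import torch.nn as nn

n, x_dim, z_dim = 256, 20, 4
x = torch.randn(n, x_dim)                         # toy dataset

decoder = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
mu = nn.Parameter(torch.zeros(n, z_dim))          # per-datapoint posterior means
log_var = nn.Parameter(torch.zeros(n, z_dim))     # per-datapoint posterior log-variances
opt = torch.optim.Adam(list(decoder.parameters()) + [mu, log_var], lr=1e-2)

for step in range(200):
    eps = torch.randn_like(mu)
    z = mu + (0.5 * log_var).exp() * eps          # reparameterization trick
    rec_loss = ((decoder(z) - x) ** 2).sum(-1).mean()   # Gaussian reconstruction term
    kl = 0.5 * (mu**2 + log_var.exp() - 1 - log_var).sum(-1).mean()
    loss = rec_loss + kl                          # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final negative ELBO: {loss.item():.3f}")
```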


Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

arXiv.org Artificial Intelligence

Multimodal sentiment analysis aims to extract and integrate semantic information collected from multiple modalities to recognize the expressed emotions and sentiment in multimodal data. A major concern in this research area is developing an effective fusion scheme that can extract and integrate key information from the various modalities. However, one issue that may have limited previous work is the lack of proper modeling of the dynamics of the competition between independence and relevance among modalities, which can deteriorate fusion outcomes by causing the collapse of modality-specific feature spaces or introducing extra noise. To mitigate this, we propose the Bi-Bimodal Fusion Network (BBFN), a novel end-to-end network that performs fusion (relevance increment) and separation (difference increment) on pairwise modality representations. The two parts are trained simultaneously so that the competition between them is simulated. The model takes two bimodal pairs as input due to the known information imbalance among modalities. In addition, we leverage a gated control mechanism in the Transformer architecture to further improve the final output. Experimental results on three datasets (CMU-MOSI, CMU-MOSEI, and UR-FUNNY) verify that our model significantly outperforms the SOTA. The implementation of this work is available at https://github.com/declare-lab/BBFN.
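
The sketch below illustrates only the general idea of a gated cross-modal attention block of the kind the abstract mentions ("a gated control mechanism in the Transformer architecture"); it is not the BBFN implementation, for which the linked repository is the authoritative source.

```python
# Illustrative gated cross-modal attention: one modality attends to another, and
# a sigmoid gate controls how much cross-modal evidence is mixed in per position.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target attends to source (e.g., text attends to audio in one bimodal pair)
        fused, _ = self.attn(target, source, source)
        g = torch.sigmoid(self.gate(torch.cat([target, fused], dim=-1)))
        return self.norm(target + g * fused)

text = torch.randn(8, 50, 128)    # (batch, text length, dim)
audio = torch.randn(8, 200, 128)  # (batch, audio frames, dim)
print(GatedCrossAttention(128)(text, audio).shape)  # torch.Size([8, 50, 128])
```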


StarNet: Gradient-free Training of Deep Generative Models using Determined System of Linear Equations

arXiv.org Machine Learning

In this paper we present an approach for training deep generative models based solely on solving determined systems of linear equations. A network that uses this approach, called a StarNet, has the following desirable properties: 1) training requires no gradients, as the solution to the system of linear equations is not stochastic; 2) it is highly scalable when solving the system of linear equations with respect to the latent codes, and similarly for the parameters of the model; and 3) it gives desirable least-squares bounds for the estimation of latent codes and network parameters within each layer.
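
As a hedged sketch of the flavor of gradient-free training the abstract describes, the snippet below alternates closed-form least-squares solves for the latent codes and the weights of a single linear layer; the actual StarNet layer structure and solve schedule are not specified in the abstract and may differ.

```python
# Toy alternating least-squares solves for x ~ z @ W: solve for W given z, then
# for z given W, with no gradient steps. Purely illustrative of the idea of
# training by solving linear systems, not the StarNet algorithm itself.
import torch

torch.manual_seed(0)
n, x_dim, z_dim = 500, 30, 10
x = torch.randn(n, x_dim)                      # toy data, rows are samples
z = torch.randn(n, z_dim)                      # initial latent codes
W = torch.randn(z_dim, x_dim)                  # layer weights

for it in range(20):
    # Least-squares solve for W in  z @ W ~ x  (given current z)
    W = torch.linalg.lstsq(z, x).solution
    # Least-squares solve for z in  z @ W ~ x  (given current W), via W^T z^T ~ x^T
    z = torch.linalg.lstsq(W.T, x.T).solution.T
    err = ((z @ W - x) ** 2).mean().item()

print(f"reconstruction MSE after alternating solves: {err:.4f}")
```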


What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets

arXiv.org Machine Learning

Question answering biases in video QA datasets can mislead a multimodal model into overfitting to QA artifacts, jeopardizing the model's ability to generalize. Understanding how strong these QA biases are and where they come from helps the community measure progress more accurately and gives researchers insights for debugging their models. In this paper, we analyze QA biases in popular video question answering datasets and discover that pretrained language models can answer 37-48% of questions correctly without using any multimodal context information, far exceeding the 20% random-guess baseline for 5-choose-1 multiple-choice questions. Our ablation study shows that biases can come from annotators and from the type of questions. Specifically, questions from annotators seen during training are predicted better by the model, and reasoning-oriented, abstract questions incur more bias than factual, direct questions. We also show empirically that using annotator-non-overlapping train-test splits can reduce QA biases for video QA datasets.
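
One simple way to probe this kind of text-only bias (an illustration, not necessarily the paper's exact protocol) is to score each answer candidate with a pretrained language model's likelihood of "question + answer" and pick the highest-scoring option, ignoring the video entirely; accuracy far above the 20% random-guess baseline then signals QA artifacts.

```python
# Text-only multiple-choice baseline: rank candidates by language-model likelihood,
# with no video frames. GPT-2 is used here only as a convenient stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def text_only_guess(question: str, choices: list) -> int:
    scores = []
    for choice in choices:
        ids = tok(f"Q: {question} A: {choice}", return_tensors="pt").input_ids
        with torch.no_grad():
            # lower LM loss == higher likelihood of this question/answer pairing
            loss = lm(ids, labels=ids).loss
        scores.append(-loss.item())
    return max(range(len(choices)), key=scores.__getitem__)

question = "Why did the person laugh?"                      # toy example
choices = ["Because the joke was funny", "Because it rained",
           "Because the car broke", "Because of the exam", "Because of the door"]
print(text_only_guess(question, choices))  # an answer chosen without any video context
```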


M-BERT: Injecting Multimodal Information in the BERT Structure

arXiv.org Machine Learning

Multimodal language analysis is an emerging research area in natural language processing that models language in a multimodal manner. It aims to understand language from the text, visual, and acoustic modalities by modeling both intra-modal and cross-modal interactions. BERT (Bidirectional Encoder Representations from Transformers) provides strong contextual language representations after training on large-scale unlabeled corpora. Fine-tuning the vanilla BERT model has shown promising results in building state-of-the-art models for diverse NLP tasks such as question answering and language inference. However, fine-tuning BERT in the presence of information from other modalities remains an open research problem. In this paper, we inject multimodal information into the input space of the BERT network to model multimodal language. The proposed injection method allows BERT to reach a new state of the art of $84.38\%$ binary accuracy on the CMU-MOSI dataset (multimodal sentiment analysis), a gap of 5.98 percent to the previous state of the art and 1.02 percent to the text-only BERT.
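
The sketch below is a hedged illustration of injecting word-aligned nonverbal features into a language model's input embeddings before the Transformer layers; the feature dimensions, gating form, and exact placement are assumptions, and the actual M-BERT injection may differ in detail.

```python
# Illustrative input-space injection: project concatenated visual/acoustic features
# to the text embedding size and add them through a learned gate. The resulting
# shifted embeddings would then be fed to the (otherwise unchanged) BERT encoder.
import torch
import torch.nn as nn

class MultimodalInjection(nn.Module):
    def __init__(self, text_dim: int = 768, visual_dim: int = 47, acoustic_dim: int = 74):
        super().__init__()  # feature dimensions here are illustrative assumptions
        self.proj = nn.Linear(visual_dim + acoustic_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, word_emb, visual, acoustic):
        # word_emb: (batch, len, text_dim); visual/acoustic are word-aligned features
        nonverbal = self.proj(torch.cat([visual, acoustic], dim=-1))
        g = torch.sigmoid(self.gate(torch.cat([word_emb, nonverbal], dim=-1)))
        return word_emb + g * nonverbal

b, l = 4, 30
out = MultimodalInjection()(torch.randn(b, l, 768), torch.randn(b, l, 47), torch.randn(b, l, 74))
print(out.shape)  # torch.Size([4, 30, 768]); used in place of the plain word embeddings
```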


UR-FUNNY: A Multimodal Language Dataset for Understanding Humor

arXiv.org Machine Learning

Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (vision), and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.