Goto

Collaborating Authors

Gmyr, Robert


i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

arXiv.org Artificial Intelligence

The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence; however, the current Vision-Language-Speech landscape is dominated by encoder-only models that lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is an integrative system that leverages state-of-the-art single-modality encoders, combining their outputs with a new modality-fusing encoder to flexibly project combinations of modalities into a shared representational space. Language tokens are then generated from these representations by an autoregressive decoder. The whole framework is pretrained end-to-end on a large collection of dual- and single-modality datasets using a novel text completion objective that generalizes across arbitrary combinations of modalities. i-Code V2 matches or outperforms state-of-the-art single- and dual-modality baselines on 7 multimodal tasks, demonstrating the power of generative multimodal pretraining across a diversity of tasks and signals.
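To make the architecture concrete, here is a minimal PyTorch sketch of the pipeline the abstract describes: single-modality features projected into a shared space, fused by a Transformer encoder, and decoded autoregressively into text. All module names, dimensions, and encoder choices below are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' code): modality projections -> fusion encoder
# -> autoregressive text decoder, trained with a text-completion objective.
import torch
import torch.nn as nn

class FusionToText(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # Stand-ins for pretrained vision/speech/text encoder outputs
        # (feature sizes here are assumptions).
        self.proj_vision = nn.Linear(512, d_model)
        self.proj_speech = nn.Linear(768, d_model)
        self.proj_text = nn.Linear(768, d_model)
        # Modality-fusing encoder over the concatenated projected sequences.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Autoregressive decoder that cross-attends to the fused sequence.
        dlayer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dlayer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vision, speech, text, tgt_tokens):
        # Project each available modality into the shared space, then fuse.
        fused = self.fusion(torch.cat([
            self.proj_vision(vision),
            self.proj_speech(speech),
            self.proj_text(text),
        ], dim=1))
        # Causal mask so each target token attends only to its prefix.
        T = tgt_tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(self.embed(tgt_tokens), fused, tgt_mask=mask)
        return self.lm_head(h)  # next-token logits for text completion

model = FusionToText()
logits = model(torch.randn(2, 10, 512), torch.randn(2, 20, 768),
               torch.randn(2, 8, 768), torch.randint(0, 1000, (2, 6)))
print(logits.shape)  # (2, 6, 1000)
```

In the same spirit as the abstract, any subset of the modality inputs could be dropped before concatenation, which is what lets one objective generalize across arbitrary modality combinations.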


Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

arXiv.org Artificial Intelligence

Sparsely-gated Mixture of Experts (MoE) layers can scale up network capacity with only a small increase in computational cost. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence Transformer (S2S-T) and Transformer Transducer (T-T). We demonstrate through a set of ASR experiments on multiple language data that MoE networks can reduce relative word error rates by 16.3% and 4.6% with the S2S-T and T-T, respectively. Moreover, we thoroughly investigate the effect of the MoE on the T-T architecture in various conditions: streaming mode, non-streaming mode, the use of language ID, and the label decoder with the MoE.
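A sparsely-gated MoE layer replaces the dense feed-forward block of a Transformer with several expert networks and a learned router that activates only one (or a few) of them per frame, which is why capacity grows while per-frame compute stays nearly flat. Below is a minimal top-1-routed MoE layer in PyTorch; the sizes, the top-1 routing choice, and the gate-probability scaling are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a sparsely-gated MoE feed-forward layer with top-1 routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, time, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        top_p, top_i = scores.max(dim=-1)       # top-1 expert per frame
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e                    # frames routed to expert e
            if sel.any():
                # Scale by the gate probability so routing stays differentiable.
                out[sel] = expert(x[sel]) * top_p[sel].unsqueeze(-1)
        return out

moe = SparseMoE()
y = moe(torch.randn(2, 50, 256))
print(y.shape)  # (2, 50, 256); each frame runs one expert, not all four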


Sequence-level self-learning with multiple hypotheses

arXiv.org Artificial Intelligence

In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, imperfect ASR results make it difficult for unsupervised learning to consistently improve recognition performance, especially when multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework, where the n-th best ASR hypothesis is used as the label of each task. The seq2seq network is updated through the MTL framework so as to find a common representation that can cover multiple hypotheses. By doing so, the effect of hard-decision errors can be alleviated. We first demonstrate the effectiveness of our self-learning methods through ASR experiments on an accent adaptation task between US and British English speech. Our experimental results show that our method reduces the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained on US English data only. Moreover, we investigate the effect of our proposed methods in a federated learning scenario.
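One minimal way to express this n-best MTL objective is to treat each of the top-n hypotheses as the pseudo-label of a separate task and sum the per-hypothesis losses, so the model is pulled toward a representation consistent with all of them rather than committing to a single hard decision. In the sketch below, the uniform task weights, the 4-best list, and the toy model are assumptions for illustration.

```python
# Sketch of a multi-hypothesis self-learning loss (one "task" per hypothesis).
import torch
import torch.nn.functional as F

def nbest_mtl_loss(log_probs_fn, features, nbest_hyps, weights=None):
    """log_probs_fn(features, tokens) -> (T, vocab) log-probabilities.
    nbest_hyps: list of n token tensors (the n-best pseudo-labels)."""
    n = len(nbest_hyps)
    weights = weights if weights is not None else [1.0 / n] * n
    loss = 0.0
    for w, hyp in zip(weights, nbest_hyps):
        logp = log_probs_fn(features, hyp)       # teacher-forced decode
        loss = loss + w * F.nll_loss(logp, hyp)  # per-task cross-entropy
    return loss

# Toy usage with a stand-in model (real use: a seq2seq ASR decoder).
vocab = 30
def toy_model(feats, tokens):
    return F.log_softmax(torch.randn(tokens.size(0), vocab), dim=-1)

hyps = [torch.randint(0, vocab, (12,)) for _ in range(4)]  # 4-best list
print(nbest_mtl_loss(toy_model, torch.randn(100, 80), hyps))
```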


Federated Transfer Learning with Dynamic Gradient Aggregation

arXiv.org Machine Learning

In this paper, a Federated Learning (FL) simulation platform is introduced. The target scenario is Acoustic Model training on this platform. To our knowledge, this is the first attempt to apply FL techniques to Speech Recognition tasks, whose inherent complexity has so far stood in the way. The proposed FL platform can support different tasks thanks to its modular design. As part of the platform, a novel hierarchical optimization scheme and two gradient aggregation methods are proposed, leading to almost an order of magnitude improvement in training convergence speed compared to other distributed or FL training algorithms such as BMUF and FedAvg. Besides the enhanced convergence speed, the hierarchical optimization offers additional flexibility in the training pipeline. On top of the hierarchical optimization, a dynamic gradient aggregation algorithm is proposed, based on data-driven weight inference. This aggregation algorithm acts as a regularizer of gradient quality. Finally, an unsupervised training pipeline tailored to FL is presented as a separate training scenario. The proposed system is validated on two tasks: first, the LibriSpeech task, showing a 7x speed-up and a 6% Word Error Rate reduction (WERR) compared to the baseline; second, a session adaptation task, providing a 20% WERR improvement over a competitive production-ready LAS model. The proposed Federated Learning system is shown to outperform the gold standard of distributed training in both convergence speed and overall model performance.
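One simple reading of data-driven weight inference is that each client's contribution to the aggregated gradient is weighted by a quality signal such as its local loss, so noisy or off-distribution updates are down-weighted, which is how the aggregation acts as a regularizer. The sketch below combines flat client gradients with softmax weights over negative losses; this particular weighting rule and its temperature parameter are assumptions for illustration, not the paper's exact inference model.

```python
# Sketch of dynamic (data-driven) gradient aggregation for federated learning.
import torch

def dynamic_aggregate(client_grads, client_losses, temperature=1.0):
    """client_grads: list of flat gradient tensors, one per client.
    client_losses: tensor of per-client losses used to infer weights."""
    # Lower loss -> larger weight; softmax normalizes across clients.
    # (Contrast with FedAvg, which uses fixed data-size weights.)
    weights = torch.softmax(-client_losses / temperature, dim=0)
    stacked = torch.stack(client_grads)             # (num_clients, num_params)
    return (weights.unsqueeze(1) * stacked).sum(0)  # weighted average gradient

grads = [torch.randn(1000) for _ in range(8)]   # 8 simulated clients
losses = torch.rand(8)
agg = dynamic_aggregate(grads, losses)
print(agg.shape)  # torch.Size([1000]); fed to the server-side optimizer
```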