AITopics | vsr system

Collaborating Authors

vsr system

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

Wang, He, Guo, Pengcheng, Chen, Wei, Zhou, Pan, Xie, Lei

arXiv.org Artificial IntelligenceJan-7-2024

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce multi-scale video data. Besides, various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves 34.76% CER for the Single-Speaker Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion, ranking first place in all three tracks we participate.

encoder, speech recognition, vsr task, (12 more...)

arXiv.org Artificial Intelligence

2401.06788

Country: Asia > China (0.05)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish

Gimeno-Gómez, David, Martínez-Hinarejos, Carlos-D.

arXiv.org Artificial IntelligenceNov-21-2023

Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIP-RTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.

recognition, speech recognition, vsr system, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/IberSPEECH.2022-9

2311.1248

Country: Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey

Nemani, Praneeth, Krishna, G. Sai, Kundrapu, Supriya

arXiv.org Artificial IntelligenceJun-14-2023

Speaker-independent VSR is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Over the years, there has been a considerable amount of research in the field of VSR involving different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers the works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. This survey provides an in-depth analysis of speaker-independent VSR systems evolution from 1990 to 2023. It outlines the development of VSR systems over time and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, This comprehensive review provides insights into the current state-of-the-art speaker-independent VSR and highlights potential areas for future research.

dataset, recognition, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.imavis.2023.104787

2306.08314

Country:

Europe > Austria > Vienna (0.14)
Asia > India (0.04)
South America (0.04)
(11 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (0.92)
Media (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
(8 more...)

Add feedback

Real-Time Super-Resolution System of 4K-Video Based on Deep Learning

Cao, Yanpeng, Wang, Chengcheng, Song, Changjun, Tang, Yongming, Li, He

arXiv.org Artificial IntelligenceJul-14-2021

Video super-resolution (VSR) technology excels in reconstructing low-quality video, avoiding unpleasant blur effect caused by interpolation-based algorithms. However, vast computation complexity and memory occupation hampers the edge of deplorability and the runtime inference in real-life applications, especially for large-scale VSR task. This paper explores the possibility of real-time VSR system and designs an efficient and generic VSR network, termed EGVSR. The proposed EGVSR is based on spatio-temporal adversarial learning for temporal coherence. In order to pursue faster VSR processing ability up to 4K resolution, this paper tries to choose lightweight network structure and efficient upsampling method to reduce the computation required by EGVSR network under the guarantee of high visual quality. Besides, we implement the batch normalization computation fusion, convolutional acceleration algorithm and other neural network acceleration techniques on the actual hardware platform to optimize the inference process of EGVSR network. Finally, our EGVSR achieves the real-time processing capacity of 4K@29.61FPS. Compared with TecoGAN, the most advanced VSR network at present, we achieve 85.04% reduction of computation density and 7.92x performance speedups. In terms of visual quality, the proposed EGVSR tops the list of most metrics (such as LPIPS, tOF, tLP, etc.) on the public test dataset Vid4 and surpasses other state-of-the-art methods in overall performance score. The source code of this project can be found on https://github.com/Thmen/EGVSR.

computation, egvsr network, vsr network, (14 more...)

arXiv.org Artificial Intelligence

2107.05307

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.84)

Industry: Semiconductors & Electronics (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.84)

Add feedback