 stt model


STT: Stateful Tracking with Transformers for Autonomous Driving

Jing, Longlong, Yu, Ruichi, Chen, Xu, Zhao, Zhengli, Sheng, Shiwei, Graber, Colin, Chen, Qi, Li, Qinru, Wu, Shangxuan, Deng, Han, Lee, Sangjin, Sweeney, Chris, He, Qiurui, Hung, Wei-Chih, He, Tong, Zhou, Xingyi, Moussavi, Farshid, Guo, Zijian, Zhou, Yin, Tan, Mingxing, Yang, Weilong, Li, Congcong

arXiv.org Artificial Intelligence

Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their current states, such as velocity and acceleration. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through a long-term history of detections and is jointly optimized for both data association and state estimation tasks. Since standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks across the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.


Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Huh, Jaeyoung, Park, Sangjoon, Lee, Jeong Eun, Ye, Jong Chul

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) is a technology that converts spoken words into text, facilitating interaction between humans and machines. One of the most common applications of ASR is Speech-To-Text (STT) technology, which simplifies user workflows by transcribing spoken words into text. In the medical field, STT has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings. However, developing an STT model for the medical domain is challenging due to the lack of sufficient speech and text datasets. To address this issue, we propose a medical-domain text correction method that modifies the output text of a general STT system using the Vision Language Pre-training (VLP) method. VLP combines textual and visual information to correct text based on image knowledge. Our extensive experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field. We further show that multi-modal understanding of image and text information outperforms single-modal understanding using only text information.


De-STT: De-entaglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting

Yadav, Hemant, Singh, Janvijay, Singh, Atul Anshuman, Mittal, Rachit, Shah, Rajiv Ratn

arXiv.org Artificial Intelligence

Training a robust Speech to Text (STT) system requires "tens of thousands" of hours of data. Variability in the dataset, in the form of unwanted nuisances (noise) and biases (accent, gender, or age), is the reason large datasets are needed to learn general representations, which is infeasible for low-resource languages. A recently proposed deep learning approach for removing these unwanted features, called adversarial forgetting, produced better results on computer vision tasks. Motivated by this, in this paper we study the effect of de-entangling the accent information from the input speech signal on training STT systems. To this end, we use an information bottleneck architecture based on adversarial forgetting. This training scheme aims to force the model to learn general accent-invariant speech representations. The trained STT model is tested on two unseen accents in Common Voice V1. The results favour the STT model trained using the adversarial forgetting scheme.