AITopics | stt model

STT: Stateful Tracking with Transformers for Autonomous Driving

Jing, Longlong, Yu, Ruichi, Chen, Xu, Zhao, Zhengli, Sheng, Shiwei, Graber, Colin, Chen, Qi, Li, Qinru, Wu, Shangxuan, Deng, Han, Lee, Sangjin, Sweeney, Chris, He, Qiurui, Hung, Wei-Chih, He, Tong, Zhou, Xingyi, Moussavi, Farshid, Guo, Zijian, Zhou, Yin, Tan, Mingxing, Yang, Weilong, Li, Congcong

arXiv.org Artificial Intelligence, Apr-30-2024

Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their current states, such as velocity and acceleration. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in a scene while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through a long-term history of detections and is jointly optimized for both data association and state estimation tasks. Since standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks across the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.
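
The abstract above refers to the standard MOTA and MOTP tracking metrics. As a rough illustration of what those baseline metrics measure, the sketch below uses the common CLEAR MOT definitions: MOTA is one minus the ratio of misses, false positives, and identity switches to ground-truth objects, and MOTP is the mean localization error over matched pairs. It does not implement the paper's S-MOTA or MOTPS extensions, whose definitions are not given in the abstract. The `FrameStats` container, its field names, and the sample numbers are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FrameStats:
    """Per-frame tracking counts (hypothetical container for this sketch)."""
    num_gt: int            # ground-truth objects present in the frame
    misses: int            # ground-truth objects with no matched track
    false_positives: int   # tracks with no matched ground-truth object
    id_switches: int       # matched objects whose track ID changed
    match_errors: List[float] = field(default_factory=list)  # error per matched pair


def mota(frames: List[FrameStats]) -> float:
    """CLEAR MOT accuracy: 1 - (misses + false positives + ID switches) / GT."""
    total_gt = sum(f.num_gt for f in frames)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(f.misses + f.false_positives + f.id_switches for f in frames)
    return 1.0 - errors / total_gt


def motp(frames: List[FrameStats]) -> float:
    """CLEAR MOT precision: mean localization error over all matched pairs."""
    errors = [e for f in frames for e in f.match_errors]
    if not errors:
        raise ValueError("MOTP is undefined without matched pairs")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up counts for two frames, for illustration only.
    frames = [
        FrameStats(num_gt=10, misses=1, false_positives=1, id_switches=0,
                   match_errors=[0.2, 0.4]),
        FrameStats(num_gt=10, misses=0, false_positives=1, id_switches=1,
                   match_errors=[0.3]),
    ]
    print(round(mota(frames), 3))  # 0.8
    print(round(motp(frames), 3))  # 0.3
```

Neither quantity says anything about how well velocity or acceleration is estimated, which is the gap the abstract says the extended metrics are meant to cover.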

detection, state estimation, stt, (13 more...)

arXiv:2405.00236

Genre: Research Report (0.82)

Industry:

Transportation > Ground > Road (0.61)
Information Technology > Robotics & Automation (0.61)
Automobiles & Trucks (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.85)
Information Technology > Artificial Intelligence > Vision (0.68)

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Huh, Jaeyoung, Park, Sangjoon, Lee, Jeong Eun, Ye, Jong Chul

arXiv.org Artificial Intelligence, Feb-27-2023

Automatic Speech Recognition (ASR) is a technology that converts spoken words into text, facilitating interaction between humans and machines. One of the most common applications of ASR is Speech-To-Text (STT) technology, which simplifies user workflows by transcribing spoken words into text. In the medical field, STT has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings. However, developing an STT model for the medical domain is challenging due to the lack of sufficient speech and text datasets. To address this issue, we propose a medical-domain text correction method that modifies the output text of a general STT system using the Vision Language Pre-training (VLP) method. VLP combines textual and visual information to correct text based on image knowledge. Our extensive experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field. We further show that multi-modal understanding of image and text information outperforms single-modal understanding using only text information.
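
The abstract above reports quantitative improvements in STT performance but does not name the metric used. Word error rate (WER) is a common way to score a transcript against a reference, so the sketch below assumes it purely for illustration: it counts the minimum number of word substitutions, insertions, and deletions needed to turn the reference into the hypothesis, divided by the reference length. The function name and the sample sentences are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("WER is undefined for an empty reference")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences: one substituted word out of four reference words.
    print(word_error_rate("the patient has pneumonia", "the patient has ammonia"))  # 0.25
```

A text correction step of the kind the abstract describes would be judged by whether the corrected transcript scores lower than the raw output of the general STT system under such a measure.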

artificial intelligence, machine learning, natural language, (17 more...)

arXiv:2303.00091

Country:

Asia > South Korea > Daejeon > Daejeon (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Latvia > Riga Municipality > Riga (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.94)

Industry:

Information Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

De-STT: De-entaglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting

Yadav, Hemant, Singh, Janvijay, Singh, Atul Anshuman, Mittal, Rachit, Shah, Rajiv Ratn

arXiv.org Artificial Intelligence, Dec-1-2020

Training a robust Speech to Text (STT) system requires "tens of thousands" of hours of data. Variability in the dataset, in the form of unwanted nuisances (noise) and biases (accent, gender, or age), is why such large datasets are needed to learn general representations, and collecting them is infeasible for low-resource languages. A recently proposed deep learning approach for removing these unwanted features, called adversarial forgetting, produced better results on computer vision tasks. Motivated by this, in this paper we study the effect of de-entangling the accent information from the input speech signal on training STT systems. To this end, we use an information bottleneck architecture based on adversarial forgetting. This training scheme aims to force the model to learn general, accent-invariant speech representations. The trained STT model is tested on two unseen accents in Common Voice V1. The results favour the STT model trained using the adversarial forgetting scheme.

representation, speech recognition, stt model, (13 more...)

arXiv:2011.12979

Country:

Oceania > New Zealand (0.04)
North America > United States (0.04)
Asia > India > NCT > Delhi (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
