speech-to-text model
Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services
Jin, Liuyi; Haroon, Amran; Stoleru, Radu; Gunawardena, Pasan; Middleton, Michael; Kim, Jeeeun
Timely and accurate pre-arrival video streaming and analytics are critical for emergency medical services (EMS) to deliver life-saving interventions. Yet, current-generation EMS infrastructure remains constrained by one-to-one video streaming and limited analytics capabilities, leaving dispatchers and EMTs to manually interpret overwhelming, often noisy or redundant information in high-stress environments. We present TeleEMS, a mobile live video analytics system that enables pre-arrival multimodal inference by fusing audio and video into a unified decision-making pipeline before EMTs arrive on scene. TeleEMS comprises two key components: TeleEMS Client and TeleEMS Server. The TeleEMS Client runs across phones, smart glasses, and desktops to support bystanders, EMTs en route, and 911 dispatchers. The TeleEMS Server, deployed at the edge, integrates EMS-Stream, a communication backbone that enables smooth multi-party video streaming. On top of EMS-Stream, the server hosts three real-time analytics modules: (1) audio-to-symptom analytics via EMSLlama, a domain-specialized LLM for robust symptom extraction and normalization; (2) video-to-vital analytics using state-of-the-art rPPG methods for heart rate estimation; and (3) joint text-vital analytics via PreNet, a multimodal multitask model predicting EMS protocols, medication types, medication quantities, and procedures. Evaluation shows that EMSLlama outperforms GPT-4o (exact-match 0.89 vs. 0.57) and that text-vital fusion improves inference robustness, enabling reliable pre-arrival intervention recommendations. TeleEMS demonstrates the potential of mobile live video analytics to transform EMS operations, bridging the gap between bystanders, dispatchers, and EMTs, and paving the way for next-generation intelligent EMS infrastructure.
Multi-Purpose NLP Chatbot : Design, Methodology & Conclusion
Aggarwal, Shivom; Mehra, Shourya; Mitra, Pritha
With a major focus on its history, difficulties, and promise, this research paper provides a thorough analysis of the chatbot technology environment as it exists today. It presents a highly flexible chatbot system that makes use of reinforcement learning strategies to improve user interactions and conversational experiences. Additionally, this system makes use of sentiment analysis and natural language processing to determine user moods. The chatbot is a valuable tool across many fields thanks to its notable characteristics, which include voice-to-voice conversation, multilingual support [12], advising skills, offline functioning, and quick help features. The complexity of chatbot technology development is also explored in this study, along with the causes that have propelled these developments and their far-reaching effects on a range of sectors. The study highlights three crucial elements: 1) Even without explicit profile information, the chatbot system is built to adeptly understand unique consumer preferences and fluctuating satisfaction levels. With this capacity, user interactions are tailored to meet users' wants and preferences. 2) Using a complex method that interlaces multiview voice chat information, the chatbot can precisely simulate users' actual experiences. This aids in developing more genuine and engaging conversations. 3) The study presents an original method for improving the predictive capacity of black-box deep learning models. This improvement is achieved by introducing theory-driven dynamic satisfaction measurements, which lead to more precise forecasts of consumer reaction.
Build a custom speech-to-text model with speaker diarization capabilities
In this code pattern, learn how to train a custom language and acoustic speech-to-text model to transcribe audio files and get speaker-diarized output when given a corpus file and audio recordings of a meeting or classroom. One feature of the IBM Watson Speech to Text service is the capability to detect different speakers in an audio file, also known as speaker diarization. This code pattern shows this capability by training a custom language model with a corpus text file, which trains the model on 'Out of Vocabulary' words, as well as a custom acoustic model with the audio files, which trains the model for 'Accent' detection, in a Python Flask runtime. Get detailed instructions in the README file. This code pattern is part of the Extracting insights from videos with IBM Watson use case series, which showcases the solution on extracting meaningful insights from videos using Watson Speech to Text, Watson Natural Language Processing, and Watson Tone Analyzer services.
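As a rough illustration of what speaker-diarized output involves, the sketch below post-processes a Watson Speech to Text response that was requested with `speaker_labels=True`, merging word-level timestamps with speaker labels into per-speaker utterances. The sample response dict is hand-written to match the documented response shape; it is not real service output, and the helper is an assumption about one reasonable way to group the words, not part of the code pattern itself.

```python
# Minimal sketch: turn a diarized Speech to Text response into per-speaker
# utterances. The sample response below is hand-crafted to mirror the
# documented shape (results[].alternatives[].timestamps and speaker_labels).

def group_by_speaker(response):
    """Merge word timestamps with speaker labels into (speaker, text) pairs."""
    # Flatten word-level timestamps: entries look like ["word", start, end].
    words = []
    for result in response.get("results", []):
        words.extend(result["alternatives"][0].get("timestamps", []))
    # Map each word's start time to a speaker id.
    speaker_at = {round(lbl["from"], 2): lbl["speaker"]
                  for lbl in response.get("speaker_labels", [])}
    utterances = []
    current_speaker, current_words = None, []
    for word, start, _end in words:
        speaker = speaker_at.get(round(start, 2), current_speaker)
        if speaker != current_speaker and current_words:
            utterances.append((current_speaker, " ".join(current_words)))
            current_words = []
        current_speaker = speaker
        current_words.append(word)
    if current_words:
        utterances.append((current_speaker, " ".join(current_words)))
    return utterances

sample = {
    "results": [{"alternatives": [{"transcript": "hello there hi",
                                   "timestamps": [["hello", 0.0, 0.4],
                                                  ["there", 0.4, 0.8],
                                                  ["hi", 1.2, 1.5]]}]}],
    "speaker_labels": [{"from": 0.0, "to": 0.4, "speaker": 0},
                       {"from": 0.4, "to": 0.8, "speaker": 0},
                       {"from": 1.2, "to": 1.5, "speaker": 1}],
}

for speaker, text in group_by_speaker(sample):
    print(f"Speaker {speaker}: {text}")
# Speaker 0: hello there
# Speaker 1: hi
```

In the real code pattern, the response dict would come from the service's recognize call with `speaker_labels=True`; everything after that point is ordinary dictionary processing like the above.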
Cost-effective speech-to-text with weakly- and semi-supervised training
Voice assistants equipped with speech-to-text technology have seen a major boost in performance and usage, thanks to powerful new machine learning methods based on deep neural networks. These methods follow a supervised learning approach, requiring large amounts of paired speech-text data to train the best-performing speech-to-text transcription models. After collecting large amounts of relevant and diverse spoken utterances, the complex and labour-intensive task of annotating and labelling the collected speech data awaits. To get a feel for a typical scenario, let's look at some estimates. On average, a typical user query, for example "Do you have the Christmas edition with Santa?", would last for about 3 seconds.
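The 3-second figure above gives a quick back-of-the-envelope sense of the annotation workload. The sketch below only uses that number; the 100-hour corpus size is an assumed target chosen purely for illustration, not a figure from the article.

```python
# Back-of-the-envelope estimate: if a typical query lasts about 3 seconds,
# how many utterances make up one hour of speech, and how many would a
# (hypothetical) 100-hour supervised training corpus require?

AVG_QUERY_SECONDS = 3      # from the example query above
corpus_hours = 100         # assumed target size, purely illustrative

utterances_per_hour = 3600 // AVG_QUERY_SECONDS
utterances_needed = corpus_hours * utterances_per_hour

print(utterances_per_hour)  # 1200 utterances per hour of audio
print(utterances_needed)    # 120000 utterances to collect and label
```

Every one of those utterances would need to be transcribed by hand under a fully supervised approach, which is what motivates the weakly- and semi-supervised training discussed here.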
Learn to Build your First Speech-to-Text Model in Python
This will sound familiar to anyone who has owned a smartphone in the last decade. I can't remember the last time I took the time to type out the entire query on Google Search. I simply ask the question – and Google lays out the entire weather pattern for me. It saves me a ton of time and I can quickly glance at my screen and get back to work. But how does Google understand what I'm saying?