dictation
Understanding Cross Task Generalization in Handwriting-Based Alzheimer's Screening via Vision Language Adaptation
Gong, Changqing, Qin, Huafeng, El-Yacoubi, Mounim A.
Alzheimer's disease (AD) is a prevalent neurodegenerative disorder for which early detection is critical. Handwriting, often disrupted in prodromal AD, provides a non-invasive and cost-effective window into subtle motor and cognitive decline. Existing handwriting-based AD studies, mostly relying on online trajectories and hand-crafted features, have not systematically examined how task type influences diagnostic performance and cross-task generalization. Meanwhile, large-scale vision-language models have demonstrated remarkable zero- and few-shot anomaly detection in natural images and strong adaptability across medical modalities such as chest X-ray and brain MRI. However, handwriting-based disease detection remains largely unexplored within this paradigm. To close this gap, we introduce a lightweight Cross-Layer Fusion Adapter (CLFA) framework that repurposes CLIP for handwriting-based AD screening. CLFA implants multi-level fusion adapters within the visual encoder to progressively align representations toward handwriting-specific medical cues, enabling prompt-free and efficient zero-shot inference. Using this framework, we systematically investigate cross-task generalization, training on a specific handwriting task and evaluating on unseen ones, to reveal which task types and writing patterns most effectively discriminate AD. Extensive analyses further highlight characteristic stroke patterns and task-level factors that contribute to early AD identification, offering both diagnostic insights and a benchmark for handwriting-based cognitive assessment.
- Asia > China > Chongqing Province > Chongqing (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > France (0.04)
- Asia > South Korea (0.04)
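The abstract describes implanting adapters into a frozen CLIP visual encoder but gives no code. As a minimal sketch of the kind of residual bottleneck adapter typically used for this (the tiny width, rank, and zero-initialization are illustrative assumptions, not the authors' CLFA design):

```python
import random

random.seed(0)
D, R = 8, 2  # hypothetical feature width and bottleneck rank (tiny for clarity)

# Down-projection is randomly initialized; up-projection starts at zero so the
# adapter is exactly the identity map before any training.
W_down = [[random.gauss(0, 0.02) for _ in range(R)] for _ in range(D)]
W_up = [[0.0] * D for _ in range(R)]

def matvec(x, W):
    """Multiply row vector x by matrix W (lists of lists)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def bottleneck_adapter(x):
    """Down-project, ReLU, up-project, then add the residual so the frozen
    encoder's features pass through unchanged when the adapter is untrained."""
    h = [max(v, 0.0) for v in matvec(x, W_down)]
    out = matvec(h, W_up)
    return [xi + oi for xi, oi in zip(x, out)]

x = [random.gauss(0, 1) for _ in range(D)]
assert bottleneck_adapter(x) == x  # identity at initialization
```

In a multi-level ("cross-layer") setup, one such adapter would sit after each of several encoder blocks, with only the adapter weights trained.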
Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
Corbeil, Jean-Philippe, Abacha, Asma Ben, Michalopoulos, George, Swazinna, Phillip, Del-Agua, Miguel, Tremblay, Jerome, Daniel, Akila Jeeson, Bader, Cari, Cho, Yu-Cheng, Krishnan, Pooja, Bodenstab, Nathan, Lin, Thomas, Teng, Wenxuan, Beaulieu, Francois, Vozila, Paul
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks, structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations, remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (2 more...)
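The abstract describes the extraction task only at a high level. A toy rule-based stand-in for turning a nurse dictation into a tabular row might look as follows (the field names and patterns are invented for illustration and are not the SYNUR schema, which in practice would be filled by an LLM rather than regexes):

```python
import re

# Hypothetical flowsheet fields; real systems extract far richer observations.
PATTERNS = {
    "temperature_f": r"temp(?:erature)?\s+(?:is\s+)?(\d+(?:\.\d+)?)",
    "heart_rate": r"(?:heart rate|pulse)\s+(?:is\s+)?(\d+)",
    "bp": r"(?:blood pressure|bp)\s+(?:is\s+)?(\d+ over \d+)",
}

def extract_observations(dictation):
    """Scan a free-text dictation and return whichever fields were mentioned."""
    text = dictation.lower()
    row = {}
    for field, pat in PATTERNS.items():
        m = re.search(pat, text)
        if m:
            row[field] = m.group(1)
    return row

row = extract_observations(
    "Patient resting. Temperature is 99.1, pulse 88, blood pressure 120 over 80.")
print(row)
```

The appeal of an LLM over such rules is robustness to paraphrase ("temp's up around ninety-nine") that brittle patterns miss.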
Automatic Speech Recognition for Greek Medical Dictation
Georgilas, Vardis, Stafylakis, Themos
Medical dictation systems are essential tools in modern healthcare, enabling accurate and efficient conversion of speech into written medical documentation. The main objective of this paper is to create a domain-specific system for Greek medical speech transcription. The ultimate goal is to assist healthcare professionals by reducing the overload of manual documentation and improving workflow efficiency. Towards this goal, we develop a system that combines automatic speech recognition techniques with a text correction model, allowing better handling of domain-specific terminology and linguistic variations in Greek. Our approach leverages both acoustic and textual modeling to create more realistic and reliable transcriptions. We focus on adapting existing language and speech technologies to the Greek medical context, addressing challenges such as complex medical terminology and linguistic inconsistencies. Through domain-specific fine-tuning, our system achieves more accurate and coherent transcriptions, contributing to the development of practical language technologies for the Greek healthcare sector.
- Europe > Greece (0.05)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
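The paper's correction model is learned, but the basic idea of snapping noisy ASR tokens to an in-domain lexicon can be sketched with the standard library's fuzzy matcher (`difflib.get_close_matches` is a real stdlib API; the three-term lexicon and the cutoff are illustrative assumptions):

```python
import difflib

# Hypothetical mini-lexicon of Greek medical terms.
MEDICAL_TERMS = ["γαστρίτιδα", "υπέρταση", "ακτινογραφία"]

def correct(token, lexicon=MEDICAL_TERMS, cutoff=0.75):
    """Snap an ASR token to the closest in-domain term, if one is close enough."""
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

def correct_transcript(text):
    return " ".join(correct(t) for t in text.split())

# A misrecognized final vowel gets repaired; ordinary words pass through.
print(correct_transcript("ο ασθενής έχει υπέρτασι"))
```

A trained correction model additionally uses sentence context, which pure string similarity cannot.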
Scalable Offline ASR for Command-Style Dictation in Courtrooms
Nethil, Kumarmanas, Mishra, Vaibhav, Anandan, Kriti, Manohar, Kavya
We propose an open-source framework for command-style dictation that addresses the gap between resource-intensive online systems and high-latency batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and transcribes these segments in parallel using Whisper models, enabling efficient multiplexing across audio streams. Unlike proprietary systems such as SuperWhisper, this framework is also compatible with most ASR architectures, including widely used CTC-based models. Our multiplexing technique maximizes compute utilization in real-world settings, as demonstrated by its deployment in around 15% of India's courtrooms. Evaluations on live data show consistent latency reduction as user concurrency increases, compared to sequential batch processing. The live demonstration will showcase our open-sourced implementation and allow attendees to interact with it in real time.
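The pipeline described in the abstract (VAD segmentation, then parallel transcription of the segments) can be sketched as below. The energy-based VAD and the stubbed `transcribe` call are placeholders for a real VAD and a Whisper or CTC model, and all thresholds are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def vad_segments(audio, frame_len=160, threshold=0.5):
    """Toy energy-based VAD: return (start, end) sample ranges above threshold."""
    segments, start = [], None
    for i in range(0, len(audio), frame_len):
        frame = audio[i:i + frame_len]
        active = (sum(abs(s) for s in frame) / max(len(frame), 1)) > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(audio)))
    return segments

def transcribe(segment):
    # Placeholder for an ASR call (e.g. a Whisper model) on one segment.
    return f"<{len(segment)} samples>"

def transcribe_audio(audio, workers=4):
    """Segment with VAD, then transcribe segments concurrently and rejoin."""
    segs = vad_segments(audio)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda se: transcribe(audio[se[0]:se[1]]), segs))
    return " ".join(texts)

# silence, speech, silence, speech
audio = [0.0] * 320 + [1.0] * 320 + [0.0] * 320 + [1.0] * 160
print(transcribe_audio(audio))  # two speech segments transcribed in parallel
```

Because each segment is an independent unit of work, segments from many users' audio can share one worker pool, which is the multiplexing idea the abstract refers to.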
StepWrite: Adaptive Planning for Speech-Driven Text Generation
Alaoui, Hamza El, Taheri, Atieh, Peng, Yi-Hao, Bigham, Jeffrey P.
People frequently use speech-to-text systems to compose short texts with voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions, capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven, voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually aware, non-visual audio prompts. StepWrite reduces cognitive load by offloading the context-tracking and adaptive-planning tasks to the models. Unlike baseline methods such as standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load and improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact-checking. This work highlights the potential of structured, context-aware voice interactions for enhancing hands-free and eyes-free communication in everyday multitasking scenarios.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > South Korea > Busan > Busan (0.05)
- (4 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (0.93)
- Research Report > Experimental Study > Negative Result (0.45)
- Information Technology (1.00)
- Leisure & Entertainment (0.92)
- Health & Medicine > Consumer Health (0.46)
- (2 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- (2 more...)
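The subtask-decomposition loop the StepWrite abstract describes can be sketched as follows, with the LLM planner and the speech input/output replaced by stubs (the function names, prompts, and fixed three-step plan are invented for illustration; the real system plans adaptively):

```python
def plan_subtasks(goal):
    """Placeholder planner: a real system would ask an LLM to decompose the
    writing goal into subtasks and replan as the draft evolves."""
    return ["state the purpose", "add key details", "close with next steps"]

def step_write(goal, answer_fn):
    """Walk the user through one audio prompt per subtask, tracking context.

    answer_fn stands in for 'speak the prompt, record and transcribe a reply'.
    """
    context = {"goal": goal, "draft": []}
    for step in plan_subtasks(goal):
        prompt = f"Next, {step}. So far: {' '.join(context['draft']) or '(empty)'}"
        context["draft"].append(answer_fn(prompt))  # spoken reply, transcribed
    return " ".join(context["draft"])

draft = step_write("email my landlord about a leak",
                   lambda p: f"[reply to: {p.split('.')[0]}]")
print(draft)
```

The key design point is that the system, not the user, carries the draft-so-far into each prompt, which is what makes eyes-free composition tractable.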
Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-primary Speakers
Arumugam, Guru Prakash, Chang, Shuo-yiin, Sainath, Tara N., Prabhavalkar, Rohit, Wang, Quan, Bijwadia, Shaan
ASR models often suffer from a long-form deletion problem, where the model predicts sequential blanks instead of words when transcribing a lengthy audio (on the order of minutes or hours). From the perspective of a user or downstream system consuming the ASR results, this behavior can be perceived as the model "being stuck", and can make the product hard to use. One of the culprits for long-form deletion is training-test data mismatch, which can happen even when the model is trained on diverse and large-scale data collected from multiple application domains. In this work, we introduce a novel technique to simultaneously model different groups of speakers in the audio along with the standard transcript tokens. Speakers are grouped as primary and non-primary, which connects the application domains and significantly alleviates the long-form deletion problem. This improved model neither needs additional training data nor incurs additional training or inference cost.
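The abstract does not specify the exact token scheme, but one plausible reading of "modeling speaker groups along with the standard transcript tokens" is interleaving role tokens into the target sequence, roughly like this (the tag names are a guess for illustration, not the paper's format):

```python
def add_speaker_tags(segments):
    """Build a training target that interleaves speaker-group tags with words.

    segments: list of (role, words) pairs, role in {'primary', 'nonprimary'}.
    """
    out = []
    for role, words in segments:
        out.append("<primary>" if role == "primary" else "<nonprimary>")
        out.extend(words.split())
    return out

tokens = add_speaker_tags([("primary", "set a timer"),
                           ("nonprimary", "dinner is ready"),
                           ("primary", "for ten minutes")])
print(tokens)
```

Training on such targets forces the model to account for non-primary speech explicitly instead of collapsing into blanks when it appears.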
Toward Interactive Dictation
Li, Belinda Z., Eisner, Jason, Pauls, Adam, Thomson, Sam
Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset, TERTiUS, to experiment with such systems. To support this flexibility in real-time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text, or alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 30% end-state accuracy with 1.3 seconds of latency, while a larger model achieves 55% end-state accuracy with 7 seconds of latency.
- Asia > Singapore (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (9 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
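The incremental segment-and-classify loop that the TERTiUS abstract describes can be sketched as below; the keyword classifier is a toy stand-in for the pretrained language models the paper actually uses, and the trigger list is an assumption:

```python
def classify_span(span):
    """Toy classifier: a real system would use a pretrained LM, not keywords."""
    triggers = ("delete", "replace", "scratch", "change")
    return "command" if any(w in span.lower() for w in triggers) else "dictation"

def segment_stream(words):
    """Greedy incremental segmentation: flush a span when its label changes."""
    spans, current, label = [], [], None
    for w in words:
        new_label = classify_span(w)
        if label is None or new_label == label:
            current.append(w)
            label = new_label
        else:
            spans.append((label, " ".join(current)))
            current, label = [w], new_label
    if current:
        spans.append((label, " ".join(current)))
    return spans

print(segment_stream("meet me at noon delete that say three".split()))
```

The paper's accuracy/latency trade-off lives in `classify_span`: a larger model labels spans more accurately but each incremental decision takes longer.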
Microsoft brings AI-powered voice commands to Dictate in OneNote
Tech giant Microsoft is rolling out a new Dictate feature to OneNote that supports AI-powered voice commands to control dictation, such as deleting a word, formatting text, or undoing a recent step, reports Windows Central. The company said it plans to add more voice commands to Dictate over the coming months. "Now it is easy to break away from the keyboard and stay in the flow by using Dictate with AI-backed voice commands to add, format, edit, and organise your text," said Sofia Thomas, Product Manager on Microsoft's Office Voice Team. "Over the next few months, we will be adding new voice commands as well as some that are already available in other Office apps to OneNote," the company added. Dictate works with over 50 languages and provides an alternative way to input text within OneNote.
Machine learning, AI can help ease the trend of physician burnout
Photo: Dr. Steven Waldren, vice president and chief informatics officer at the American Academy of Family Physicians (right), and Dr. Kamel Sadek, director of informatics at Village Medical, speak at the HIMSS22 conference in Orlando.

ORLANDO, Fla. – Even before COVID-19 made the business of healthcare a nightmare for countless physicians and clinicians, burnout was a prevalent issue. And even the slow, still-ongoing emergence into normalcy hasn't been enough to ease this trend: clerical burdens, including clinical documentation, are a major contributor. But for primary care physicians in particular, a new class of technology, including AI-powered digital assistants, is improving their capacity and capability while reducing their administrative and cognitive burden. Dr. Waldren cited data showing that the average patient visit to a PCP takes about 18 minutes, and of that time, 27% is dedicated to face-to-face time with a patient.
New Windows 11 build tests Voice Access, Spotlight backgrounds
Microsoft issued a meaty Windows Insider build on Wednesday for the Dev Channel, testing one substantial improvement, Voice Access, along with a couple of personalization improvements that should be welcomed by Windows users. Technically, the new features offered in Build 22518 of the Dev Channel for Windows 11 are new, untested code, which might not even make it to the stable channel. Still, there's a good chance that at least Voice Access will come to market, as it leans on Microsoft's accessibility strengths. Microsoft's new build has also added a "Spotlight" feature that will provide fresh, updated desktop backgrounds, and it tweaked the Widgets feature to resemble Windows 10. Microsoft describes Voice Access as a new feature, one distinct from dictation, which has been in Windows for some time.