text response
Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
Arora, Siddhant, Tian, Jinchuan, Futami, Hayato, Shi, Jiatong, Kashiwagi, Yosuke, Tsunoo, Emiru, Watanabe, Shinji
Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (V AD) for turn-taking, but V AD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit V AD. However, they often have complex dual-channel architecture and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets--aligned user transcripts and system responses--for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (2 more...)
GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM
Song, Yaodong, Chen, Hongjie, Lian, Jie, Zhang, Yuxin, Xia, Guangmin, Li, Zehan, Zhao, Genliang, Kang, Jian, Li, Jie, Li, Yongxiang, Li, Xuelong
While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-n layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
- Europe > Austria > Vienna (0.14)
- Asia > China > Shanghai > Shanghai (0.05)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (4 more...)
Three takeaways about AI's energy use and climate impacts
One key caveat here is that we don't know much about "closed source" models--for these, companies hold back the details of how they work. Instead, we worked with researchers who measured the energy it takes to run open-source AI models, for which the source code is publicly available. But using open-source models, it's possible to directly measure the energy used to respond to a query rather than just guess. We worked with researchers who generated text, images, and video and measured the energy required for the chips the models are based on to perform the task. Even just within the text responses, there was a pretty large range of energy needs.
Learning to Represent Individual Differences for Choice Decision Making
Chen, Yan-Ying, Weng, Yue, Filipowicz, Alexandre, Iliev, Rumen, Chen, Francine, Hakimi, Shabnam, Zhang, Yanxia, Lee, Matthew, Lyons, Kent, Wu, Charlene
Human decision making can be challenging to predict because decisions are affected by a number of complex factors. Adding to this complexity, decision-making processes can differ considerably between individuals, and methods aimed at predicting human decisions need to take individual differences into account. Behavioral science offers methods by which to measure individual differences (e.g., questionnaires, behavioral models), but these are often narrowed down to low dimensions and not tailored to specific prediction tasks. This paper investigates the use of representation learning to measure individual differences from behavioral experiment data. Representation learning offers a flexible approach to create individual embeddings from data that are both structured (e.g., demographic information) and unstructured (e.g., free text), where the flexibility provides more options for individual difference measures for personalization, e.g., free text responses may allow for open-ended questions that are less privacy-sensitive. In the current paper we use representation learning to characterize individual differences in human performance on an economic decision-making task. We demonstrate that models using representation learning to capture individual differences consistently improve decision predictions over models without representation learning, and even outperform well-known theory-based behavioral models used in these environments. Our results propose that representation learning offers a useful and flexible tool to capture individual differences.
- Europe > Austria > Vienna (0.14)
- North America > United States > New York (0.04)
- North America > United States > California > Santa Clara County > Los Altos (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Gao, Heting, Shao, Hang, Wang, Xiong, Qiu, Chaofan, Shen, Yunhang, Cai, Siqi, Shi, Yuchen, Xu, Zihan, Long, Zuwei, Zhang, Yike, Dong, Shaoqi, Fu, Chaoyou, Li, Ke, Ma, Long, Sun, Xing
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward more sophisticated audio agent from recent advancement in end-to-end (E2E) speech systems, we propose LUCY, a E2E speech model that (1) senses and responds to user's emotion, (2) deliver responses in a succinct and natural style, and (3) use external tool to answer real-time inquiries. Experiment results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. Lucy is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.
- Leisure & Entertainment (0.48)
- Health & Medicine > Therapeutic Area (0.34)
- Media (0.34)
WavChat: A Survey of Spoken Dialogue Models
Ji, Shengpeng, Chen, Yifu, Fang, Minghui, Zuo, Jialong, Lu, Jingyu, Wang, Hanting, Jiang, Ziyue, Zhou, Long, Liu, Shujie, Cheng, Xize, Yang, Xiaoda, Wang, Zehan, Yang, Qian, Li, Jian, Jiang, Yidi, He, Jingzhen, Chu, Yunfei, Xu, Jin, Zhao, Zhou
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we have first compiled existing spoken dialogue systems in the chronological order and categorized them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at https://github.com/jishengpeng/WavChat.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (3 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology (0.92)
- Education (0.92)
LLMD: A Large Language Model for Interpreting Longitudinal Medical Records
Porter, Robert, Diehl, Adam, Pastel, Benjamin, Hinnefeld, J. Henry, Nerenberg, Lawson, Maung, Pye, Kerbrat, Sebastien, Hanson, Gillian, Astorino, Troy, Tarsa, Stephen J.
We introduce LLMD, a large language model designed to analyze a patient's medical history based on their medical records. Along with domain knowledge, LLMD is trained on a large corpus of records collected over time and across facilities, as well as tasks and labels that make nuanced connections among them. This approach is critical to an accurate picture of patient health, and has distinctive advantages over models trained on knowledge alone, unlabeled records, structured EHR data, or records from a single health system. The recipe for LLMD continues pretraining a foundational model on both domain knowledge and the contents of millions of records. These span an average of 10 years of care and as many as 140 care sites per patient. LLMD is then instruction fine-tuned on structuring and abstraction tasks. The former jointly identify and normalize document metadata, provenance information, clinical named-entities, and ontology mappings, while the latter roll these into higher-level representations, such a continuous era of time a patient was on a medication. LLMD is deployed within a layered validation system that includes continual random audits and review by experts, e.g. based on uncertainty, disease-specific rules, or use-case. LLMD exhibits large gains over both more-powerful generalized models and domain-specific models. On medical knowledge benchmarks, LLMD-8B achieves state of the art accuracy on PubMedQA text responses, besting orders-of-magnitude larger models. On production tasks, we show that LLMD significantly outperforms all other models evaluated, and among alternatives, large general purpose LLMs like GPT-4o are more accurate than models emphasizing medical knowledge. We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.'
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Fang, Qingkai, Guo, Shoutao, Zhou, Yan, Ma, Zhengrui, Zhang, Shaolei, Feng, Yang
Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future. Large language models (LLMs), represented by ChatGPT (OpenAI, 2022), have become powerful general-purpose task solvers, capable of assisting people in daily life through conversational interactions. However, most LLMs currently only support text-based interactions, which limits their application in scenarios where text input and output are not ideal. Recently, the emergence of GPT-4o (OpenAI, 2024) has made it possible to interact with LLMs through speech, responding to user's instruction with extremely low latency and significantly enhancing the user experience.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
Prompt Stability Scoring for Text Annotation with Large Language Models
Barrie, Christopher, Palaiologou, Elli, Törnberg, Petter
Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.
- Europe > United Kingdom (0.29)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (2 more...)
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
Banerjee, Somnath, Layek, Sayan, Hazra, Rima, Mukherjee, Animesh
In this study, we tackle a growing concern around the safety and ethical use of large language models (LLMs). Despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. Our work zeroes in on a specific issue: to what extent LLMs can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. To investigate this question, we introduce TechHazardQA, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. We query a series of LLMs -- Llama-2-13b, Llama-2-7b, Mistral-V2 and Mistral 8X7B -- and ask them to generate both text and instruction-centric responses. For evaluation we report the harmfulness score metric as well as judgements from GPT-4 and humans. Overall, we observe that asking LLMs to produce instruction-centric responses enhances the unethical response generation by ~2-38% across the models. As an additional objective, we investigate the impact of model editing using the ROME technique, which further increases the propensity for generating undesirable content. In particular, asking edited LLMs to generate instruction-centric responses further increases the unethical response generation by ~3-16% across the different models.
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (2 more...)