Yang, Qian
How Problematic Writer-AI Interactions (Rather than Problematic AI) Hinder Writers' Idea Generation
Umarova, Khonzoda, Wise, Talia, Lyu, Zhuoer, Lee, Mina, Yang, Qian
Writing about a subject enriches writers' understanding of that subject. This cognitive benefit of writing -- known as constructive learning -- is essential to how students learn in various disciplines. However, does this benefit persist when students write with generative AI writing assistants? Prior research suggests the answer varies based on the type of AI, e.g., auto-complete systems tend to hinder ideation, while assistants that pose Socratic questions facilitate it. This paper adds another perspective. Through a case study, we demonstrate that the impact of genAI on students' idea development depends not only on the AI but also on the students and, crucially, on the interactions between them. Students who proactively explored ideas gained new ideas from writing, regardless of whether they used auto-complete or Socratic AI assistants. Those who engaged in prolonged, mindless copyediting developed few ideas even with a Socratic AI. These findings suggest opportunities in designing AI writing assistants, not merely by creating more thought-provoking AI, but also by fostering more thought-provoking writer-AI interactions.
Cross-platform Prediction of Depression Treatment Outcome Using Location Sensory Data on Smartphones
Sahoo, Soumyashree, Shende, Chinmaey, Hossain, Md. Zakir, Patel, Parit, Niu, Yushuo, Wang, Xinyu, Ware, Shweta, Bi, Jinbo, Kamath, Jayesh, Russel, Alexander, Song, Dongjin, Yang, Qian, Wang, Bing
Currently, depression treatment relies on closely monitoring patients' response to treatment and adjusting the treatment as needed. Using self-reported or physician-administered questionnaires to monitor treatment response is, however, burdensome and costly, and it suffers from recall bias. In this paper, we explore using location sensory data collected passively on smartphones to predict treatment outcome. To address heterogeneous data collection on Android and iOS phones, the two predominant smartphone platforms, we explore using domain adaptation techniques to map their data to a common feature space, and then use the data jointly to train machine learning models. Our results show that this domain adaptation approach can lead to significantly better prediction than that with no domain adaptation. In addition, our results show that using location features and the baseline self-reported questionnaire score can yield an F1 score of up to 0.67, comparable to that obtained using periodic self-reported questionnaires, indicating that using location data is a promising direction for predicting depression treatment outcome.
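To make the idea of mapping Android and iOS data to a common feature space concrete, here is a minimal sketch. It is not the paper's method: it swaps in per-platform z-scoring, one of the simplest ways to align feature distributions across two domains. The function name `standardize_per_platform` and the toy values (a hypothetical location feature recorded on different scales by each platform) are assumptions made for illustration.

```python
import statistics


def standardize_per_platform(samples):
    """Z-score a single feature separately within each platform.

    samples: list of (platform, value) pairs.
    Returns a list of (platform, standardized_value) in the same order.
    """
    grouped = {}
    for platform, value in samples:
        grouped.setdefault(platform, []).append(value)

    stats = {}
    for platform, values in grouped.items():
        mean = statistics.mean(values)
        # Fall back to 1.0 so a constant feature does not divide by zero.
        std = statistics.pstdev(values) or 1.0
        stats[platform] = (mean, std)

    return [
        (platform, (value - stats[platform][0]) / stats[platform][1])
        for platform, value in samples
    ]


# Hypothetical feature values: the two platforms report on different scales.
samples = [
    ("android", 10.0), ("android", 20.0), ("android", 30.0),
    ("ios", 1.0), ("ios", 2.0), ("ios", 3.0),
]
normalized = standardize_per_platform(samples)
android = [v for p, v in normalized if p == "android"]
ios = [v for p, v in normalized if p == "ios"]
```

After standardization the two platforms' values lie on the same scale, so a single model could in principle be trained on the pooled data. Domain adaptation techniques in the literature learn richer mappings than this.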
Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning
Yang, Qian, Graham, Calbert
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation. VAEs consist of two main parts: a content encoder and a decoder. The content encoder processes source speech, transforms it into a latent representation, and removes speaker information. The decoder takes the speaker identity, combines it with the latent representation, and reconstructs the speech [5]. A notable VAE approach is disentangling speaker and content representations using instance normalization, which
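The instance normalization mentioned above can be illustrated with a small sketch. It assumes, as is common in this line of work, that per-channel mean and variance over time carry global (speaker-like) characteristics, so removing them leaves a more speaker-independent representation. This is a toy in plain Python, not the Stepback network; `instance_normalize` and the example features are hypothetical.

```python
import math


def instance_normalize(features, eps=1e-5):
    """Normalize each channel over time to zero mean and unit variance.

    features: list of channels, each a list of values over time frames.
    """
    normalized = []
    for channel in features:
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        normalized.append([(x - mean) / std for x in channel])
    return normalized


# Two toy "utterances" whose single channel differs only by a global
# offset and scale, standing in for speaker-dependent statistics.
speaker_a = [[1.0, 2.0, 3.0, 4.0]]
speaker_b = [[10.0, 30.0, 50.0, 70.0]]

norm_a = instance_normalize(speaker_a)
norm_b = instance_normalize(speaker_b)
```

After normalization the two channels are nearly identical: the global offset and scale are gone, and only the shared temporal pattern remains.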
WavChat: A Survey of Spoken Dialogue Models
Ji, Shengpeng, Chen, Yifu, Fang, Minghui, Zuo, Jialong, Lu, Jingyu, Wang, Hanting, Jiang, Ziyue, Zhou, Long, Liu, Shujie, Cheng, Xize, Yang, Xiaoda, Wang, Zehan, Yang, Qian, Li, Jian, Jiang, Yidi, He, Jingzhen, Chu, Yunfei, Xu, Jin, Zhao, Zhou
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking capability. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering aspects such as speech representation, training paradigm, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. The related material is available at https://github.com/jishengpeng/WavChat.
Thoughtful Adoption of NLP for Civic Participation: Understanding Differences Among Policymakers
Guridi, Jose A., Cheyre, Cristobal, Yang, Qian
Natural language processing (NLP) tools have the potential to boost civic participation and enhance democratic processes because they can significantly increase governments' capacity to gather and analyze citizen opinions. However, their adoption in government remains limited, and harnessing their benefits while preventing unintended consequences remains a challenge. While prior work has focused on improving NLP performance, this work examines how different internal government stakeholders influence NLP tools' thoughtful adoption. We interviewed seven politicians (politically appointed officials as heads of government institutions) and thirteen public servants (career government employees who design and administer policy interventions), inquiring how they choose whether and how to use NLP tools to support civic participation processes. The interviews suggest that policymakers across both groups focused on their needs for career advancement and the need to showcase the legitimacy and fairness of their work when considering NLP tool adoption and use. Because these needs vary between politicians and public servants, their preferred NLP features and tool designs also differ. Interestingly, despite their differing needs and opinions, neither group clearly identifies who should advocate for NLP adoption to enhance civic participation or address the unintended consequences of a poorly considered adoption. This lack of clarity in responsibility might have caused the governments' low adoption of NLP tools. We discuss how these findings reveal new insights for future HCI research. They inform the design of NLP tools for increasing civic participation efficiency and capacity, the design of other tools and methods that ensure thoughtful adoption of AI tools in government, and the design of NLP tools for collaborative use among users with different incentives and needs.
An adapted large language model facilitates multiple medical tasks in diabetes care
Wei, Lai, Ying, Zhen, He, Muyang, Chen, Yutong, Yang, Qian, Hong, Yanzhe, Lu, Jiaping, Li, Xiaoying, Huang, Weiran, Chen, Ying
Diabetes is a chronic disease that poses a significant global health burden, and optimizing diabetes management requires multi-stakeholder collaboration. Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across a diverse range of diabetes tasks remains unproven. In this study, we introduced a framework to train and validate diabetes-specific LLMs. We first developed a comprehensive data processing pipeline that includes data collection, filtering, augmentation and refinement. This approach contributes to creating a high-quality, diabetes-specific dataset and several evaluation benchmarks, built entirely from scratch. Utilizing the collected training dataset, we fine-tuned a diabetes-specific LLM family that demonstrated state-of-the-art proficiency in understanding and processing various diabetes tasks compared to other LLMs. Furthermore, clinical studies showed the potential applications of our models in diabetes care, including providing personalized healthcare, assisting medical education, and streamlining clinical tasks. In conclusion, our study introduced a framework to develop and evaluate a diabetes-specific LLM family, and highlighted its potential to enhance clinical practice and provide personalized, data-driven support for diabetes care across different end users.
Qwen2-Audio Technical Report
Chu, Yunfei, Xu, Jin, Yang, Qian, Wei, Haojie, Wei, Xipin, Guo, Zhifang, Leng, Yichong, Lv, Yuanjun, He, Jinzheng, Lin, Junyang, Zhou, Chang, Zhou, Jingren
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or responding directly in text to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
Yang, Qian, Yan, Weixiang, Agrawal, Aishwarya
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of the VLM's direct answer. Experiments across six vision-language tasks with three VLMs show that DeCC's reliability estimation achieves better correlation with task accuracy compared to the existing methods.
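The decompose-and-compare idea described above can be sketched as a small control flow. This is an illustration under stated assumptions, not the authors' implementation: the callables `vlm_answer`, `decompose`, and `reason_over` are hypothetical stand-ins for model calls, and exact string matching is an assumed, simplistic consistency test.

```python
def decompose_and_compare(question, vlm_answer, decompose, reason_over):
    """Compare a direct answer with an answer derived from sub-questions.

    vlm_answer(q) -> str: the model's answer to a question (hypothetical).
    decompose(q) -> list[str]: sub-questions for q (hypothetical).
    reason_over(q, sub_answers) -> str: answer inferred from sub-answers.
    """
    direct = vlm_answer(question)
    sub_answers = [(sq, vlm_answer(sq)) for sq in decompose(question)]
    indirect = reason_over(question, sub_answers)
    # Assumption: answers are short strings, so a normalized string
    # comparison stands in for a real consistency check.
    consistent = direct.strip().lower() == indirect.strip().lower()
    return {"direct": direct, "indirect": indirect, "consistent": consistent}


# Toy stand-ins: a canned "model" whose direct answer conflicts with
# what its own sub-answers imply.
canned = {
    "Is the animal on the sofa a cat?": "yes",
    "What animal is in the image?": "dog",
    "Where is the animal?": "on the sofa",
}


def toy_vlm(q):
    return canned[q]


def toy_decompose(q):
    return ["What animal is in the image?", "Where is the animal?"]


def toy_reason(q, sub_answers):
    answers = dict(sub_answers)
    is_cat = answers["What animal is in the image?"] == "cat"
    on_sofa = "sofa" in answers["Where is the animal?"]
    return "yes" if is_cat and on_sofa else "no"


result = decompose_and_compare(
    "Is the animal on the sofa a cat?", toy_vlm, toy_decompose, toy_reason
)
```

Here the direct answer and the answer implied by the sub-questions disagree, so the direct answer would be flagged as less reliable.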
CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification
Tian, Yuchen, Yan, Weixiang, Yang, Qian, Chen, Qian, Wang, Wen, Luo, Ziyang, Ma, Lei
Large Language Models (LLMs) have made significant progress in code generation, providing developers with unprecedented automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible but may not execute as expected or meet specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To enhance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We classify code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we develop a dynamic detection algorithm named CodeHalu to quantify code hallucinations and establish the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs on this benchmark, we reveal significant differences in their accuracy and reliability in code generation and provide detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
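As a rough illustration of execution-based verification, the sketch below runs a generated function against test cases and reports what happened. It is not the CodeHalu algorithm: the function `verify_by_execution` and its outcome labels are hypothetical, and how such outcomes map onto the four hallucination types is defined in the paper and not reproduced here.

```python
def verify_by_execution(code, func_name, test_cases):
    """Execute generated code and check it against (args, expected) pairs.

    Returns "pass", "wrong_output", or "error:<ExceptionName>".
    Caution: exec runs arbitrary code; a real harness would sandbox it.
    """
    namespace = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
    except Exception as exc:  # e.g. SyntaxError, or KeyError if undefined
        return f"error:{type(exc).__name__}"

    for args, expected in test_cases:
        try:
            actual = func(*args)
        except Exception as exc:
            return f"error:{type(exc).__name__}"
        if actual != expected:
            return "wrong_output"
    return "pass"


tests = [((1, 2), 3), ((0, 0), 0)]

good_code = "def add(a, b):\n    return a + b\n"
undefined_name = "def add(a, b):\n    return a + c\n"
wrong_logic = "def add(a, b):\n    return a - b\n"

outcome_good = verify_by_execution(good_code, "add", tests)
outcome_name = verify_by_execution(undefined_name, "add", tests)
outcome_logic = verify_by_execution(wrong_logic, "add", tests)
```

All three snippets are syntactically valid, yet only execution separates the correct one from the one that raises and the one that silently returns the wrong value.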
A Piece of Theatre: Investigating How Teachers Design LLM Chatbots to Assist Adolescent Cyberbullying Education
Hedderich, Michael A., Bazarova, Natalie N., Zou, Wenting, Shim, Ryun, Ma, Xinda, Yang, Qian
Cyberbullying harms teenagers' mental health, and teaching them upstanding intervention is crucial. Wizard-of-Oz studies show chatbots can scale up personalized and interactive cyberbullying education, but implementing such chatbots is a challenging and delicate task. We created a no-code chatbot design tool for K-12 teachers. Using large language models and prompt chaining, our tool allows teachers to prototype bespoke dialogue flows and chatbot utterances. In offering this tool, we explore teachers' distinctive needs when designing chatbots to assist their teaching, and how chatbot design tools might better support them. Our findings reveal that teachers welcome the tool enthusiastically. Moreover, they see themselves as playwrights guiding both the students' and the chatbot's behaviors, while allowing for some improvisation. Their goal is to enable students to rehearse both desirable and undesirable reactions to cyberbullying in a safe environment. We discuss the design opportunities LLM-Chains offer for empowering teachers and the research opportunities this work opens up.
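Prompt chaining, as mentioned above, means feeding the output of one LLM prompt into the next. The sketch below shows only that general pattern. The tool's actual chain structure is not described here, so `run_chain`, the templates, and the stub `fake_llm` (which stands in for a real model call) are all hypothetical.

```python
def run_chain(templates, user_message, llm):
    """Run prompt templates in sequence, piping each output into the next.

    templates: list of strings containing an "{input}" placeholder.
    llm: callable taking a prompt string and returning a reply string.
    """
    context = user_message
    outputs = []
    for template in templates:
        prompt = template.format(input=context)
        context = llm(prompt)
        outputs.append(context)
    return outputs


def fake_llm(prompt):
    # Stub model: echoes the prompt so the data flow is visible.
    return f"[reply to: {prompt}]"


# Hypothetical two-step dialogue flow a teacher might prototype.
templates = [
    "Classify the student's message: {input}",
    "Write a supportive chatbot reply given this classification: {input}",
]
outputs = run_chain(templates, "Someone posted a mean comment about me.", fake_llm)
```

Each step sees only the previous step's output, which lets a designer adjust one stage of a dialogue flow without rewriting the others.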