Discourse & Dialogue
Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse
Bailey, Mark M., Heiligman, Mark I.
How can we detect when global events fundamentally reshape public discourse? This study introduces a topological framework for identifying structural change in media narratives using persistent homology. Drawing on international news articles surrounding major events - including the Russian invasion of Ukraine (Feb 2022), the murder of George Floyd (May 2020), the U.S. Capitol insurrection (Jan 2021), and the Hamas-led invasion of Israel (Oct 2023) - we construct daily co-occurrence graphs of noun phrases to trace evolving discourse. Each graph is embedded and transformed into a persistence diagram via a Vietoris-Rips filtration. We then compute Wasserstein distances and persistence entropies across homological dimensions to capture semantic disruption and narrative volatility over time. Our results show that major geopolitical and social events align with sharp spikes in both H0 (connected components) and H1 (loops), indicating sudden reorganization in narrative structure and coherence. Cross-correlation analyses reveal a typical lag pattern in which changes to component-level structure (H0) precede higher-order motif shifts (H1), suggesting a bottom-up cascade of semantic change. An exception occurs during the Russian invasion of Ukraine, where H1 entropy leads H0, possibly reflecting top-down narrative framing before local discourse adjusts. Persistence entropy further distinguishes tightly focused from diffuse narrative regimes. These findings demonstrate that persistent homology offers a mathematically principled, unsupervised method for detecting inflection points and directional shifts in public attention - without requiring prior knowledge of specific events. This topological approach advances computational social science by enabling real-time detection of semantic restructuring during crises, protests, and information shocks.
EmoNews: A Spoken Dialogue System for Expressive News Conversations
Matsuura, Ryuki, Bharadwaj, Shikhar, Liu, Jiarui, Govindarajan, Dhatchi Kunde
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of the emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1
Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings
Dukiฤ, David, Bariฤ, Ana, ฤuljak, Marko, Jukiฤ, Josip, Tutek, Martin
Measuring how semantics of words change over time improves our understanding of how cultures and perspectives change. Diachronic word embeddings help us quantify this shift, although previous studies leveraged substantial temporally annotated corpora. In this work, we use a corpus of 9.5 million Croatian news articles spanning the past 25 years and quantify semantic change using skip-gram word embeddings trained on five-year periods. Our analysis finds that word embeddings capture linguistic shifts of terms pertaining to major topics in this timespan (COVID-19, Croatia joining the European Union, technological advancements). We also find evidence that embeddings from post-2020 encode increased positivity in sentiment analysis tasks, contrasting studies reporting a decline in mental health over the same period.
A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations
Lan, Tian, Zhou, Yang-Hao, Ma, Zi-Ao, Sun, Fanshu, Sun, Rui-Qing, Luo, Junyu, Tu, Rong-Cheng, Huang, Heyan, Xu, Chen, Wu, Zhijing, Mao, Xian-Ling
Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities; We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.
What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps
Chhetri, Vinaik, Upadhyay, Krishna, Siddique, A. B., Farooq, Umar
Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. To this end, our approach provides a more faithful reflection of the real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
Reichman, Benjamin, Patsch, Constantin, Truxal, Jack, Jain, Atishay, Heck, Larry
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
Approaching Dialogue State Tracking via Aligning Speech Encoders and LLMs
Sedlรกฤek, ล imon, Yusuf, Bolaji, ล vec, Jรกn, Hegde, Pradyoth, Kesiraju, Santosh, Plchot, Oldลich, ฤernockรฝ, Jan
In this work, we approach spoken Dialogue State Tracking (DST) by bridging the representation spaces of speech encoders and LLMs via a small connector module, with a focus on fully open-sourced and open-data components (WavLM-large, OLMo). We focus on ablating different aspects of such systems including full/LoRA adapter fine-tuning, the effect of agent turns in the dialogue history, as well as fuzzy matching-based output post-processing, which greatly improves performance of our systems on named entities in the dialogue slot values. We conduct our experiments on the SpokenWOZ dataset, and additionally utilize the Speech-Aware MultiWOZ dataset to augment our training data. Ultimately, our best-performing WavLM + connector + OLMo-1B aligned models achieve state of the art on the SpokenWOZ test set (34.66% JGA), and our system with Gemma-2-9B-instruct further surpasses this result, reaching 42.17% JGA on SpokenWOZ test.
E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) -- exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Quantum Graph Transformer for NLP Sentiment Classification
Aktar, Shamminuj, Bรคrtschi, Andreas, Badawy, Abdel-Hameed A., Eidenbenz, Stephan
Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.
DiscoSum: Discourse-aware News Summarization
Spangher, Alexander, Huang, Tenghao, Gu, Jialiang, Shi, Jiatong, Chen, Muhao
Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g. LinkedIn, Facebook, etc.). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.