 Discourse & Dialogue


Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

arXiv.org Artificial Intelligence

Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as Switchboard. We will publicly release our models and training code.
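For readers unfamiliar with the ROUGE-1 metric cited above, the following is a minimal sketch of its F1 variant as clipped unigram overlap. It assumes plain whitespace tokenization and lowercasing, which is a simplification; the abstract does not say which ROUGE implementation or preprocessing the authors used, and the function name `rouge1_f1` is illustrative only.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string.

    Assumption: whitespace tokenization and lowercasing only
    (no stemming or stopword handling).
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("the cat", "the cat sat"))
```

Scores are commonly reported on a 0 to 100 scale, so an improvement of 1.5 ROUGE-1 would correspond to a 0.015 increase in the value this function returns.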


Entriever: Energy-based Retriever for Knowledge-Grounded Dialog Systems

arXiv.org Artificial Intelligence

A retriever, which retrieves relevant knowledge pieces from a knowledge base given a context, is an important component in many natural language processing (NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog systems to improve knowledge acquisition. In knowledge-grounded dialog systems, when conditioning on a given context, there may be multiple relevant and correlated knowledge pieces. However, knowledge pieces are usually assumed to be conditionally independent in current retriever models. To address this issue, we propose Entriever, an energy-based retriever. Entriever directly models the candidate retrieval results as a whole instead of modeling the knowledge pieces separately, with the relevance score defined by an energy function. We explore various architectures of energy functions and different training methods for Entriever, and show that Entriever substantially outperforms the strong cross-encoder baseline in knowledge retrieval tasks. Furthermore, we show that in semi-supervised training of knowledge-grounded dialog systems, Entriever enables effective scoring of retrieved knowledge pieces and significantly improves end-to-end performance of dialog systems.
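The contrast between scoring knowledge pieces independently and scoring a candidate retrieval result as a whole can be illustrated with a toy sketch. This is not Entriever's architecture or training method: the energy function below is hand-written rather than learned, and the names `energy`, `best_subset`, `relevance`, and `interaction` are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def energy(relevance, interaction, subset):
    """Toy energy of a candidate retrieval result (lower is better).

    relevance[i]        -- hypothetical per-piece relevance to the context
    interaction[(i, j)] -- hypothetical bonus when pieces i and j belong together
    """
    unary = -sum(relevance[i] for i in subset)
    pairwise = -sum(
        interaction.get((i, j), 0.0)
        for i, j in combinations(sorted(subset), 2)
    )
    return unary + pairwise


def best_subset(relevance, interaction, size):
    """Exhaustively pick the subset of the given size with the lowest energy."""
    candidates = combinations(range(len(relevance)), size)
    return min(candidates, key=lambda s: energy(relevance, interaction, s))


# Illustrative numbers only: pieces 1 and 2 are correlated.
relevance = [0.9, 0.8, 0.7]
interaction = {(1, 2): 0.5}

independent_top2 = tuple(
    sorted(sorted(range(3), key=lambda i: -relevance[i])[:2])
)
joint_top2 = best_subset(relevance, interaction, size=2)

if __name__ == "__main__":
    print("independent:", independent_top2)
    print("joint:", joint_top2)
```

With these numbers, independent ranking selects pieces 0 and 1, while the joint energy prefers pieces 1 and 2 because their interaction term outweighs the small relevance gap. Exhaustive enumeration is only feasible for tiny candidate pools and is used here purely to keep the sketch short.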


Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

The paper attempts to move away from traditional evaluation of open-domain dialog systems (i.e., judge response given its conversation history) and moves towards a more interactive one (i.e., human talking to a bot), which is likely an important step towards better evaluation. However, I do have several serious concerns about this work in its current form: (1) The authors contrast their work with existing evaluation for open-domain dialog evaluation, which they call "single-turn" evaluation. They point out that this type of evaluation prevents it from capturing "failure modes […] such as a lack of diversity in the responses, inability to track long-term aspects of the conversation". I think this is rather misleading and the term "single-turn" is a misnomer. Most previous work has indeed evaluated each conversation by factorizing it into a sequence of independent turn-level judgments, but each of these judgments assesses the quality of the current turn T_n **given** a history of several previous turns …, T_n-k, …, T_n-1.


Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Neural Information Processing Systems

This paper explores interesting directions, in particular 1) using interactive settings to evaluate a model rather than a single answer, and 2) combining different automated metrics in a weighted sum to approximate human evaluation (e.g., based on sentiment). Reviewers have raised crucial points, regarding gameability (so that using the metrics for training a model is tricky if not followed by a non-gameable evaluation), and lack of comparability between different self-play setups. It's indeed a much better evaluation setting if the system does not control both sides (e.g., models being matched to the same set of fixed models), so authors should definitely follow that direction. However, I expect this work would still be interesting to the dialog community: many of the diagnostic advantages of the model-talking-to-model setting remain, in practice, especially because the model is in fact not trained with the self-play objective, but that criterion is only used post hoc (so the system can't extensively exploit it during training). In practice, a lot of the problems of the generations of a given model already show up during self-play, and the reasonable worry raised by reviewers that the model could exploit the metric remains theoretical at the moment.
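The idea of combining automated metrics in a weighted sum can be sketched as follows. The metric names and weights are hypothetical and chosen only for illustration; the review mentions sentiment as one example but does not list the metrics or how the paper sets the weights.

```python
def combined_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-conversation automated metrics.

    Every metric named in `weights` must be present in `metrics`;
    metrics without a weight are ignored.
    """
    return sum(weight * metrics[name] for name, weight in weights.items())


# Hypothetical metric values for one self-play conversation.
example_metrics = {"sentiment": 0.5, "coherence": 0.8}
# Hypothetical weights; in practice these would be tuned against human ratings.
example_weights = {"sentiment": 0.25, "coherence": 0.75}

if __name__ == "__main__":
    print(combined_score(example_metrics, example_weights))
```

Because the score is a fixed function of the model's own outputs, a model trained directly against it could learn to inflate individual terms, which is the gameability concern the review describes.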


FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis

arXiv.org Artificial Intelligence

In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on https://github.com/cwei01/FCKT.


Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments

arXiv.org Artificial Intelligence

Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17 English sentence-level datasets that are most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.


Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline

Neural Information Processing Systems

Existing video multi-modal sentiment analysis mainly focuses on the sentiment expression of people within the video, yet often neglects the induced sentiment of viewers while watching the videos. Induced sentiment of viewers is essential for inferring the public response to videos and has broad application in analyzing public societal sentiment, effectiveness of advertising and other areas. Micro videos and their related comments provide a rich application scenario for viewers' induced sentiment analysis. In light of this, we introduce a novel research task, Multimodal Sentiment Analysis for Comment Response of Video Induced (MSA-CRVI), which aims to infer opinions and emotions from comments responding to micro videos. Meanwhile, we manually annotate a dataset named Comment Sentiment toward to Micro Video (CSMV) to support this research.


FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model

Neural Information Processing Systems

Topic models have been evolving rapidly over the years, from conventional to recent neural models. In this paper, we propose FASTopic, a fast, adaptive, stable, and transferable topic model. FASTopic follows a new paradigm: Dual Semantic-relation Reconstruction (DSR). By reconstructing through these semantic relations, DSR discovers latent topics. This brings about a neat and efficient topic modeling framework.


SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Neural Information Processing Systems

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that of SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g.


Towards Robust Multimodal Sentiment Analysis with Incomplete Data

Neural Information Processing Systems

The field of Multimodal Sentiment Analysis (MSA) has recently witnessed an emerging direction seeking to tackle the issue of data incompleteness. Recognizing that the language modality typically contains dense sentiment information, we consider it as the dominant modality and present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust MSA. The proposed LNLN features a dominant modality correction (DMC) module and a dominant modality based multimodal learning (DMML) module, which enhance the model's robustness across various noise scenarios by ensuring the quality of dominant modality representations. Aside from the methodical design, we perform comprehensive experiments under random data missing scenarios, utilizing diverse and meaningful settings on several popular datasets (e.g., MOSI, MOSEI, and SIMS), providing additional uniformity, transparency, and fairness compared to existing evaluations in the literature. Empirically, LNLN consistently outperforms existing baselines, demonstrating superior performance across these challenging and extensive evaluation metrics.