Content preservation


Content preserving text generation with attribute controls

Lajanugen Logeswaran, Honglak Lee, Samy Bengio

Neural Information Processing Systems

We focus on categorical attributes of language. Examples of such attributes include sentiment, language complexity, tense, voice, honorifics, mood, etc. Our approach draws inspiration from style transfer methods in the vision and language literature.




Neural Information Processing Systems

We formulate the decoding process as an optimization problem, which allows multiple attributes we aim to control to be easily incorporated as differentiable constraints. By relaxing this discrete optimization to a continuous one, we make use of Lagrangian multipliers and gradient-descent based techniques to generate the desired text.
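The relaxed decoding described in this snippet can be illustrated with a toy primal-dual sketch. This is not the paper's actual decoder: the vocabulary, per-token sentiment scores, target value, and step sizes below are all invented for illustration. The idea is to optimize a continuous relaxation of the token choices (logits per position) while a Lagrange multiplier enforces a differentiable attribute constraint.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical toy vocabulary with made-up per-token sentiment scores
vocab = ["good", "bad", "okay", "movie"]
sentiment = np.array([1.0, -1.0, 0.1, 0.0])

# per-position "content" reference distributions, e.g. from the source sentence
ref = softmax(np.array([[0.0, 2.0, 0.0, 0.0],    # position 1 favors "bad"
                        [0.0, 0.0, 0.0, 2.0]]))  # position 2 favors "movie"

z = np.zeros_like(ref)  # relaxed token logits to optimize
lam = 0.0               # Lagrange multiplier for the sentiment constraint
target = 0.5            # require average sentiment >= target

for step in range(500):
    p = softmax(z)
    # constraint violation g(p) = target - mean sentiment (feasible when g <= 0)
    g = target - (p @ sentiment).mean()
    # gradient of cross-entropy to the reference w.r.t. logits: p - ref
    grad_ce = p - ref
    # gradient of mean sentiment w.r.t. logits
    grad_sent = p * (sentiment - (p @ sentiment)[:, None]) / len(p)
    z -= 0.5 * (grad_ce - lam * grad_sent)  # descent on the Lagrangian
    lam = max(0.0, lam + 0.5 * g)           # projected ascent on the multiplier

# discretize the relaxed solution back to tokens
tokens = [vocab[i] for i in softmax(z).argmax(axis=-1)]
```

The primal step pulls the relaxed text toward the content reference, while the growing multiplier increasingly rewards tokens that satisfy the sentiment constraint.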


Content preserving text generation with attribute controls

Lajanugen Logeswaran, Honglak Lee, Samy Bengio

Neural Information Processing Systems

In this work, we address the problem of modifying textual attributes of sentences. Given an input sentence and a set of attribute labels, we attempt to generate sentences that are compatible with the conditioning information.


DiffStyleTS: Diffusion Model for Style Transfer in Time Series

Nagda, Mayank, Ostheimer, Phil, Arweiler, Justus, Jungjohann, Indra, Werner, Jennifer, Wagner, Dennis, Muraleedharan, Aparna, Jafari, Pouya, Schmid, Jochen, Jirasek, Fabian, Burger, Jakob, Bortz, Michael, Hasse, Hans, Mandt, Stephan, Kloft, Marius, Fellenz, Sophie

arXiv.org Artificial Intelligence

Style transfer combines the content of one signal with the style of another. It supports applications such as data augmentation and scenario simulation, helping machine learning models generalize in data-scarce domains. While well developed in vision and language, style transfer methods for time series data remain limited. We introduce DiffTSST, a diffusion-based framework that disentangles a time series into content and style representations via convolutional encoders and recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series, enabling conditional generation of novel samples to achieve style transfer. We demonstrate both qualitatively and quantitatively that DiffTSST achieves effective style transfer. We further validate its real-world utility by showing that data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.
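The content/style recombination idea in this abstract can be sketched with a deliberately naive baseline, not the paper's diffusion model: here "content" is a moving-average trend and "style" is just the residual's mean and standard deviation, both invented stand-ins for the learned convolutional encoders.

```python
import numpy as np

def decompose(x, window=10):
    # crude content/style split: moving-average trend as "content",
    # residual mean/std as a toy one-number-per-moment "style" summary
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    residual = x - trend
    return trend, (residual.mean(), residual.std())

def recombine(content_trend, style_stats, rng):
    # re-texture the content trend with noise matching the style statistics
    mu, sigma = style_stats
    return content_trend + rng.normal(mu, sigma, size=content_trend.shape)

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
series_a = np.sin(t) + rng.normal(0, 0.05, t.shape)  # content source: smooth sine
series_b = 0.2 * t + rng.normal(0, 0.4, t.shape)     # style source: noisy ramp

content, _ = decompose(series_a)
_, style = decompose(series_b)
transferred = recombine(content, style, rng)  # sine shape, noisy texture
```

A learned model like the one described would replace both the decomposition and the recombination with trained encoders and a conditional diffusion decoder; the sketch only shows what "content of one series, style of another" means operationally.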



FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Ahasan, Md Mubtasim, Khan, Rafat Hasan, Mohiuddin, Tasnim, Chadha, Aman, Iqbal, Tariq, Amin, M Ashraful, Ali, Amin Ahsan, Islam, Md Mofijul, Rahman, A K M Mahbubur

arXiv.org Artificial Intelligence

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Tokenization is a cornerstone of natural language processing (NLP), enabling language models to represent text in discrete units for efficient autoregressive modeling and scalable downstream applications (Schmidt et al., 2024).
Inspired by this paradigm, the speech domain has increasingly adopted neural codecs, popularized by EnCodec (Défossez et al., 2022) and SoundStream (Zeghidour et al., 2022). However, learning discrete speech representations is more challenging than text due to the continuous and multidimensional nature of speech (Ju et al., 2024). While neural codecs learn acoustic representations (waveform and low-level signal characteristics), they struggle to capture high-level semantics, requiring downstream models to adopt additional self-supervised masked language objectives to derive semantic representations (phonetic content and linguistic meaning) (Borsos et al., 2023). Yet another fundamental aspect of human speech remains missing: speech is inherently grounded in context and surrounding cues (Brown et al., 2022).
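The first two techniques named in the abstract can be sketched in a few lines. This is a toy illustration, not FuseCodec's architecture: the frame count, feature dimensions, and random matrices standing in for learned projections are all invented. It shows project-and-sum fusion of three streams into one latent space, and a globally pooled target broadcast back to every frame.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_ac, d_sem, d_ctx, d = 50, 128, 768, 1024, 256

acoustic = rng.normal(size=(T, d_ac))   # codec encoder latents per frame
semantic = rng.normal(size=(T, d_sem))  # e.g. self-supervised speech features
context = rng.normal(size=(T, d_ctx))   # e.g. LM features aligned to frames

# hypothetical learned projections into a shared space (random here)
W_ac, W_sem, W_ctx = (rng.normal(size=(k, d)) / np.sqrt(k)
                      for k in (d_ac, d_sem, d_ctx))

# latent representation fusion: project each stream and sum in latent space
fused = acoustic @ W_ac + semantic @ W_sem + context @ W_ctx

# global supervision target: mean-pool over time, broadcast to every frame
global_target = fused.mean(axis=0, keepdims=True).repeat(T, axis=0)
```

In the real system the projections are trained end-to-end and the broadcast target supervises the discrete tokens; the sketch only shows the tensor-level shape of both operations.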


SpeechOp: Inference-Time Task Composition for Generative Speech Processing

Lovelace, Justin, Kumar, Rithesh, Su, Jiaqi, Chen, Ke, Weinberger, Kilian Q, Jin, Zeyu

arXiv.org Artificial Intelligence

While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop
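The abstract does not give SpeechOp's exact composition rule, but inference-time composition of diffusion tasks is commonly done in the classifier-free-guidance style, where task-conditioned denoiser predictions are blended against an unconditional one. A minimal sketch of that generic pattern, with toy arrays standing in for real denoiser outputs:

```python
import numpy as np

def compose_guidance(eps_uncond, task_preds, weights):
    # generic guidance composition:
    # eps = eps_uncond + sum_i w_i * (eps_task_i - eps_uncond)
    out = np.asarray(eps_uncond, dtype=float).copy()
    for eps, w in zip(task_preds, weights):
        out += w * (np.asarray(eps, dtype=float) - out * 0 - eps_uncond)
    return out

# toy stand-ins for an unconditional prediction and two task-conditioned ones
eps_uncond = np.zeros(4)
eps_enhance = np.ones(4)    # e.g. enhancement-conditioned prediction
eps_tts = np.full(4, 2.0)   # e.g. transcript-conditioned prediction

combined = compose_guidance(eps_uncond, [eps_enhance, eps_tts], [0.5, 0.5])
```

Each weight trades off how strongly the corresponding task steers the shared denoising trajectory; ITC's transcript guidance would correspond to one such conditioned term.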


Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation

Lei, Yuanyuan, Huang, Ruihong

arXiv.org Artificial Intelligence

Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document event reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful for revealing bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relations to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. First, we convert the graph into natural language descriptions and feed the textualized graph into LLMs as part of a hard text prompt. Second, we encode the graph with a graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias, while also improving content preservation.
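The first strategy, textualizing the graph into a hard prompt, can be sketched directly. The events, relation inventory, and templates below are invented examples, not the paper's actual relation types or wording:

```python
# toy event relation graph as (event, relation, event) triples
edges = [
    ("protests erupted", "cause", "curfew imposed"),
    ("curfew imposed", "coreference", "lockdown announced"),
]

# hypothetical relation-specific templates for rendering edges as sentences
TEMPLATES = {
    "cause": "Event '{a}' causes event '{b}'.",
    "coreference": "Event '{a}' and event '{b}' refer to the same real-world event.",
}

def textualize(edges):
    # render every edge with its relation's template, joined into one passage
    return " ".join(TEMPLATES[rel].format(a=a, b=b) for a, rel, b in edges)

# the textualized graph becomes part of the hard prompt given to the LLM
hard_prompt = (
    "Use the following event relations to write a neutral summary.\n"
    + textualize(edges)
)
```

The soft-prompt strategy replaces this string with graph-attention-network embeddings injected into the model's input sequence instead of its text.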


Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

Wang, Xintong, Liu, Yixiao, Pan, Jingheng, Ding, Liang, Wang, Longyue, Biemann, Chris

arXiv.org Artificial Intelligence

Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models of varying architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
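The sentiment-polarity dimension of the evaluation boils down to checking that a rewrite keeps the original's polarity sign. A toy lexicon-based sketch of that check, using an invented English word list in place of the trained Chinese sentiment classifier such an evaluation would actually use:

```python
# hypothetical toy sentiment lexicon; a real evaluation would use a classifier
LEXICON = {"love": 1, "great": 1, "fun": 1, "hate": -1, "awful": -1, "trash": -1}

def polarity(text):
    # sum word scores and reduce to a sign: -1, 0, or +1
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    return (score > 0) - (score < 0)

def polarity_consistent(toxic, rewrite):
    # a detoxified rewrite should keep the original sentiment polarity
    return polarity(toxic) == polarity(rewrite)

ok = polarity_consistent("this movie is awful trash", "this movie is awful")
bad = polarity_consistent("this movie is awful trash", "this movie is great fun")
```

The first rewrite removes the toxic term but stays negative, so it passes; the second flips to positive, the over-polite failure mode the dataset is built to penalize.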