content preservation
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (2 more...)
- Research Report > Experimental Study (0.93)
- Workflow (0.67)
- Information Technology (0.92)
- Media > Photography (0.42)
79ec2a4246feb2126ecf43c4a4418002-Paper.pdf
We formulate the decoding process as an optimization problem, which allows multiple attributes we aim to control to be easily incorporated as differentiable constraints to the optimization. By relaxing this discrete optimization to a continuous one, we make use of Lagrangian multipliers and gradient-descent based techniques to generate the desired text.
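The recipe in this abstract — relax decoding to continuous variables, express an attribute as a differentiable constraint, and optimize a Lagrangian by gradient descent on the primal and dual ascent on the multiplier — can be illustrated with a toy quadratic stand-in for the LM objective (this is not the paper's actual energy; all quantities here are made up):

```python
import numpy as np

# Toy sketch: find the point closest to the model's preferred point x0
# that satisfies a differentiable attribute constraint a @ x >= b,
# via gradient descent on a Lagrangian with dual ascent on the multiplier.
x = np.zeros(2)                     # continuous relaxation of the "text"
x0 = np.zeros(2)                    # stand-in for the LM's preferred point
a, b = np.array([1.0, 0.0]), 1.0    # attribute constraint: a @ x >= b
lam = 0.0                           # Lagrangian multiplier
lr = 0.2

for _ in range(500):
    # Lagrangian: 0.5 * ||x - x0||^2 + lam * (b - a @ x)
    grad_x = (x - x0) - lam * a     # primal gradient
    x -= lr * grad_x                # gradient descent on x
    lam = max(0.0, lam + lr * (b - a @ x))  # dual ascent, projected to >= 0

# x converges to the closest feasible point; lam settles at its optimal value
```

With these step sizes the iterates spiral into the saddle point: the constraint becomes active (`a @ x == b`) and the multiplier stabilizes, mirroring how the constrained decoding scheme trades off fluency against attribute satisfaction.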
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > Michigan (0.04)
- North America > Canada > Quebec > Montreal (0.04)
DiffStyleTS: Diffusion Model for Style Transfer in Time Series
Nagda, Mayank, Ostheimer, Phil, Arweiler, Justus, Jungjohann, Indra, Werner, Jennifer, Wagner, Dennis, Muraleedharan, Aparna, Jafari, Pouya, Schmid, Jochen, Jirasek, Fabian, Burger, Jakob, Bortz, Michael, Hasse, Hans, Mandt, Stephan, Kloft, Marius, Fellenz, Sophie
Style transfer combines the content of one signal with the style of another. It supports applications such as data augmentation and scenario simulation, helping machine learning models generalize in data-scarce domains. While well developed in vision and language, style transfer methods for time series data remain limited. We introduce DiffTSST, a diffusion-based framework that disentangles a time series into content and style representations via convolutional encoders and recombines them through a self-supervised attention-based diffusion process. At inference, encoders extract content and style from two distinct series, enabling conditional generation of novel samples to achieve style transfer. We demonstrate both qualitatively and quantitatively that DiffTSST achieves effective style transfer. We further validate its real-world utility by showing that data augmentation with DiffTSST improves anomaly detection in data-scarce regimes.
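The content/style split described above can be mimicked with a far simpler analogy (this is NOT the paper's diffusion model — just an instance-normalization-style toy where a series' mean and scale play the role of "style" and its normalized shape plays the role of "content"):

```python
import numpy as np

# Toy analogy: "style" = (mean, std), "content" = the normalized shape.
def split_content_style(x):
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma, (mu, sigma)   # content, style

def recombine(content, style):
    mu, sigma = style
    return content * sigma + mu

t = np.linspace(0, 2 * np.pi, 100)
a = np.sin(t)                   # content source: a sinusoid
b = 5.0 + 2.0 * np.cos(3 * t)   # style source: different offset/amplitude

content_a, _ = split_content_style(a)
_, style_b = split_content_style(b)
transferred = recombine(content_a, style_b)  # a's shape with b's statistics
```

The transferred series keeps the sinusoid's shape but inherits the second series' mean and scale; the paper's contribution is learning a much richer, attention-based version of this decomposition with convolutional encoders and a diffusion decoder.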
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California (0.04)
- (8 more...)
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- (3 more...)
- Research Report > Experimental Study (0.93)
- Workflow (0.93)
FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Ahasan, Md Mubtasim, Khan, Rafat Hasan, Mohiuddin, Tasnim, Chadha, Aman, Iqbal, Tariq, Amin, M Ashraful, Ali, Amin Ahsan, Islam, Md Mofijul, Rahman, A K M Mahbubur
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Tokenization is a cornerstone of natural language processing (NLP), enabling language models to represent text in discrete units for efficient autoregressive modeling and scalable downstream applications (Schmidt et al., 2024).
Inspired by this paradigm, the speech domain has increasingly adopted neural codecs, popularized by EnCodec (Défossez et al., 2022) and SoundStream (Zeghidour et al., 2022). However, learning discrete speech representations is more challenging than text due to the continuous and multidimensional nature of speech (Ju et al., 2024). While neural codecs learn acoustic representations (waveform and low-level signal characteristics), they struggle to capture high-level semantics, requiring downstream models to adopt additional self-supervised masked language objectives to derive semantic representations (phonetic content and linguistic meaning) (Borsos et al., 2023). Yet another fundamental aspect of human speech remains missing: speech is inherently grounded in context and surrounding cues (Brown et al., 2022).
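The first of the three techniques, Latent Representation Fusion, can be sketched as projecting semantic and contextual features into the codec's latent space and adding them to the acoustic latents before quantization. The shapes, projection scheme, and random features below are purely illustrative assumptions, not FuseCodec's actual architecture:

```python
import numpy as np

# Hypothetical sketch of latent representation fusion for a neural codec.
rng = np.random.default_rng(0)
T, d_ac, d_sem, d_ctx = 50, 128, 768, 1024   # frames and feature widths (made up)

acoustic = rng.standard_normal((T, d_ac))     # codec encoder latents
semantic = rng.standard_normal((T, d_sem))    # e.g. from a self-supervised speech model
contextual = rng.standard_normal((T, d_ctx))  # e.g. from a pre-trained language model

# Learned linear projections into the codec latent space (random stand-ins here)
W_sem = rng.standard_normal((d_sem, d_ac)) / np.sqrt(d_sem)
W_ctx = rng.standard_normal((d_ctx, d_ac)) / np.sqrt(d_ctx)

fused = acoustic + semantic @ W_sem + contextual @ W_ctx   # unified latents

# Global supervision signal: pool over time, then broadcast to every frame
global_target = fused.mean(axis=0, keepdims=True).repeat(T, axis=0)
```

The pooled-and-broadcast `global_target` mirrors the abstract's second technique: every frame is supervised against one globally informed representation, encouraging temporal consistency.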
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Virginia (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (4 more...)
SpeechOp: Inference-Time Task Composition for Generative Speech Processing
Lovelace, Justin, Kumar, Rithesh, Su, Jiaqi, Chen, Ke, Weinberger, Kilian Q, Jin, Zeyu
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline where ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop
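Inference-time task composition in diffusion models is often realized by blending the noise predictions of differently conditioned denoising passes. The sketch below shows that generic guidance pattern only; SpeechOp's actual composition rule, task names, and weights are not specified here and everything in the snippet is an assumption:

```python
import numpy as np

# Generic sketch: combine per-task denoiser outputs with normalized weights.
def compose_tasks(eps_by_task, weights):
    """Weighted combination of per-task noise predictions (toy stand-ins)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights form a convex mix
    return sum(weights[k] * eps_by_task[k] for k in eps_by_task)

rng = np.random.default_rng(0)
latent_shape = (16, 8)  # hypothetical latent dimensions
eps = {
    "enhance": rng.standard_normal(latent_shape),  # speech-to-speech branch
    "tts": rng.standard_normal(latent_shape),      # transcript-guided branch
}
eps_combined = compose_tasks(eps, {"enhance": 0.6, "tts": 0.4})
```

In an ITC-like pipeline, the transcript-conditioned branch would be fed an ASR hypothesis (e.g. from Whisper), so the composed update pulls the enhanced audio toward content the recognizer heard.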
- Information Technology > Artificial Intelligence > Natural Language (0.95)
- Information Technology > Artificial Intelligence > Vision (0.89)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation
Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document events reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful to reveal bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relation to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. Firstly, we convert a graph into natural language descriptions and feed the textualized graph into LLMs as a part of a hard text prompt. Secondly, we encode the graph with graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias, and meanwhile improves content preservation.
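The first strategy above, converting the event relation graph into natural language for a hard text prompt, is easy to sketch. The toy graph, event names, and relation labels below are invented for illustration; only the relation vocabulary follows the abstract's description:

```python
# Hedged sketch: serialize a toy event-relation graph into text that could be
# prepended to an LLM prompt (the "textualized graph" hard-prompt strategy).
events = {"e1": "troops deployed", "e2": "protest erupted", "e3": "curfew imposed"}
relations = [("e1", "causes", "e2"), ("e2", "coreferent_with", "e3")]  # toy edges

def textualize(events, relations):
    # List each event, then state each relation as a plain-English sentence.
    lines = [f"Event {k}: {v}." for k, v in events.items()]
    for head, rel, tail in relations:
        lines.append(f"{events[head]} {rel.replace('_', ' ')} {events[tail]}.")
    return " ".join(lines)

prompt_prefix = textualize(events, relations)
```

The second strategy skips this serialization entirely: a graph attention network encodes the same structure into a continuous embedding inserted as a soft prompt, trading human readability for a denser signal.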
- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > Texas > Brazos County > College Station (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- (17 more...)
- Media > News (1.00)
- Law (0.93)
- Government > Regional Government > North America Government > United States Government (0.68)
Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites
Wang, Xintong, Liu, Yixiao, Pan, Jingheng, Ding, Liang, Wang, Longyue, Biemann, Chris
Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with varying architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
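The triplet structure described for ToxiRewriteCN (toxic sentence, sentiment-aligned rewrite, labeled toxic spans) maps naturally onto a small record type. The example instance below is a fabricated placeholder, not real dataset content:

```python
from dataclasses import dataclass

# Illustrative record type matching the dataset's described triplet structure.
@dataclass
class ToxiRewriteExample:
    toxic: str
    rewrite: str
    toxic_spans: list  # (start, end) character offsets into `toxic`

ex = ToxiRewriteExample(
    toxic="you are a ___ idiot",            # placeholder, not real data
    rewrite="I strongly disagree with you",  # sentiment-aligned, non-toxic
    toxic_spans=[(10, 19)],
)

# Span labels support masking or span-level toxicity evaluation:
start, end = ex.toxic_spans[0]
masked = ex.toxic[:start] + "*" * (end - start)
```

Character-offset spans like these let evaluators score detoxification locally (was the flagged span neutralized?) while content preservation and sentiment polarity are judged on the full rewrite.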
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)