AITopics | style token

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > China (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Neural Information Processing SystemsDec-24-2025, 18:31:14 GMT

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i.e., TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens by attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images with (1024x1024) size, dispensing with convolutions entirely.

image synthesis, token-based generator, visual quality, (7 more...)

Technology: Information Technology > Artificial Intelligence (1.00)

arXiv.org Artificial IntelligenceAug-19-2025

Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models

Sanjyal, Ankit

Diffusion models have become a powerful backbone for text-to-image generation, producing high-quality visuals from natural language prompts. However, when prompts involve multiple objects alongside global or local style instructions, the outputs often drift in style and lose spatial coherence, limiting their reliability for controlled, style-consistent scene generation. We present Local Prompt Adaptation (LPA), a lightweight, training-free method that splits the prompt into content and style tokens, then injects them selectively into the U-Net's attention layers at chosen timesteps. By conditioning object tokens early and style tokens later in the denoising process, LPA improves both layout control and stylistic uniformity without additional training cost. We conduct extensive ablations across parser settings and injection windows, finding that the best configuration -- lpa late only with a 300-650 step window -- delivers the strongest balance of prompt alignment and style consistency. On the T2I benchmark, LPA improves CLIP-prompt alignment over vanilla SDXL by +0.41% and over SD1.5 by +0.34%, with no diversity loss. On our custom 50-prompt style-rich benchmark, LPA achieves +0.09% CLIP-prompt and +0.08% CLIP-style gains over baseline. Our method is model-agnostic, easy to integrate, and requires only a single configuration change, making it a practical choice for controllable, style-consistent multi-object generation.

artificial intelligence, machine learning, natural language, (15 more...)

2507.20094

Country: Asia (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Neural Information Processing SystemsAug-16-2025, 21:20:32 GMT

Improving Visual Quality of Image Synthesis by AT oken-based Generator with Transformers

This work was done while Y anhong Zeng was a research intern at Microsoft Research Asia.

artificial intelligence, machine learning, natural language, (17 more...)

Country:

Asia > China (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Neural Information Processing SystemsJan-18-2025, 17:39:49 GMT

Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

image synthesis, token-based generator, transformer, (4 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.98)

Chandra, Shreeram Suresh, Du, Zongyang, Sisman, Berrak

Exploring speech style spaces with language models: Emotional TTS without emotion labels

arXiv.org Artificial IntelligenceMay-18-2024

Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.

emotion label, speech, style space, (15 more...)

2405.11413

Country:

North America > Canada > Quebec > Montreal (0.05)
North America > United States > Texas > Dallas County > Dallas (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Kumar, Neeraj, Narang, Ankur, Lall, Brejesh

Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN

arXiv.org Artificial IntelligenceOct-27-2023

In this paper, we present a Diffusion GAN based approach (Prosodic Diff-TTS) to generate the corresponding high-fidelity speech based on the style description and content text as an input to generate speech samples within only 4 denoising steps. It leverages the novel conditional prosodic layer normalization to incorporate the style embeddings into the multi head attention based phoneme encoder and mel spectrogram decoder based generator architecture to generate the speech. The style embedding is generated by fine tuning the pretrained BERT model on auxiliary tasks such as pitch, speaking speed, emotion,gender classifications. We demonstrate the efficacy of our proposed architecture on multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generated accuracy and MOS.

artificial intelligence, machine learning, natural language, (13 more...)

2310.18169

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Vision (0.67)

Kim, Daegyeom, Hong, Seongho, Choi, Yong-Hoon

SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

arXiv.org Artificial IntelligenceJul-19-2023

Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset, in order to control various characteristics of speech and generate the desired voice. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model (called VALL-E), which follows the structure of the generative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes input from text sentences and prompt audio and is designed to generate controllable speech by not simply mimicking the characteristics of the prompt audio but by controlling the attributes to produce diverse voices. We identify tokens in the style embedding matrix of the newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these attributes. To evaluate the performance of SC VALL-E, we conduct comparative experiments with three representative expressive speech synthesis models: global style token (GST) Tacotron2, variational autoencoder (VAE) Tacotron2, and original VALL-E. We measure word error rate (WER), F0 voiced error (FVE), and F0 gross pitch error (F0GPE) as evaluation metrics to assess the accuracy of generated sentences. For comparing the quality of synthesized speech, we measure comparative mean option score (CMOS) and similarity mean option score (SMOS). To evaluate the style control ability of the generated speech, we observe the changes in F0 and mel-spectrogram by modifying the trained tokens. When using prompt audio that is not present in the training data, SC VALL-E generates a variety of expressive sounds and demonstrates competitive performance compared to the existing models. Our implementation, pretrained models, and audio samples are located on GitHub.

artificial intelligence, machine learning, natural language, (19 more...)

2307.1055

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
North America > United States > Rhode Island (0.04)
North America > Canada > Ontario > Toronto (0.04)
(9 more...)

Genre: Research Report (0.64)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Hortal, Enrique, Alarcia, Rodrigo Brechard

GANtron: Emotional Speech Synthesis with Generative Adversarial Networks

arXiv.org Artificial IntelligenceOct-6-2021

Speech synthesis is used in a wide variety of industries. Nonetheless, it always sounds flat or robotic. The state of the art methods that allow for prosody control are very cumbersome to use and do not allow easy tuning. To tackle some of these drawbacks, in this work we target the implementation of a text-to-speech model where the inferred speech can be tuned with the desired emotions. To do so, we use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism. We evaluate four different configurations considering different inputs and training strategies, study them and prove how our best model can generate speech files that lie in the same distribution as the initial training dataset. Additionally, a new strategy to boost the training convergence by applying a guided attention loss is proposed.

dataset, emotional content, style token, (15 more...)

2110.0339

Country:

Europe > Netherlands > Limburg > Maastricht (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Stanton, Daisy, Wang, Yuxuan, Skerry-Ryan, RJ

Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

arXiv.org Machine LearningAug-3-2018

Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. We show that, when trained on a dataset of expressive speech, our system generates audio with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Machine Learning

1808.0141

Country:

North America > United States > California > Santa Clara County > Mountain View (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Media (0.37)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)