Fine-grained condition


Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Wang, Cong, Deng, Zexuan, Jiang, Zhiwei, Yin, Yafeng, Shen, Fei, Cheng, Zifeng, Ge, Shiping, Gan, Shiwei, Gu, Qing

arXiv.org Artificial Intelligence

Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across multiple metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
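The FSQ step above — compressing continuous condition embeddings into discrete tokens — can be illustrated with a minimal NumPy sketch of finite scalar quantization: each latent dimension is bounded and rounded to one of a few integer levels, and the resulting grid cell indexes a discrete token. The function name and the odd level counts are illustrative assumptions, not SignViP's actual configuration:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch: bound each latent dimension with
    tanh, round it to one of `levels[d]` integer values, and map the
    resulting grid cell to a single discrete token index."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0            # e.g. 7 levels -> values in (-3, 3)
    z_bounded = np.tanh(z) * half        # squash each dim into its level range
    z_quant = np.round(z_bounded)        # snap to the nearest integer level
    shifted = (z_quant + half).astype(int)          # shift into [0, levels-1]
    bases = np.cumprod(np.concatenate(([1], levels[:-1])))
    tokens = (shifted * bases).sum(axis=-1)         # mixed-radix token index
    return z_quant, tokens

# toy usage: one 3-dim latent vector, codebook of 7 * 7 * 5 = 245 tokens
z = np.array([[0.3, -1.2, 2.0]])
q, tok = fsq_quantize(z, levels=[7, 7, 5])
```

Unlike a learned VQ codebook, the "codebook" here is an implicit integer grid, which is why FSQ needs no codebook loss or commitment term.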


Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Chen, Hejia, Zhang, Haoxian, Zhang, Shoulong, Liu, Xiaoqiang, Zhuang, Sisi, Zhang, Yuan, Wan, Pengfei, Zhang, Di, Li, Shuai

arXiv.org Artificial Intelligence

A speech-driven 3D talking face method should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences, which limits flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions makes it challenging to achieve satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), without degrading speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance for its fine-grained control in user studies. Project page: https://harryxd2018.github.io/cafe-talk/
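The mask-based CFG idea above — regulating where and how strongly fine-grained guidance takes effect — can be sketched in its generic form: the usual classifier-free guidance update is gated by a spatiotemporal mask. This is a standard masked CFG formulation, not Cafe-Talk's exact equation, and all names here are illustrative:

```python
import numpy as np

def masked_cfg(eps_base, eps_fine, scale, mask):
    """Masked classifier-free guidance sketch.

    eps_base : model output without the fine-grained condition
    eps_fine : model output with the fine-grained condition
    scale    : guidance intensity
    mask     : 0/1 (or soft) weights selecting where guidance applies
    """
    # Where mask == 0 the base prediction passes through unchanged;
    # where mask == 1 the fine-grained guidance is applied at `scale`.
    return eps_base + mask * scale * (eps_fine - eps_base)

# toy usage: guide only the first of two frames
eps_u = np.zeros((2, 3))
eps_c = np.ones((2, 3))
mask = np.array([[1.0], [0.0]])        # broadcasts over the feature axis
out = masked_cfg(eps_u, eps_c, 2.0, mask)
```

Because the mask multiplies the guidance term rather than the condition itself, occurrence (where) and intensity (how much) can be controlled independently.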


Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, Hsu, Wei-Ning

arXiv.org Artificial Intelligence

In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.

Our contributions are as follows: (1) we propose Voicebox Adapter, which augments Voicebox, a pre-trained speech generation model, with fine-grained controllability; (2) we explore different efficient fine-tuning methods to bridge the gap between pre-trained parameters and new fine-grained conditioning modules; (3) we show that Voicebox Adapter can generalize across various fine-grained conditions, attaining performance comparable to that achieved by fine-tuning the entire model with significantly fewer fine-tuned parameters; (4) we conduct experiments using varying amounts of fine-tuning data and different hidden dimension sizes, analyzing the performance of Voicebox Adapter under these setups.
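The "LoRA with bias-tuning" configuration mentioned above can be sketched as a frozen linear layer augmented with a trainable low-rank update plus a trainable bias. This is a generic illustration of the technique; the class name, dimensions, and initialization are assumptions, not the paper's actual setup:

```python
import numpy as np

class LoRALinear:
    """LoRA-with-bias-tuning sketch: y = x @ W_frozen + (x @ A) @ B * scale + b.

    W is the frozen pre-trained weight; only A, B, and b would be trained."""

    def __init__(self, W, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                    # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, (d_in, rank))    # trainable down-projection
        self.B = np.zeros((rank, d_out))              # zero-init up-projection:
        self.b = np.zeros(d_out)                      #   adapter starts as a no-op
        self.scale = alpha / rank                     # standard LoRA scaling

    def __call__(self, x):
        return x @ self.W + (x @ self.A) @ self.B * self.scale + self.b

# toy usage: before any training, the adapted layer matches the frozen one
rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))
layer = LoRALinear(W)
x = rng.normal(size=(2, 6))
y = layer(x)
```

Zero-initializing B (and b) is the conventional choice: the adapted model starts exactly at the pre-trained behavior, which matches the goal of smoothly integrating new modules with pre-trained ones.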