text-to-video generation
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Hong Kong (0.04)
- (7 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
Mixture of Contexts for Long Video Generation
Cai, Shengqu; Yang, Ceyuan; Zhang, Lvmin; Guo, Yuwei; Xiao, Junfei; Yang, Ziyan; Xu, Yinghao; Yang, Zhenheng; Yuille, Alan; Guibas, Leonidas; Agrawala, Maneesh; Jiang, Lu; Wetzstein, Gordon
Long video generation is fundamentally a long context memory problem: models must retain and retrieve salient events across a long range without collapsing or drifting. However, scaling diffusion transformers to generate long-context videos is fundamentally limited by the quadratic cost of self-attention, which makes memory and computation intractable and difficult to optimize for long sequences. We recast long-context video generation as an internal information retrieval task and propose a simple, learnable sparse attention routing module, Mixture of Contexts (MoC), as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model allocates compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which enables practical training and synthesis, and the emergence of memory and consistency at the scale of minutes.
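The routing the abstract describes can be made concrete in a few lines of attention code. Below is a minimal sketch, assuming mean-pooled chunk descriptors, a fixed chunk size, and a dense reference computation in place of the gather-based kernel that would deliver near-linear cost; none of these choices are claimed to match the authors' implementation.

```python
import torch
import torch.nn.functional as F

def moc_attention(q, k, v, chunk=64, topk=2, anchor_chunks=(0,)):
    # q, k, v: (seq, dim); this sketch assumes seq is divisible by chunk
    # and that topk does not exceed the number of chunks.
    seq, dim = q.shape
    n = seq // chunk
    desc = k.view(n, chunk, dim).mean(dim=1)      # mean-pooled chunk descriptors
    scores = q @ desc.T / dim ** 0.5              # (seq, n) query-to-chunk routing

    q_chunk = torch.arange(seq, device=q.device) // chunk
    causal = torch.arange(n, device=q.device)[None, :] <= q_chunk[:, None]
    scores = scores.masked_fill(~causal, float("-inf"))

    sel = torch.zeros(seq, n, dtype=torch.bool, device=q.device)
    sel.scatter_(1, scores.topk(topk, dim=-1).indices, True)   # top-k chunks
    sel[:, list(anchor_chunks)] = True            # mandatory anchor (e.g. caption)
    sel[torch.arange(seq, device=q.device), q_chunk] = True    # local window
    sel &= causal                                 # causal routing: no loop closures

    # Dense reference computation under the routed mask; a real kernel would
    # gather only the selected chunks, which is where near-linear scaling comes from.
    attn = q @ k.T / dim ** 0.5
    mask = sel.repeat_interleave(chunk, dim=1)    # expand chunk mask to tokens
    tok_causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=q.device))
    attn = attn.masked_fill(~(mask & tok_causal), float("-inf"))
    return F.softmax(attn, dim=-1) @ v
```

As the routing sparsifies (smaller topk relative to n), each query attends to a shrinking fraction of history, which is the compute-allocation behavior the abstract attributes to scaling.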
- Research Report (0.50)
- Instructional Material > Course Syllabus & Notes (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
TempoControl: Temporal Attention Guidance for Text-to-Video Models
Schiber, Shira; Lindenbaum, Ofir; Schwartz, Idan
Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation. Please see our project page for more details: https://shira-schiber.github.io/TempoControl/.
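Read literally, the three principles translate into three differentiable terms on the per-frame cross-attention profile of the target word. The sketch below is a hedged guess at that loss, assuming a cosine-style correlation, a masked-mean magnitude term, and an entropy penalty whose sign and weighting are assumptions; the paper optimizes such a signal at inference time rather than by retraining.

```python
import torch

def tempo_control_loss(attn_profile, control, w_corr=1.0, w_mag=0.5, w_ent=0.1):
    # attn_profile: (T,) cross-attention mass for the target word at each frame.
    # control:      (T,) desired visibility signal (e.g. 1 where it should appear).
    a = attn_profile
    s = control.float()

    # (1) Correlation: align the temporal pattern of attention with the signal.
    a_c, s_c = a - a.mean(), s - s.mean()
    corr = (a_c * s_c).sum() / (a_c.norm() * s_c.norm() + 1e-8)

    # (2) Magnitude: demand attention strength on frames requiring visibility.
    mag = (a * s).sum() / (s.sum() + 1e-8)

    # (3) Entropy: keep the normalized profile from smearing uniformly,
    # preserving a confident, semantically consistent allocation (sign assumed).
    p = a / (a.sum() + 1e-8)
    ent = -(p * (p + 1e-8).log()).sum()

    return -w_corr * corr - w_mag * mag + w_ent * ent
```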
- Transportation (0.70)
- Leisure & Entertainment > Sports (0.47)
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Liu, Ruiying; Liang, Yuanzhi; Huang, Haibin; Yu, Tianshu; Zhang, Chi
Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual-visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores. Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.
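A compact way to see the two mechanisms is as reweightings of standard GRPO group-normalized advantages. The sketch below assumes a Gaussian trust model around the semantic prior anchor and a clamped deviation ratio for the renormalization; both functional forms are illustrative stand-ins, not the paper's equations.

```python
import torch

def bpgo_advantages(rewards, prior_mean, prior_std=1.0):
    # rewards: (G,) reward-model scores for one group of rollouts of a prompt.
    # prior_mean: scalar semantic-prior anchor for this prompt (assumed given).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # plain GRPO

    # Inter-group Bayesian trust allocation: groups whose mean reward agrees
    # with the prior get more weight; ambiguous groups are down-weighted.
    trust = torch.exp(-0.5 * ((rewards.mean() - prior_mean) / prior_std) ** 2)

    # Intra-group prior-anchored renormalization: expand large (confident)
    # deviations from the anchor, compress small (uncertain) ones.
    dev = (rewards - prior_mean).abs()
    scale = (dev / (dev.mean() + 1e-8)).clamp(0.5, 2.0)
    return trust * adv * scale
```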
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Video Text Preservation with Synthetic Text-Rich Videos
Liu, Ziyang; Valencia, Kevin; Cui, Justin
While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render even short phrases or words correctly, and previous attempts to address this problem are computationally expensive and not suitable for video generation. In this work, we investigate a lightweight approach to improve T2V diffusion models using synthetic supervision. We first generate text-rich images using a text-to-image (T2I) diffusion model, then animate them into short videos using a text-agnostic image-to-video (I2V) model. These synthetic video-prompt pairs are used to fine-tune Wan2.1, a pre-trained T2V model, without any architectural changes. Our results show improvement in short-text legibility and temporal consistency, with emerging structural priors for longer text. These findings suggest that curated synthetic data and weak supervision offer a practical path toward improving textual fidelity in T2V generation.
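The data pipeline is simple enough to state as a loop. The sketch below uses hypothetical `t2i`, `i2v`, and `save_pair` callables as stand-ins for whichever checkpoints and storage are actually used; only the T2I-then-I2V structure comes from the abstract.

```python
def build_synthetic_corpus(phrases, t2i, i2v, save_pair, n_frames=49):
    # phrases: target strings the fine-tuned model should learn to render.
    for phrase in phrases:
        prompt = f'a storefront sign that reads "{phrase}"'  # assumed template
        image = t2i(prompt)                     # text-rich still from a T2I model
        video = i2v(image, num_frames=n_frames) # text-agnostic animation (I2V)
        save_pair(video, prompt)                # supervision pair for T2V fine-tuning
```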
VC4VG: Optimizing Video Captions for Text-to-Video Generation
Du, Yang; Lin, Zhuoran; Song, Kaiqiang; Wang, Biao; Zheng, Zhicheng; Ge, Tiezheng; Zheng, Bo; Jin, Qin
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/alimama-creative/VC4VG to support further research.
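One way to picture a necessity-graded, multi-dimensional caption metric of the kind VC4VG-Bench describes is as a weighted coverage average over caption dimensions. The dimension names and weights below are invented placeholders; the benchmark's actual rubric ships with the released code.

```python
# Hypothetical necessity grades per caption dimension (higher = more essential).
NECESSITY = {"subject": 1.0, "action": 1.0, "scene": 0.7, "camera": 0.5, "style": 0.3}

def caption_score(per_dim_scores):
    # per_dim_scores: dict mapping dimension -> [0, 1] coverage for one caption.
    total = sum(NECESSITY[d] * per_dim_scores.get(d, 0.0) for d in NECESSITY)
    return total / sum(NECESSITY.values())
```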
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)