Animation


6 science milestones turning 40 this year

Popular Science

In 1986, we had huge leaps forward, tragic steps back, and life-changing innovations. NASA's STS-51L crew members pose for photographs during a break in countdown training at the White Room, Launch Complex 39, Pad B. Left to right are Teacher-in-Space payload specialist Sharon Christa McAuliffe; payload specialist Gregory Jarvis; and astronauts Judith A. Resnik, mission specialist; Francis R. (Dick) Scobee, mission commander; Ronald E. McNair, mission specialist; Michael J. Smith, pilot; and Ellison S. Onizuka, mission specialist. It was a year that saw roughly six million Americans hold hands in a more-or-less continuous line across the country to raise money to fight hunger and homelessness. A news anchor named Oprah Winfrey debuted her new talk show.


CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes

Neural Information Processing Systems

Causal video question answering (QA) has garnered increasing interest, yet existing datasets often lack depth in causal reasoning. To address this gap, we capitalize on the unique properties of cartoons and construct CausalChaos!, a novel, challenging causal Why-QA dataset built upon the iconic Tom and Jerry cartoon series. Cartoons use the principles of animation that allow animators to create expressive, unambiguous causal relationships between events to form a coherent storyline. Utilizing these properties, along with thought-provoking questions and multi-level answers (answer and detailed causal explanation), our questions involve causal chains that interconnect multiple dynamic interactions between characters and visual scenes. These factors demand that models solve more challenging, yet well-defined, causal relationships. We also introduce hard incorrect answer mining, including a causally confusing version that is even more challenging. While models perform well, there is much room for improvement, especially on open-ended answers. We identify more advanced and explicit causal relationship modeling, and joint modeling of vision and language, as the immediate areas for future efforts. Along with other complementary datasets, our new challenging dataset will pave the way for these developments in the field.
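
For readers unfamiliar with multi-level causal QA datasets, the sketch below shows what one CausalChaos!-style entry might look like: a short answer, a detailed causal-chain explanation, and mined hard negatives. The field names and sample content are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one CausalChaos!-style Why-QA entry.
# All field names and the sample content are illustrative only.
@dataclass
class CausalWhyQA:
    clip_id: str                 # source cartoon clip
    question: str                # causal "Why" question
    answer: str                  # short answer
    explanation: str             # detailed causal-chain explanation
    distractors: List[str] = field(default_factory=list)  # hard incorrect answers

example = CausalWhyQA(
    clip_id="tnj_ep012_shot045",
    question="Why did Tom fall off the ladder?",
    answer="Jerry pulled the ladder away.",
    explanation=(
        "Jerry saw Tom climbing toward his mouse hole, ran to the base of "
        "the ladder, and yanked it sideways, so Tom lost his support."
    ),
    distractors=[
        "Tom slipped on a banana peel.",        # plausible but causally wrong
        "The ladder broke under Tom's weight.",
    ],
)
```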


The Truth About the Avatar Movies That No One Wants to Accept

Slate

James Cameron is desperate to convince the world that these movies aren't "cartoons."


PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles

Han, Tianshun, Zhou, Benjia, Liu, Ajian, Liang, Yanyan, Zhang, Du, Lei, Zhen, Wan, Jun

arXiv.org Artificial Intelligence

PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It overcomes key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE) that captures both time-domain and frequency-domain audio features for fine-grained emotion analysis, and an Emotional Style Modeling Module (ESMM) that models individual expression patterns based on voiceprint characteristics. To address data scarcity, the method leverages a newly constructed 3D-EmoStyle dataset. Evaluations demonstrate that PESTalk outperforms state-of-the-art methods in producing realistic and personalized facial animations.
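
A minimal sketch of the dual-stream idea behind DSEE: one branch reads the raw waveform (time domain) while another reads an STFT magnitude spectrogram (frequency domain), and the two embeddings are fused. Layer sizes, the fusion rule, and all hyperparameters below are assumptions; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

# Sketch of a dual-stream emotion extractor in the spirit of DSEE:
# a time-domain branch on the raw waveform and a frequency-domain
# branch on a spectrogram, fused into one emotion embedding.
class DualStreamEmotionExtractor(nn.Module):
    def __init__(self, emb_dim: int = 128, n_fft: int = 512):
        super().__init__()
        self.n_fft = n_fft
        self.time_branch = nn.Sequential(      # raw-waveform features
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, emb_dim),
        )
        self.freq_branch = nn.Sequential(      # spectrogram features
            nn.Conv1d(n_fft // 2 + 1, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, emb_dim),
        )
        self.fuse = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples)
        t = self.time_branch(wav.unsqueeze(1))
        spec = torch.stft(wav, self.n_fft,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True).abs()   # (batch, bins, frames)
        f = self.freq_branch(spec)
        return self.fuse(torch.cat([t, f], dim=-1))

emotion = DualStreamEmotionExtractor()(torch.randn(2, 16000))  # (2, 128)
```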


AnimAgents: Coordinating Multi-Stage Animation Pre-Production with Human-Multi-Agent Collaboration

Wang, Wen-Fan, Lu, Chien-Ting, Ng, Jin Ping, Chiu, Yi-Ting, Lee, Ting-Ying, Wang, Miaosen, Chen, Bing-Yu, Chen, Xiang 'Anthony'

arXiv.org Artificial Intelligence

Animation pre-production lays the foundation of an animated film by transforming initial concepts into a coherent blueprint across interdependent stages such as ideation, scripting, design, and storyboarding. While generative AI tools are increasingly adopted in this process, they remain isolated, requiring creators to juggle multiple systems without integrated workflow support. Our formative study with 12 professional creative directors and independent animators revealed key challenges in their current practice: creators must manually coordinate fragmented outputs, manage large volumes of information, and struggle to maintain continuity and creative control between stages. Based on these insights, we present AnimAgents, a human-multi-agent collaborative system that coordinates complex, multi-stage workflows through a core agent and specialized agents, supported by dedicated boards for the four major stages of pre-production. AnimAgents enables stage-aware orchestration, stage-specific output management, and element-level refinement, providing an end-to-end workflow tailored to professional practice. In a within-subjects summative study with 16 professional creators, AnimAgents significantly outperformed a strong single-agent baseline equipped with advanced parallel image generation in coordination, consistency, information management, and overall satisfaction (p < .01). A field deployment with 4 creators further demonstrated AnimAgents' effectiveness in real-world projects.
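
The core-agent / specialized-agent split can be pictured as a small router that dispatches a creator's request to the right stage agent and records the output on that stage's board. Everything below is an illustrative assumption, not the paper's implementation; only the four stage names follow the text above.

```python
from typing import Callable, Dict, List

# Stand-in stage agents; real ones would wrap LLM / image-generation calls.
def ideation_agent(req: str) -> str:     return f"[ideation] concepts for: {req}"
def scripting_agent(req: str) -> str:    return f"[scripting] draft for: {req}"
def design_agent(req: str) -> str:       return f"[design] designs for: {req}"
def storyboard_agent(req: str) -> str:   return f"[storyboard] panels for: {req}"

STAGE_AGENTS: Dict[str, Callable[[str], str]] = {
    "ideation": ideation_agent,
    "scripting": scripting_agent,
    "design": design_agent,
    "storyboarding": storyboard_agent,
}

def core_agent(stage: str, request: str, boards: Dict[str, List[str]]) -> str:
    """Route a creator request to the matching stage agent and log the
    output on that stage's board so later stages can reference it."""
    result = STAGE_AGENTS[stage](request)
    boards.setdefault(stage, []).append(result)
    return result

boards: Dict[str, List[str]] = {}
core_agent("ideation", "a short film about a lighthouse keeper", boards)
core_agent("scripting", "expand the lighthouse concept into three scenes", boards)
```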


Once Upon an AI: Six Scaffolds for Child-AI Interaction Design, Inspired by Disney

Kurian, Nomisha

arXiv.org Artificial Intelligence

To build AI that children can intuitively understand and benefit from, designers need a design grammar that serves their developmental needs. This paper bridges artificial intelligence design for children - an emerging field still defining its best practices - and animation, a well-established field with decades of experience in engaging children through accessible storytelling. Pairing Piagetian developmental theory with design-pattern extraction from 52 works of animation, the paper presents a six-scaffold framework that integrates design insights transferable to child-centred AI design: (1) signals for visual animacy and clarity, (2) sound for musical and auditory scaffolding, (3) synchrony in audiovisual cues, (4) sidekick-style personas, (5) storyplay that supports symbolic play and imaginative exploration, and (6) structure in the form of predictable narratives. These strategies, long refined in animation, function as multimodal scaffolds for attention, understanding, and attunement, supporting learning and comfort. This structured design grammar is transferable to AI design. By reframing cinematic storytelling and child development theory as design logic for AI, the paper offers heuristics for AI that aligns with the cognitive stages and emotional needs of young users. The work contributes to design theory by showing how sensory, affective, and narrative techniques can inform developmentally attuned AI design. Future directions include empirical testing, cultural adaptation, and participatory co-design.


Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation

Mao, Yuxiang, Zhang, Zhijie, Zhang, Zhiheng, Liu, Jiawei, Zeng, Chen, Xia, Shihong

arXiv.org Artificial Intelligence

Expressions are fundamental to conveying human emotions. With the rapid advancement of AI-generated content (AIGC), realistic and expressive 3D facial animation has become increasingly crucial. Despite recent progress in speech-driven lip-sync for talking-face animation, generating emotionally expressive talking faces remains underexplored. A major obstacle is the scarcity of real emotional 3D talking-face datasets due to the high cost of data capture. To address this, we model facial animation driven by both speech and emotion as a linear additive problem. Leveraging a 3D talking-face dataset with neutral expressions (VOCAset) and a dataset of 3D expression sequences (Florence4D), we jointly learn a set of blendshapes driven by speech and emotion. We introduce a sparsity constraint loss to encourage disentanglement between the two types of blendshapes while allowing the model to capture inherent secondary cross-domain deformations present in the training data. The learned blendshapes can be further mapped to the expression and jaw pose parameters of the FLAME model, enabling the animation of 3D Gaussian avatars. Qualitative and quantitative experiments demonstrate that our method naturally generates talking faces with specified expressions while maintaining accurate lip synchronization. Perceptual studies further show that our approach achieves superior emotional expressivity compared to existing methods, without compromising lip-sync quality.
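
The linear additive formulation reduces to a simple sum: a neutral template, plus speech-driven blendshapes weighted by speech coefficients, plus expression-driven blendshapes weighted by emotion coefficients. The sketch below illustrates this, with an L1-style penalty standing in for the paper's sparsity constraint; the vertex count, basis sizes, and exact loss form are assumptions.

```python
import numpy as np

# Linear additive blendshape model: frame = template
#   + speech basis weighted by speech coefficients
#   + expression basis weighted by emotion coefficients.
V, S, E = 5023, 32, 16                        # vertex count (FLAME-like), basis sizes
template  = np.zeros((V, 3))                  # neutral face
B_speech  = np.random.randn(S, V, 3) * 0.01   # speech-driven blendshapes
B_emotion = np.random.randn(E, V, 3) * 0.01   # expression-driven blendshapes

def animate(w_speech: np.ndarray, w_emotion: np.ndarray) -> np.ndarray:
    """Compose one frame as a linear sum of the two blendshape sets."""
    return (template
            + np.tensordot(w_speech, B_speech, axes=1)
            + np.tensordot(w_emotion, B_emotion, axes=1))

def sparsity_loss(w_speech, w_emotion, lam: float = 1e-3) -> float:
    """Illustrative L1 penalty: encourage each frame to activate few
    blendshapes, nudging the two bases toward disentanglement."""
    return lam * (np.abs(w_speech).sum() + np.abs(w_emotion).sum())

frame = animate(np.random.rand(S), np.random.rand(E))   # (V, 3) vertices
```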


Environment-aware Motion Matching

Ponton, Jose Luis, Andrews, Sheldon, Andujar, Carlos, Pelechano, Nuria

arXiv.org Artificial Intelligence

Interactive applications demand believable characters that respond naturally to dynamic environments. Traditional character animation techniques often struggle to handle arbitrary situations, leading to a growing trend of dynamically selecting motion-captured animations based on predefined features. While Motion Matching has proven effective for locomotion by aligning to target trajectories, animating environment interactions and crowd behaviors remains challenging due to the need to consider surrounding elements. Existing approaches often involve manual setup or lack the naturalism of motion capture. Furthermore, in crowd animation, body animation is frequently treated as a separate process from trajectory planning, leading to inconsistencies between body pose and root motion. To address these limitations, we present Environment-aware Motion Matching, a novel real-time system for full-body character animation that dynamically adapts to obstacles and other agents, emphasizing the bidirectional relationship between pose and trajectory. In a preprocessing step, we extract shape, pose, and trajectory features from a motion capture database. At runtime, we perform an efficient search that matches user input and current pose while penalizing collisions with a dynamic environment. Our method allows characters to naturally adjust their pose and trajectory to navigate crowded scenes.
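
Motion matching boils down to a nearest-neighbor search over precomputed features; the environment-aware variant described above additionally penalizes candidates whose trajectories collide with surrounding geometry. A minimal sketch, assuming a flat feature vector and circular 2D obstacles (both illustrative choices, not the paper's feature set):

```python
import numpy as np

def match(query_feat: np.ndarray,        # (D,) features from user input + pose
          db_feats: np.ndarray,          # (N, D) precomputed database features
          db_traj: np.ndarray,           # (N, T, 2) candidate root trajectories
          obstacles: np.ndarray,         # (M, 3) circles as (x, y, radius)
          w_collision: float = 10.0) -> int:
    """Return the index of the best-matching database clip, penalizing
    candidates whose trajectory samples enter any obstacle circle."""
    cost = ((db_feats - query_feat) ** 2).sum(axis=1)          # feature distance
    d = np.linalg.norm(db_traj[:, :, None, :]                  # (N, T, M)
                       - obstacles[None, None, :, :2], axis=-1)
    hits = (d < obstacles[None, None, :, 2]).sum(axis=(1, 2))  # collision count
    return int(np.argmin(cost + w_collision * hits))

best = match(np.zeros(24), np.random.randn(100, 24),
             np.random.randn(100, 10, 2) * 3.0,
             np.array([[1.0, 1.0, 0.5]]))
```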


Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Seo, Junyoung, Mira, Rodrigo, Haliassos, Alexandros, Bounareli, Stella, Chen, Honglie, Tran, Linh, Kim, Seungryong, Landgraf, Zoe, Shen, Jie

arXiv.org Artificial Intelligence

Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures.

Audio-driven human animation aims to generate realistic human videos synchronized with input audio, with widespread applications in film production, virtual assistants, and digital content creation. The advent of Diffusion Transformers (DiTs) (Peebles & Xie, 2022) has significantly advanced this field, enabling natural human video generation not only for portrait videos but also in diverse environments with complex backgrounds (Xu et al., 2024; Chen et al., 2025a). However, current DiT-based models can only handle short clips at a time, typically around 5 seconds, due to the quadratic complexity of diffusion transformer architectures.
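
The lookahead mechanism is easy to picture as a windowed generation loop in which the anchor keyframe is sampled from beyond the current window's end. In the sketch below, `generate_window` is a hypothetical stand-in for one step of any autoregressive animation model, and its interface is an assumption; passing `keyframes=None` mimics the self-keyframing variant, where the reference image is the lookahead target.

```python
# `generate_window(prev_frames, anchor)` is a hypothetical stand-in for one
# autoregressive generation step of an audio-driven animation model.
def generate(frames_total: int, window: int, lookahead: int,
             reference_image, generate_window, keyframes=None):
    """Windowed autoregressive generation with future anchors.
    keyframes=None mimics self-keyframing: the reference image itself
    is the lookahead target for every window."""
    video, prev = [], None
    for start in range(0, frames_total, window):
        anchor_t = start + window + lookahead      # a timestep *beyond* the window
        anchor = (keyframes[min(anchor_t, frames_total - 1)]
                  if keyframes is not None else reference_image)
        prev = generate_window(prev, anchor)       # pursue the future anchor
        video.extend(prev)
    return video

# Toy usage: each "frame" is just the anchor label, repeated per window.
dummy_step = lambda prev, anchor: [anchor] * 8
clip = generate(frames_total=32, window=8, lookahead=4,
                reference_image="ref.png", generate_window=dummy_step)
```

A larger `lookahead` keeps the anchor farther from the frames being generated, loosening its pull (more motion freedom); a smaller one tightens identity adherence, matching the trade-off described above.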


Multi-identity Human Image Animation with Structural Video Diffusion

Wang, Zhenzhi, Li, Yixuan, Zeng, Yanhong, Guo, Yuwei, Lin, Dahua, Xue, Tianfan, Dai, Bo

arXiv.org Artificial Intelligence

Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals, and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation. Code is available here.
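
The two innovations can be pictured as a conditioning builder: each person's pose map is tied to an identity-specific embedding so appearance and pose stay correctly paired, while depth and surface normals are stacked as shared structural channels. Tensor shapes and the concatenation-based fusion below are illustrative assumptions, not the paper's actual design.

```python
import torch

def build_condition(pose_maps: torch.Tensor,   # (P, 1, H, W), one map per person
                    id_embeds: torch.Tensor,   # (P, C) identity-specific vectors
                    depth: torch.Tensor,       # (1, H, W) depth cue
                    normals: torch.Tensor):    # (3, H, W) surface-normal cue
    """Pair each person's pose map with their identity embedding, and stack
    depth + normals as shared structural channels (illustrative fusion)."""
    P, _, H, W = pose_maps.shape
    # Broadcast each identity embedding over its own pose map, keeping
    # appearance associated with the correct person.
    id_planes = id_embeds[:, :, None, None].expand(-1, -1, H, W)
    per_person = torch.cat([pose_maps, id_planes], dim=1)   # (P, 1 + C, H, W)
    structure = torch.cat([depth, normals], dim=0)          # (4, H, W)
    return per_person, structure

cond, struct = build_condition(torch.rand(2, 1, 64, 64), torch.rand(2, 8),
                               torch.rand(1, 64, 64), torch.rand(3, 64, 64))
```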