Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen
–arXiv.org Artificial Intelligence
Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures.

Audio-driven human animation aims to generate realistic human videos synchronized with input audio, with widespread applications in film production, virtual assistants, and digital content creation. The advent of Diffusion Transformers (DiTs) (Peebles & Xie, 2022) has significantly advanced this field, enabling natural human video generation not only for portrait videos but also in diverse environments with complex backgrounds (Xu et al., 2024; Chen et al., 2025a). However, current DiT-based models can only handle short clips at a time, typically around 5 seconds, due to the quadratic complexity of diffusion transformer architectures.
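The mechanism described above can be illustrated with a toy numerical sketch. Everything here is an illustrative assumption rather than the paper's implementation: identity is modeled as a single scalar that decays a little on each generated frame, and the lookahead anchor (the reference image itself, as in self-keyframing) exerts a pull whose strength weakens as the lookahead distance grows.

```python
# Toy sketch of Lookahead Anchoring on a 1-D "identity" signal.
# All names and dynamics are illustrative assumptions, not the paper's
# actual method: identity is a scalar, drift is exponential decay, and
# the pull toward the future anchor weakens with lookahead distance d.

def generate(reference, n_windows, frames_per_window, lookahead=None):
    """Autoregressively generate frames window by window; optionally
    condition each window on the reference placed `lookahead` frames
    beyond the window (self-keyframing)."""
    x = reference
    frames = []
    # Assumed guidance schedule: pull strength ~ 1 / (1 + distance).
    pull = 0.0 if lookahead is None else 1.0 / (1.0 + lookahead)
    for _ in range(n_windows):
        for _ in range(frames_per_window):
            x *= 0.97                    # per-frame identity drift
            x += pull * (reference - x)  # guidance toward the future anchor
            frames.append(x)
    return frames

drifted = generate(1.0, n_windows=10, frames_per_window=12)
anchored = generate(1.0, n_windows=10, frames_per_window=12, lookahead=8)
tight = generate(1.0, n_windows=10, frames_per_window=12, lookahead=2)
```

Larger `lookahead` values weaken the pull and leave more room for motion, while smaller values hold the signal closer to the reference, mirroring the expressivity/consistency trade-off the abstract describes; with no anchor at all, identity decays away across windows.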
Oct-28-2025
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Media (0.48)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.48)
- Natural Language (1.00)
- Vision (1.00)
- Graphics > Animation (0.76)