WriteViT: Handwritten Text Generation with Vision Transformer

Dang Hoai Nam, Huynh Tong Dang Khoa, Vo Nguyen Le Duy


Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style, a one-shot ability that current handwriting-generation models still struggle to match. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework built on Vision Transformers (ViT), a family of models that has shown strong performance across a wide range of computer vision tasks. WriteViT integrates a ViT-based Writer Identifier that extracts style embeddings, a multi-scale generator built from Transformer encoder-decoder blocks enhanced with conditional positional encoding (CPE), and a lightweight ViT-based recognizer. Whereas previous methods typically rely on CNNs or CRNNs, our design employs transformers in the key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese, a language rich in diacritics and complex typography, remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios.

Preprint submitted to arXiv, May 31, 2025

1. Introduction

Despite significant technological advancements, handwritten text continues to play a critical role in domains such as historical archiving, form processing, and educational assessment. Consequently, handwritten text recognition (HTR) remains a key area of research in document analysis. The task nonetheless poses persistent challenges due to the inherent variability of handwriting.