FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention
Huangliang Dai, Shixun Wu, Jiajun Huang, Zizhe Jian, Yue Zhu, Haiyang Hu, Zizhong Chen
arXiv.org Artificial Intelligence
Transformer models rely on High-Performance Computing (HPC) resources for inference, and soft errors are inevitable in large-scale systems, making model reliability critical. Existing fault tolerance frameworks for Transformers are designed at the operation level without architectural optimization, incurring significant computational and memory overhead that reduces protection efficiency and limits scalability to larger models. In this paper, we implement module-level protection for Transformers by treating the operations within the attention module as a single kernel and applying end-to-end fault tolerance. This method provides unified protection across multi-step computations and comprehensive coverage of potential errors in the nonlinear computations. For linear modules, we design a strided algorithm-based fault tolerance (ABFT) scheme that avoids inter-thread communication. Experimental results show that our end-to-end fault tolerance achieves up to a 7.56x speedup over traditional methods, with an average fault tolerance overhead of 13.9%.
Aug-14-2025
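The strided ABFT mentioned in the abstract builds on the classic checksum relations for a matrix product. As a point of reference, below is a minimal NumPy sketch of textbook ABFT detection and single-element correction for C = A·B; the function name, shapes, and tolerance are illustrative assumptions, and the paper's strided, communication-free GPU kernel is not reproduced here.

```python
import numpy as np

def abft_detect_correct(A, B, C, tol=1e-6):
    """Check C against A @ B using ABFT checksums; fix a single bad element.

    Invariants carried by the multiplication:
        1^T C = (1^T A) B      (column checksums)
        C 1   = A (B 1)        (row checksums)
    A single corrupted element of C violates exactly one row checksum and
    one column checksum, which locates it; the row checksum then recovers
    its value without recomputing the product.
    """
    expected_col = A.sum(axis=0) @ B        # true column sums of C
    expected_row = A @ B.sum(axis=1)        # true row sums of C
    bad_cols = np.flatnonzero(np.abs(C.sum(axis=0) - expected_col) > tol)
    bad_rows = np.flatnonzero(np.abs(C.sum(axis=1) - expected_row) > tol)
    if bad_rows.size == 0 and bad_cols.size == 0:
        return None                         # no fault detected
    if bad_rows.size == 1 and bad_cols.size == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Recover the element from its row checksum and the intact entries.
        C[i, j] = expected_row[i] - (C[i, :].sum() - C[i, j])
        return (i, j)                       # fault located and corrected
    raise RuntimeError("multiple faults detected; recompute the tile")

# Usage: inject one simulated soft error and recover from it.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
B = rng.standard_normal((32, 48))
C = A @ B
C[10, 20] += 5.0                            # bit-flip-style corruption
assert abft_detect_correct(A, B, C) == (10, 20)
assert np.allclose(C, A @ B)
```

Because a single corrupted element perturbs exactly one row checksum and one column checksum, their intersection locates the fault and the row checksum restores the value without recomputing the product; per the abstract, the paper's end-to-end scheme extends this style of protection across the fused attention kernel, including its nonlinear steps.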