Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks

Zulkifli, Kahfi S., Qian, Wenbo, Zhu, Shaowei, Zhou, Yuan, Zhang, Zhen, Lou, Chang

Sep-16-2025–arXiv.org Artificial Intelligence

Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet these same techniques add new layers of complexity and introduce silent errors that severely degrade model performance. Existing solutions are either ad hoc or too costly for production use. We present Scalify, a lightweight framework that exposes silent errors by verifying the semantic equivalence of computational graphs using equality saturation and Datalog-style reasoning. To scale, Scalify partitions graphs with parallel rewriting and layer memoization, reuses rewrite templates, and augments equality saturation with relational reasoning and symbolic bijection inference. It further localizes discrepancies to precise code sites, turning verification results into actionable debugging guidance. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and exposed five previously unknown bugs in Amazon production machine learning frameworks.
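To make the idea of checking semantic equivalence between a single-device graph and its parallelized counterpart concrete, here is a toy sketch. It is not Scalify's implementation: it uses a naive bounded rewrite search rather than a real e-graph with equality saturation, and the term encoding, the two rewrite rules, and every function name (`rewrites`, `reachable`, `equivalent`) are assumptions made for illustration only.

```python
# Terms are nested tuples ("op", arg1, arg2) or string leaves (tensor names).
# This encoding and both rules below are hypothetical, chosen for illustration.

def rewrites(term):
    """Yield every term obtained by applying one rule at one position."""
    if isinstance(term, str):
        return
    op, *args = term
    # Rule 1 (column-parallel matmul):
    #   matmul(x, concat(a, b)) -> concat(matmul(x, a), matmul(x, b))
    if (op == "matmul" and len(args) == 2
            and not isinstance(args[1], str) and args[1][0] == "concat"):
        x, (_, a, b) = args
        yield ("concat", ("matmul", x, a), ("matmul", x, b))
    # Rule 2 (commutativity of add): add(a, b) -> add(b, a)
    if op == "add" and len(args) == 2:
        yield ("add", args[1], args[0])
    # Apply the rules inside each child as well.
    for i, child in enumerate(args):
        for new_child in rewrites(child):
            yield (op, *args[:i], new_child, *args[i + 1:])


def reachable(term, max_terms=1000):
    """Collect terms reachable by repeated rewriting, up to a size bound."""
    seen = {term}
    frontier = [term]
    while frontier and len(seen) < max_terms:
        next_frontier = []
        for t in frontier:
            for r in rewrites(t):
                if r not in seen:
                    seen.add(r)
                    next_frontier.append(r)
        frontier = next_frontier
    return seen


def equivalent(a, b):
    """True if some term is reachable from both graphs under the rules."""
    return not reachable(a).isdisjoint(reachable(b))


# Single-device graph: (x @ concat(w1, w2)) + bias
single = ("add", ("matmul", "x", ("concat", "w1", "w2")), "bias")
# Parallelized graph: each shard multiplies separately, then results are joined.
parallel = ("add", "bias",
            ("concat", ("matmul", "x", "w1"), ("matmul", "x", "w2")))
# Hypothetical silent bug: the shards are concatenated in the wrong order.
buggy = ("add", "bias",
         ("concat", ("matmul", "x", "w2"), ("matmul", "x", "w1")))

print(equivalent(single, parallel))
print(equivalent(single, buggy))
```

In this sketch a negative answer only means that no equivalence was found under the given rules and bound; it is not a proof that the graphs differ. An e-graph represents the same rewrite closure far more compactly, which is part of why equality saturation is attractive for large graphs.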

artificial intelligence, machine learning, natural language

Country:
- Europe (1.00)
- North America > United States
  - California (0.68)

Genre:
- Research Report (1.00)
- Workflow (0.68)

Industry:
- Information Technology (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
