ProofNet++: A Neuro-Symbolic System for Formal Proof Verification with Self-Correction
–arXiv.org Artificial Intelligence
Table I presents the quantitative evaluation of ProofNet++ across three distinct datasets. The FPSR (Final Proof Success Rate) metric shows that the system performs best on the mathlib-extract dataset with a 74.9% success rate, followed by miniF2F at 68.4%, and the HOL Light Testbed trailing at 63.5%. Similarly, the PPC (Proof Production Correctness) values align with this trend, indicating higher intermediate proof accuracy on mathlib-extract (88.0%) compared to the other datasets. The EDPT (Edit Distance to Proof Target) metric reveals that mathlib-extract proofs require fewer correction steps (2.4) than miniF2F (3.2) and HOL Light (4.0), suggesting that the system is more efficient in approximating correct proofs in that domain. Latency measurements reflect verifier runtime, with mathlib-extract exhibiting the fastest average verification time (176 ms), whereas HOL Light has the highest latency (214 ms). Lastly, the average proof length varies notably, with HOL Light proofs being the longest (14.3 steps), potentially contributing to its higher latency and lower success metrics. These results indicate that while ProofNet++ demonstrates strong performance on established libraries like mathlib-extract, there is room for improvement on datasets with more complex or longer proofs, such as HOL Light. Enhancements could focus on optimizing proof search strategies and reducing verifier latency, particularly for longer proofs, to improve overall efficiency and success rates. E. Benchmark Pipeline Overview Figure 1 illustrates the full evaluation pipeline used to benchmark ProofNet++, from the initial input prompt to the final corrected proof output.
arXiv.org Artificial Intelligence
Jun-2-2025