Rethinking Verification for Generation From Generation to Testing

Jun-22-2026, 18:02:21 GMT–Neural Information Processing Systems

Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Jun-22-2026, 18:02:21 GMT

Conferences PDF

Add feedback

Country:
- North America > United States (1.00)
- Asia > China (0.67)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found