pynguin
CoverUp: Coverage-Guided LLM-Based Test Generation
Pizzorno, Juan Altmayer, Berger, Emery D.
This paper presents CoverUp, a novel system that drives the generation of high-coverage Python regression tests via a combination of coverage analysis and large-language models (LLMs). CoverUp iteratively improves coverage, interleaving coverage analysis with dialogs with the LLM to focus its attention on as yet uncovered lines and branches. The resulting test suites significantly improve coverage over the current state of the art: compared to CodaMosa, a hybrid LLM / search-based software testing system, CoverUp substantially improves coverage across the board. On a per-module basis, CoverUp achieves median line coverage of 81% (vs. 62%), branch coverage of 53% (vs. 35%) and line+branch coverage of 78% (vs. 55%). We show that CoverUp's iterative, coverage-guided approach is crucial to its effectiveness, contributing to nearly half of its successes.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > United States > California > Los Angeles County > Pasadena (0.04)
Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools
Bhatia, Shreya, Gandhi, Tarushi, Kumar, Dhruv, Jalote, Pankaj
Generating unit tests is a crucial task in software development, demanding substantial time and effort from programmers. The advent of Large Language Models (LLMs) introduces a novel avenue for unit test script generation. This research aims to experimentally investigate the effectiveness of LLMs, specifically exemplified by ChatGPT, for generating unit test scripts for Python programs, and how the generated test cases compare with those generated by an existing unit test generator (Pynguin). For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code. The generated test cases are evaluated based on criteria such as coverage, correctness, and readability. Our results show that ChatGPT's performance is comparable with Pynguin in terms of coverage. At the same time, ChatGPT's ability to generate tests is superior to Pynguin, as the latter is not able to generate test cases for Category 1. We also find that about 39% and 28% of assertions generated by ChatGPT for Category 2 and 3, respectively, were incorrect. Our results also show that there is minimal overlap in missed statements between ChatGPT and Pynguin, thus, suggesting that a combination of both tools may enhance unit test generation performance. Finally, prompt engineering improved ChatGPT's performance, achieving an average 28% coverage improvement in Category 2 and 15% improvement in Category 3 after about 4 iterations.
- Europe > Portugal > Lisbon > Lisbon (0.05)
- North America > United States > New York > New York County > New York City (0.05)
- Asia > India > NCT > Delhi (0.05)
- (5 more...)