A Library for Representing Python Programs as Graphs for Machine Learning
David Bieber, Kensen Shi, Petros Maniatis, Charles Sutton, Vincent Hellendoorn, Daniel Johnson, Daniel Tarlow
arXiv.org Artificial Intelligence
A standard class of approaches to applying machine learning to code is to construct a graph representation of a program and then perform the analysis of interest on that graph, learning from a large dataset of labeled example programs. Graph representations of programs used for machine learning include the abstract syntax tree (AST), control-flow graph (CFG), data-flow graph, inter-procedural control-flow graph (ICFG), interval graph, and composite "program graphs" that encode information from several of these graphs, possibly with additional program-derived data. The python_graphs library directly supports constructing some of these graph types (e.g., control-flow graphs and composite program graphs) from arbitrary Python programs, and it provides tools that aid in constructing the others. It has been used successfully in a variety of publications on machine learning for code, and we make it available as free and open source software to allow for broader use. In Section 2 we present an overview of the use of graph representations of code in machine learning. In Section 3 we describe the capabilities (Section 3.1), possible extensions (Section 3.2), and limitations (Section 3.3) of python_graphs. Section 4 highlights applications of python_graphs in machine learning research. Section 5 presents a case study applying python_graphs to 3.3 million programs from Project CodeNet [28].
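To make the idea of a graph representation concrete, the following is a minimal sketch of the simplest case named above: turning a Python program's abstract syntax tree into a list of nodes and parent-child edges. It deliberately uses only the standard-library `ast` module and does not call python_graphs; the helper name `ast_to_graph` and the (nodes, edges) output format are assumptions made for illustration, not the library's API.

```python
import ast


def ast_to_graph(source):
    """Return (nodes, edges) for the AST of `source`.

    Hypothetical helper for illustration only (not part of python_graphs).
    nodes: list of AST node type names, indexed by node id.
    edges: list of (parent_id, child_id) pairs.
    """
    tree = ast.parse(source)
    nodes = []
    edges = []

    def visit(node):
        # Assign ids in pre-order: a node's id is its position in `nodes`.
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            child_id = visit(child)
            edges.append((node_id, child_id))
        return node_id

    visit(tree)
    return nodes, edges


# Example program: a single function with one return statement.
nodes, edges = ast_to_graph("def f(x):\n    return x + 1\n")
for parent, child in edges:
    print(f"{nodes[parent]} -> {nodes[child]}")
```

An edge list of this kind is a common input format for graph neural networks. A control-flow or composite program graph would add further edge types (for example, edges between statements that may execute in sequence) on top of these syntactic edges.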
Aug-15-2022