AITopics | Automatic Programming

Automatic Programming

"Computer programming is the process of constructing executable code from fragmentary information. ... When computer programming is done by a machine, the process is called automatic programming. AI researchers are interested in studying automatic programming for two reasons: First, it would be highly useful to have a powerful automatic programming system that could receive casual and imprecise specifications for a desired target program and then correctly generate that program; second, automatic programming is widely believed to be a necessary component of any intelligent system and is therefore a topic for fundamental research in its own right."
– excerpt from Biermann, A. 1992. Automatic Programming. In Encyclopedia of Artificial Intelligence. 2nd edition, Stuart C. Shapiro, editor, 18 - 35. New York: John Wiley & Sons.

Natural Language to Code Generation in Interactive Data Science Notebooks

Yin, Pengcheng, Li, Wen-Ding, Xiao, Kefan, Rao, Abhishek, Wen, Yeming, Shi, Kensen, Howland, Joshua, Bailey, Paige, Catasta, Michele, Michalewski, Henryk, Polozov, Alex, Sutton, Charles

arXiv.org Artificial Intelligence, Dec-19-2022

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.

large language model, machine learning, natural language, (21 more...)

arXiv:2212.09248

Country:

Asia > Middle East > Israel (0.04)
Asia > India > West Bengal > Kolkata (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
(10 more...)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.61)

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

Cassano, Federico, Gouwar, John, Nguyen, Daniel, Nguyen, Sydney, Phipps-Costin, Luna, Pinckney, Donald, Yee, Ming-Ho, Zi, Yangtian, Anderson, Carolyn Jane, Feldman, Molly Q, Guha, Arjun, Greenberg, Michael, Jangda, Abhinav

arXiv.org Artificial Intelligence, Dec-19-2022

Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark and MBPP benchmark to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
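The idea of translating a unit test into another language can be illustrated with a minimal sketch. It assumes a benchmark test is given as a function name, a tuple of argument values, and an expected value. The helpers `to_js` and `translate_test` are hypothetical names chosen for illustration and are not MultiPL-E's actual code; JavaScript with Node's `assert.deepEqual` is just one possible target.

```python
import json


def to_js(value):
    """Render a simple Python value as a JavaScript literal (illustrative only)."""
    if value is None:
        return "null"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return json.dumps(value)
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(to_js(v) for v in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def translate_test(func_name, args, expected):
    """Turn one (call, expected value) test case into a JavaScript assertion."""
    call = f"{func_name}({', '.join(to_js(a) for a in args)})"
    return f"assert.deepEqual({call}, {to_js(expected)});"


js_line = translate_test("evens", ([1, 2, 3],), [2])
print(js_line)
```

Only values are translated here, never arbitrary test logic, which is what keeps this kind of translation mechanical. Special floats such as infinity are not handled by this sketch.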

benchmark, machine learning, programming language, (19 more...)

arXiv:2208.08227

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Government (0.45)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

NP4G : Network Programming for Generalization

Hara, Shoichiro, Watanabe, Yuji

arXiv.org Artificial Intelligence, Dec-8-2022

Automatic programming has long been studied through various approaches, including genetic programming. In recent years, automatic programming using neural networks such as GPT-3 has been actively studied and is attracting a lot of attention. However, these methods rely on non-logical inference drawn from experience accumulated through enormous amounts of training, and their reasoning process is unclear. Even among methods based on logical inference with a clear reasoning process, no system that automatically generates arbitrary programs has yet been realized. In particular, inductive inference, in which logical inference generalizes from a single example, is an important issue for enabling artificial intelligence to acquire knowledge by itself. In this study, we propose NP4G: Network Programming for Generalization, which can automatically generate programs by inductive inference. Because the proposed method can realize "sequence", "selection", and "iteration" in programming and can satisfy the conditions of the structured program theorem, NP4G is expected to be a method that can automatically acquire any program by inductive inference. As an example, we automatically construct a bitwise NOT operation program from several training data by generalization using NP4G. Although NP4G only randomly selects and connects nodes, we show that, by adjusting the number of nodes and the number of phases of "Phased Learning", bitwise NOT operation programs are acquired in a comparatively short time and in about 7 out of 10 runs. The source code of NP4G is available on GitHub as a public repository.

artificial intelligence, machine learning, node, (15 more...)

arXiv:2212.11118

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)

Genre: Research Report (0.71)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Coder Reviewer Reranking for Code Generation

Zhang, Tianyi, Yu, Tao, Hashimoto, Tatsunori B., Lewis, Mike, Yih, Wen-tau, Fried, Daniel, Wang, Sida I.

arXiv.org Artificial Intelligence, Nov-29-2022

Sampling diverse programs from a code language model and reranking with model likelihood is a popular method for code generation but it is prone to preferring degenerate solutions. Inspired by collaborative programming, we propose Coder-Reviewer reranking. We augment Coder language models from past work, which generate programs given language instructions, with Reviewer models, which evaluate the likelihood of the instruction given the generated programs. We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk method. Coder-Reviewer reranking is easy to implement by prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameters.
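The reranking rule described above can be sketched as follows, assuming the two log-likelihoods have already been computed for each sampled program. The `Candidate` class and the numbers are hypothetical and no language model is actually called; combining the two scores by summing log-likelihoods is the simplest reading of the abstract.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    program: str
    coder_logprob: float     # log p(program | instruction), hypothetical value
    reviewer_logprob: float  # log p(instruction | program), hypothetical value


def coder_reviewer_rerank(candidates):
    """Sort candidates by the sum of Coder and Reviewer log-likelihoods."""
    return sorted(
        candidates,
        key=lambda c: c.coder_logprob + c.reviewer_logprob,
        reverse=True,
    )


candidates = [
    # Short, generic program: likely under the Coder, but it explains the
    # instruction poorly, so the Reviewer assigns it a low score.
    Candidate("return x", coder_logprob=-1.0, reviewer_logprob=-9.0),
    Candidate("return sorted(x)", coder_logprob=-2.5, reviewer_logprob=-1.5),
]

coder_only_best = max(candidates, key=lambda c: c.coder_logprob)
best = coder_reviewer_rerank(candidates)[0]
print(coder_only_best.program, "|", best.program)
```

In this toy example the Coder score alone prefers the degenerate candidate, while the combined score prefers the program that also makes the instruction likely.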

large language model, machine learning, natural language, (17 more...)

arXiv:2211.16490

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.63)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

CodeT: Code Generation with Generated Tests

Chen, Bei, Zhang, Fengji, Nguyen, Anh, Zan, Daoguang, Lin, Zeqi, Lou, Jian-Guang, Chen, Weizhu

arXiv.org Artificial Intelligence, Nov-23-2022

The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select the most appropriate solution from the multiple samples generated by the pre-trained language models. A natural way to evaluate the quality and correctness of a code solution is to run it against a set of test cases, but the manual creation of such test cases is often costly and time-consuming. We propose CodeT, a method that leverages the same pre-trained language models to automatically generate test cases for the code samples, thus reducing the human effort and increasing the coverage of the test scenarios. CodeT then executes the code samples using the generated test cases and performs a dual execution agreement, which considers both the consistency of the outputs against the generated test cases and the agreement of the outputs with other code samples. We conduct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS, and CodeContests, using five different pre-trained language models with varying sizes and capabilities. CodeT can significantly improve the performance of code solution selection over previous methods, achieving remarkable and consistent gains across different models and benchmarks. CodeT improves the pass@1 metric on HumanEval to 65.8%, which represents an absolute improvement of 18.8% over the code-davinci-002 model, and an absolute improvement of more than 20% over the previous state-of-the-art results.
Despite the remarkable progress in pre-training techniques for code generation, selecting a single correct solution from multiple candidates generated by large language models remains a hard problem. For instance, Codex (Chen et al., 2021), a state-of-the-art pre-trained language model for code generation, can achieve a pass@100 (pass if one or more among 100 generated solutions for a given problem can pass the corresponding test cases) of 77.4%, but a pass@1 (correct rate of a single solution) of only 33.5% on the HumanEval benchmark (Chen et al., 2021); these results are reported with the Codex model code-cushman-001. A straightforward way to verify the correctness of a solution is to execute it and check if it passes all corresponding test cases. This execution-guided approach has been widely adopted in various code-related tasks, such as code generation (Chen et al., 2021; Li et al., 2022b; Shi et al., 2022), code translation (Roziere et al., 2021), and program synthesis (Chen et al., 2018; Ellis et al., 2019). However, this approach relies heavily on the quality and quantity of test cases, which are often costly and time-consuming to create and maintain. Therefore, we propose to automatically generate test cases for arbitrary programming problems and use them to quickly verify any solution. (The first three authors contributed equally.)
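A simplified sketch of the dual execution agreement idea follows. It assumes that candidate solutions are plain Python callables and that generated tests are (input, expected output) pairs, some of which may be wrong. Solutions that pass exactly the same tests form a group, and each group is scored by the number of solutions in it times the number of tests they pass. The function name, this scoring rule, and the toy data are assumptions for illustration; in the described setting both the solutions and the tests would come from a language model.

```python
from collections import defaultdict


def dual_execution_agreement(solutions, tests):
    """Return (chosen solution name, score of its consensus group).

    solutions: dict mapping a name to a one-argument callable
    tests: list of (input, expected_output) pairs, possibly incorrect
    """
    groups = defaultdict(list)
    for name, fn in solutions.items():
        passed = []
        for index, (arg, expected) in enumerate(tests):
            try:
                ok = fn(arg) == expected
            except Exception:
                ok = False  # a crashing solution simply fails that test
            if ok:
                passed.append(index)
        # Solutions that pass exactly the same tests agree with each other.
        groups[frozenset(passed)].append(name)
    best_tests, best_names = max(
        groups.items(), key=lambda item: len(item[1]) * len(item[0])
    )
    return best_names[0], len(best_names) * len(best_tests)


# Toy problem: absolute value. The last test is deliberately wrong.
solutions = {
    "a": lambda x: abs(x),
    "b": lambda x: x if x >= 0 else -x,
    "c": lambda x: x,
}
tests = [(3, 3), (-2, 2), (0, 0), (-1, -1)]
chosen, score = dual_execution_agreement(solutions, tests)
print(chosen, score)
```

Here the incorrect solution `c` passes the wrong test, but the two correct solutions agree on three tests and their group wins.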

large language model, machine learning, natural language, (18 more...)

arXiv:2207.10397

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Lai, Yuhang, Li, Chengxi, Wang, Yiming, Zhang, Tianyi, Zhong, Ruiqi, Zettlemoyer, Luke, Yih, Scott Wen-tau, Fried, Daniel, Wang, Sida, Yu, Tao

arXiv.org Artificial Intelligence, Nov-18-2022

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
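A multi-criteria check of the kind described above can be sketched in a few lines. This is an assumption-laden illustration and not DS-1000's evaluation code: the function names are hypothetical, the surface-form constraint is a made-up "no explicit loops" rule, and the candidate programs are hand-written strings.

```python
import ast


def passes_tests(source, func_name, cases):
    """Functional check: run the candidate and compare outputs on test cases."""
    namespace = {}
    try:
        exec(source, namespace)
        return all(namespace[func_name](*args) == expected for args, expected in cases)
    except Exception:
        return False


def uses_loop(source):
    """Surface-form check: does the source contain a for or while statement?"""
    return any(isinstance(node, (ast.For, ast.While)) for node in ast.walk(ast.parse(source)))


def accept(source, func_name, cases):
    """Accept only if the tests pass and the (hypothetical) no-loop rule holds."""
    return passes_tests(source, func_name, cases) and not uses_loop(source)


cases = [(([1, 2, 3],), 6), (([],), 0)]
with_loop = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
without_loop = "def total(xs):\n    return sum(xs)\n"
print(accept(with_loop, "total", cases), accept(without_loop, "total", cases))
```

Both candidates are functionally correct, yet only the second is accepted, which is how a surface-form constraint can reject solutions that sidestep the intended API.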

large language model, logic & formal reasoning, machine learning, (22 more...)

arXiv:2211.11501

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.63)
(2 more...)

Execution-based Evaluation for Data Science Code Generation Models

Huang, Junjie, Wang, Chenglong, Zhang, Jipeng, Yan, Cong, Cui, Haotian, Inala, Jeevana Priya, Clement, Colin, Duan, Nan, Gao, Jianfeng

arXiv.org Artificial Intelligence, Nov-17-2022

Code generation models can benefit data scientists' productivity by automatically generating code from context and text descriptions. An important measure of the modeling progress is whether a model can generate code that can correctly execute to solve the task. However, due to the lack of an evaluation dataset that directly supports execution-based model evaluation, existing work relies on code surface form similarity metrics (e.g., BLEU, CodeBLEU) for model selection, which can be inaccurate. To remedy this, we introduce ExeDS, an evaluation dataset for execution evaluation for data science code generation tasks. ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and the desired execution output. With ExeDS, we evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores. Our experiments show that models with high surface-form scores do not necessarily perform well on execution metrics, and execution-based metrics can better capture model code generation errors. Source code and data can be found at https://github.com/Jun-jie-Huang/ExeDS
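The gap between surface-form similarity and execution correctness can be shown with a small sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a surface metric (it is not BLEU or CodeBLEU), and the reference program, the two candidates, and the convention of storing the answer in a variable named `result` are all assumptions made for illustration.

```python
import difflib


def surface_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a surface metric."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def run(source, data):
    """Execute a snippet that reads `data` and stores its answer in `result`."""
    namespace = {"data": list(data)}
    exec(source, namespace)
    return namespace["result"]


reference = "result = sum(x * x for x in data)"
# Looks almost identical to the reference but computes something else.
candidate_a = "result = sum(x + x for x in data)"
# Looks quite different from the reference but computes the same value.
candidate_b = "result = 0\nfor v in data:\n    result += v ** 2"

data = [1, 2, 3]
expected = run(reference, data)
for name, cand in [("a", candidate_a), ("b", candidate_b)]:
    print(name, round(surface_similarity(reference, cand), 2), run(cand, data) == expected)
```

The candidate with the higher surface similarity is the one that produces the wrong output, which is the failure mode that execution-based evaluation is meant to catch.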

artificial intelligence, automatic programming, exeds, (17 more...)

arXiv:2211.09374

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China > Hong Kong (0.04)
North America > Dominican Republic (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)

Evaluating How Fine-tuning on Bimodal Data Effects Code Generation

Orlanski, Gabriel, Yang, Seonhye, Healy, Michael

arXiv.org Artificial Intelligence, Nov-14-2022

Despite the increase in popularity of language models for code generation, it is still unknown how training on bimodal coding forums affects a model's code generation performance and reliability. We, therefore, collect a dataset of over 2.2M StackOverflow questions with answers for finetuning. These fine-tuned models have average pass@k improvements of 54.64% and 85.35% on the HumanEval (Chen et al., 2021) and Mostly Basic Program Problems (Austin et al., 2021) tasks, respectively. This regime further decreases the number of generated programs with both syntax and runtime errors. However, we find that at higher temperatures, there are significant decreases to the model's ability to generate runnable programs despite higher pass@k scores, underscoring the need for better methods of incorporating such data that mitigate these side effects. The code can be found at https://github.com/gabeorlanski/bimodalcode-generation
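For readers unfamiliar with the pass@k metric, here is a minimal sketch of the unbiased estimator introduced with HumanEval (Chen et al., 2021): with n samples per problem, of which c are correct, pass@k is estimated as 1 - C(n-c, k) / C(n, k). Whether the paper above computes its scores exactly this way is an assumption, and the sample counts below are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: number of samples generated for a problem
    c: number of those samples that pass the tests
    k: number of samples the metric is allowed to draw (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples generated, samples correct).
results = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(k, round(score, 3))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.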

artificial intelligence, automatic programming, machine learning, (16 more...)

arXiv:2211.07842

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Berlin (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.83)

Exploring and Evaluating Personalized Models for Code Generation

Zlotchevski, Andrei, Drain, Dawn, Svyatkovskiy, Alexey, Clement, Colin, Sundaresan, Neel, Tufano, Michele

arXiv.org Artificial Intelligence, Sep-19-2022

Large Transformer models have achieved state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain -- for example, question-answering on a given topic -- generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.

artificial intelligence, machine learning, natural language, (19 more...)

doi: 10.1145/3540250.3558959

arXiv:2208.13928

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Quebec > Montreal (0.14)
Asia > Singapore > Central Region > Singapore (0.05)
(3 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.46)
Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)

Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation

Shen, Sijie, Zhu, Xiang, Dong, Yihong, Guo, Qizhi, Zhen, Yankun, Li, Ge

arXiv.org Artificial Intelligence, Aug-22-2022

Code generation aims to generate a code snippet automatically from natural language descriptions. Generally, the mainstream code generation methods rely on a large amount of paired training data, including both the natural language description and the code. However, in some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available paired data, and a lot of effort is required to manually write the code descriptions to construct a high-quality training dataset. Due to the limited training data, the generation model cannot be well trained and is likely to overfit, making the model's performance unsatisfactory for real-world use. To this end, in this paper, we propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model by extending the original TranX model to support subtoken-level code generation. To verify our proposed approach, we collect a real-world code generation dataset and conduct experiments on it. Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset, and the exact match accuracy of Subtoken-TranX improves significantly by 12.75% with the help of our task augmentation method. The model performance on several code categories has satisfied the requirements for application in industrial systems. Our proposed approach has been adopted by Alibaba's BizCook platform. To the best of our knowledge, this is the first domain code generation system adopted in industrial development environments.

category, code generation, expression, (12 more...)

arXiv:2208.10091

Country:

Asia > Singapore (0.05)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)