Jangda, Abhinav
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Cassano, Federico, Gouwar, John, Lucchetti, Francesca, Schlesinger, Claire, Anderson, Carolyn Jane, Greenberg, Michael, Jangda, Abhinav, Guha, Arjun
Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available; low-resource languages include OCaml, Racket, and several others. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages as follows: 1) we use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage, and 2) we use a Code LLM to translate the Python code to a target low-resource language, using the tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. With MultiPL-T-generated data, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On established benchmarks (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
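To make the MultiPL-T pipeline concrete, here is a minimal Python sketch of the translate-and-validate step, assuming a generate() helper that wraps some Code LLM (e.g., StarCoderBase); the helper, the prompt, and the timeout are illustrative placeholders, not the paper's actual harness.

import subprocess
import tempfile
from typing import Optional

def generate(prompt: str) -> str:
    """Hypothetical wrapper around a Code LLM call; not part of the paper's released code."""
    raise NotImplementedError

def translate_and_validate(py_function: str, py_tests: str) -> Optional[str]:
    """Translate a tested Python item to Lua; keep it only if its tests pass."""
    prompt = ("Translate the following Python function and its tests to Lua.\n\n"
              + py_function + "\n" + py_tests)
    lua_program = generate(prompt)
    with tempfile.NamedTemporaryFile("w", suffix=".lua", delete=False) as f:
        f.write(lua_program)
        path = f.name
    try:
        # A translation becomes a training item only if the Lua
        # interpreter runs the translated tests without error.
        result = subprocess.run(["lua", path], capture_output=True, timeout=15)
    except subprocess.TimeoutExpired:
        return None
    return lua_program if result.returncode == 0 else None

The same loop works for any target language with a command-line interpreter or compiler; only the prompt and the subprocess invocation change.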
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
Cassano, Federico, Gouwar, John, Nguyen, Daniel, Nguyen, Sydney, Phipps-Costin, Luna, Pinckney, Donald, Yee, Ming-Ho, Zi, Yangtian, Anderson, Carolyn Jane, Feldman, Molly Q, Guha, Arjun, Greenberg, Michael, Jangda, Abhinav
Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks, HumanEval and MBPP, to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
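At its core, MultiPL-E compiles the assertions of a Python benchmark into assertions in each target language. The toy translator below illustrates the idea for a Python-to-Lua direction; it is a sketch that handles only a few literal types, not the released implementation, which supports 18 target languages and a richer value grammar.

def to_lua(value) -> str:
    """Translate a Python literal into Lua source syntax."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, list):
        # Note: Lua's == is not a deep comparison on tables; a real
        # translator must also emit a deep-equality helper.
        return "{" + ", ".join(to_lua(v) for v in value) + "}"
    raise ValueError(f"unsupported literal: {value!r}")

def assert_call(func: str, args: list, expected) -> str:
    """Emit one Lua assertion checking func(args) == expected."""
    call = f"{func}({', '.join(to_lua(a) for a in args)})"
    return f"assert({call} == {to_lua(expected)})"

For example, assert_call("add", [2, 3], 5) yields assert(add(2, 3) == 5).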
Predicting Variable Types in Dynamically Typed Programming Languages
Jangda, Abhinav, Anand, Gaurav
Dynamically typed programming languages are popular because they increase programmer productivity. However, the absence of type annotations in the source code makes programs written in these languages difficult to understand, and the virtual machines that execute them cannot produce optimized code. To overcome this challenge, we develop a technique to predict the types of all identifiers, including variables and function return types. We propose the first implementation of $2^{nd}$-order Inside-Outside Recursive Neural Networks, with two variants, (i) Child-Sum Tree-LSTMs and (ii) N-ary RNNs, that can handle a large degree of tree branching. Given the Abstract Syntax Tree, we predict the types of all identifiers by performing just two passes over the tree, one bottom-up and one top-down, keeping both a content and a context representation for every node. This allows the representations to interact by combining different paths from the parent, siblings, and children, which is crucial for predicting types. Our best model achieves 44.33\% accuracy across 21 classes and a top-3 accuracy of 71.5\% on a Python dataset gathered from popular Python benchmarks.
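The bottom-up (inside) pass of the first variant builds on the Child-Sum Tree-LSTM of Tai et al. (2015). The following is a minimal PyTorch sketch of one such cell; names and shapes here are illustrative rather than the authors' code, and the paper's $2^{nd}$-order inside-outside formulation adds a top-down (outside) pass that this sketch omits.

import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim: int, mem_dim: int):
        super().__init__()
        self.iou = nn.Linear(in_dim + mem_dim, 3 * mem_dim)  # input, output, update gates
        self.fx = nn.Linear(in_dim, mem_dim)                 # forget gate, input part
        self.fh = nn.Linear(mem_dim, mem_dim, bias=False)    # forget gate, per-child part

    def forward(self, x, child_h, child_c):
        # x: (in_dim,); child_h, child_c: (num_children, mem_dim)
        h_sum = child_h.sum(dim=0)  # child-sum makes the cell order-insensitive
        i, o, u = self.iou(torch.cat([x, h_sum])).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.fx(x) + self.fh(child_h))  # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

For a leaf of the Abstract Syntax Tree, child_h and child_c are empty (0, mem_dim) tensors, so the child sums reduce to zero vectors.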