
What is IBM's Project CodeNet?

#artificialintelligence

At its recently concluded Think 2021 conference, IBM introduced Project CodeNet to develop machine learning models that can help with programming. The large dataset consists of 14 million code samples and 500 million lines of code in over 55 different languages, including C, Java, Go, Python, COBOL, Pascal, and Fortran. Modern computer programs have millions of lines of code and are hard to debug, maintain, update, and document. The use of artificial intelligence to write code has been an important area of research for many years. However, it is easier said than done.


IBM's Project CodeNet will test how far you can push AI to write software

#artificialintelligence

IBM's AI research division has released a 14-million-sample dataset to develop machine learning models that can help in programming tasks. Called Project CodeNet, the dataset takes its name after ImageNet, the famous repository of labeled photos that triggered a revolution in computer vision and deep learning. While there's a scant chance that machine learning models built on the CodeNet dataset will make human programmers redundant, there's reason to be hopeful that they will make developers more productive. In the early 2010s, impressive advances in machine learning triggered excitement (and fear) about artificial intelligence soon automating many tasks, including programming. But AI's penetration in software development has been extremely limited.


Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Puri, Ruchir, Kung, David S., Janssen, Geert, Zhang, Wei, Domeniconi, Giacomo, Zolotov, Vladimir, Dolby, Julian, Chen, Jie, Choudhury, Mihir, Decker, Lindsey, Thost, Veronika, Buratti, Luca, Pujar, Saurabh, Finkler, Ulrich

arXiv.org Artificial Intelligence

Advancements in deep learning and machine learning algorithms have enabled breakthrough progress in computer vision, speech recognition, natural language processing and beyond. In addition, over the last several decades, software has been built into the fabric of every aspect of our society. Together, these two trends have generated new interest in the fast-emerging research area of AI for Code. As software development becomes ubiquitous across all industries and the code infrastructure of enterprise legacy applications ages, it is more critical than ever to increase software development productivity and modernize legacy applications. Over the last decade, datasets like ImageNet, with its large scale and diversity, have played a pivotal role in algorithmic advancements from computer vision to language and speech understanding. In this paper, we present Project CodeNet, a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate the algorithmic advancements in AI for Code. It consists of 14M code samples and about 500M lines of code in 55 different programming languages. Project CodeNet is unique not only in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance (both runtime and memory) improvement techniques. CodeNet also provides sample input and output test sets for over 7M code samples, which can be critical for determining code equivalence in different languages. As a usability feature, we provide several preprocessing tools in Project CodeNet to transform source code into representations that can be readily used as inputs into machine learning models.
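The abstract notes that CodeNet ships sample input and output test sets that can help determine whether two code samples are functionally equivalent. A minimal sketch of that idea follows; the test pairs and the two toy programs are invented for illustration and are not actual CodeNet data or tooling. Each candidate program is run against the same stdin/expected-stdout pairs, and two programs count as equivalent if both pass every test:

```python
import subprocess
import sys

# Hypothetical CodeNet-style test pairs: (stdin, expected stdout).
TESTS = [("3 4\n", "7\n"), ("10 -2\n", "8\n")]

# Two syntactically different solutions to the same toy problem.
PROGRAM_A = "a, b = map(int, input().split()); print(a + b)"
PROGRAM_B = "print(sum(map(int, input().split())))"

def passes_tests(source: str, tests) -> bool:
    """Run a Python source string against each (stdin, expected stdout) pair."""
    for stdin_data, expected in tests:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_data, capture_output=True, text=True, timeout=5,
        )
        if result.returncode != 0 or result.stdout != expected:
            return False
    return True

# Behavioral equivalence: both programs agree with the reference I/O on
# every test case, even though their source code differs.
equivalent = passes_tests(PROGRAM_A, TESTS) and passes_tests(PROGRAM_B, TESTS)
print(equivalent)
```

The same input/output-based check works across languages, since only the external behavior of each program is compared, not its source.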




Kickstarting AI for Code: Introducing IBM's Project CodeNet

#artificialintelligence

"Software is eating the world," US entrepreneur Marc Andreessen famously wrote in 2011. Fast-forward to today: software is in financial services and healthcare, smartphones and smart homes. Such large volumes of code, however, are a challenge to debug, maintain, and update, especially as enterprises aim to modernize their aging software infrastructure. As a result, we find ourselves in a new age where it's essential to take advantage of today's powerful technologies, like artificial intelligence (AI) and hybrid cloud, to create new solutions that can modernize processes across the information technology (IT) pipeline. Project CodeNet is a large dataset aimed at teaching AI to code: it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from widely used ones like C, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.


At Think conference, IBM puts AI and hybrid cloud to work - SiliconANGLE

#artificialintelligence

IBM Corp. is pushing the envelope on hybrid cloud and artificial intelligence with a number of key announcements early Tuesday ahead of its Think 2021 event, chiefly aimed at accelerating its customers' digital transformation strategies. One of the main highlights of today's announcements is a new AutoSQL capability within IBM's Cloud Pak for Data offering that automates data access and management without needing to move the data first. The company also unveiled a new AI-based tool for modernizing applications and workloads to run in hybrid cloud environments, plus new AI capabilities in Watson and advancements that should help scale up quantum computing to more use cases. Available Tuesday, the new AutoSQL capability for IBM Cloud Pak for Data is a big deal because it enables companies to automate access, integration, and management of their data no matter where it resides, the company said. IBM said it's addressing one of the most critical pain points customers face as they attempt to reduce the complexity of curating data for AI.


IBM's CodeNet dataset can teach AI to translate computer languages

Engadget

AI and machine learning systems have become increasingly competent in recent years, capable of not just understanding the written word but writing it as well. But while these artificial intelligences have nearly mastered the English language, they have yet to become fluent in the language of computers -- that is, until now. IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code. Over the past decade, advancements in AI have mainly been "driven by deep neural networks, and even that, it was driven by three major factors: data with the availability of large data sets for training, innovations in new algorithms, and the massive acceleration of faster and faster compute hardware driven by GPUs," Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation, likening the new dataset to the venerated ImageNet, which spawned the recent computer vision land rush. "Software is eating the world," Marc Andreessen wrote in 2011.