 open-source code




A First Look at License Compliance Capability of LLMs in Code Generation

arXiv.org Artificial Intelligence

Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark, LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.


Are ChatGPT and Other Similar Systems the Modern Lernaean Hydras of AI?

arXiv.org Artificial Intelligence

The rise of Generative Artificial Intelligence systems ("AI systems") has created unprecedented social engagement. AI code generation systems provide responses (output) to questions or requests by accessing the vast library of open-source code created by developers over the past few decades. However, they do so by allegedly stealing the open-source code stored in virtual libraries, known as repositories. This Article focuses on how this happens and whether there is a solution that protects innovation and avoids years of litigation. We also touch upon the array of issues raised by the relationship between AI and copyright. Looking ahead, we propose the following: (a) immediate changes to the licenses for open-source code created by developers that will limit access to and/or use of any open-source code to humans only; (b) revisions to the Massachusetts Institute of Technology ("MIT") license so that AI systems are required to procure appropriate licenses from open-source code developers, which we believe will harmonize standards and build social consensus for the benefit of all of humanity, rather than promote profit-driven centers of innovation; (c) urgent legislative action to protect the future of AI systems while also promoting innovation; and (d) a shift in the burden of proof to AI systems in obfuscation cases.


Viral transmission in pedestrian crowds: Coupling an open-source code assessing the risks of airborne contagion with diverse pedestrian dynamics models

arXiv.org Artificial Intelligence

We study viral transmission in crowds via the short-ranged airborne pathway using a purely model-based approach. Our goal is two-pronged. Firstly, we illustrate with a concrete and pedagogical case study how to estimate the risks of new viral infections by coupling pedestrian simulations with the transmission algorithm that we recently released as open-source code. The algorithm hinges on pre-computed viral concentration maps derived from computational fluid dynamics (CFD) simulations. Secondly, we investigate to what extent the transmission risk predictions depend on the pedestrian dynamics model in use. For the simple bidirectional flow under consideration, the predictions are found to be surprisingly stable across initial conditions and models, despite the different microscopic arrangements of the simulated crowd, as long as the crowd evolves in a qualitatively similar way. On the other hand, when major changes are observed in the crowd's behaviour, notably whenever a jam occurs at the centre of the channel, the estimated risks surge drastically.


Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics

arXiv.org Artificial Intelligence

Intelligent or generative writing tools rely on large language models that recognize, summarize, translate, and predict content. This position paper probes the copyright interests of open data sets used to train large language models (LLMs). Our paper asks: how do LLMs trained on open data sets circumvent the copyright interests of the data they use? We start by defining software copyright and tracing its history. We rely on GitHub Copilot as a modern case study challenging software copyright. Our conclusion outlines obstacles that generative writing assistants create for copyright, and offers a practical road map for copyright analysis for developers, software law experts, and general users to consider in the context of intelligent LLM-powered writing tools.


The (ab)use of Open Source Code to Train Large Language Models

arXiv.org Artificial Intelligence

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.


Machine Learning Is Not Your Copilot: AI System Accused of Violating Open Source Copyright Licenses

#artificialintelligence

As previously reported in this space, the Court of Appeals for the Federal Circuit has ruled that an AI machine cannot be an inventor because it is not a "natural person." On November 11, 2022, a group of plaintiffs filed suit in the Northern District of California against several defendants, including GitHub, Inc., Microsoft Corporation, OpenAI, Inc., and companies related to OpenAI. The issue stems from a product called Copilot and a product integrated into Copilot called Codex. To provide some context for the issue, some backstory may help.


How open-source software shapes AI policy

#artificialintelligence

Open-source software quietly affects nearly every issue in AI policy, but it is largely absent from discussions around AI policy. Policymakers need to consider OSS's role in AI more actively. Open-source software (OSS), software that is free to access, use, and change without restrictions, plays a central role in the development and use of artificial intelligence (AI). Across open-source programming languages such as Python, R, C, Java, Scala, JavaScript, Julia, and others, there are thousands of implementations of machine learning algorithms. OSS frameworks for machine learning, including tidymodels in R and scikit-learn in Python, have helped consolidate many diverse algorithms into a consistent machine learning process and enabled far easier use for the everyday data scientist. There are also OSS tools specific to the especially important subfield of deep learning, which is dominated by Google's TensorFlow and Facebook's PyTorch.


How secure are your AI and machine learning projects?

#artificialintelligence

When enterprises adopt new technology, security is often on the back burner. It can seem more important to get new products or services to customers and internal users as quickly as possible and at the lowest cost. Good security can be slow and expensive. Artificial intelligence (AI) and machine learning (ML) offer all the same opportunities for vulnerabilities and misconfigurations as earlier technological advances, but they also have unique risks. As enterprises embark on major AI-powered digital transformations, those risks may become greater.