The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Katzy, Jonathan, Popescu, Razvan Mihai, van Deursen, Arie, Izadi, Maliheh

arXiv.org Artificial Intelligence 

The data-intensive training process of Large Language Models (LLMs) has driven the release of numerous large-scale datasets, particularly for code, to facilitate the development of new models. This rapid increase in the amount of training data used to pre-train LLMs has resulted in extensive datasets covering almost all publicly available code [1]-[3]. To assess the success of such LLMs in downstream tasks, fresh data not seen during training is needed. Otherwise such evaluations are contaminated, possibly resulting in overly optimistic results. Unfortunately, obtaining such non-contaminated data is increasingly difficult. In fact, a recent study establishes ...

To cover more specific use cases, we also include domain-specific languages such as Mathematica, Emacs-Lisp, and Coq. A complete list of all languages included in the dataset is presented in Table I.

B. Query

We focus on repositories that have one of the targeted languages as the main language of the repository. We further select only repositories that are licensed under non-permissive licenses. We choose non-permissive licenses as an initial filter for repositories, as many large-scale datasets focus on exclusively unlicensed or permissively licensed code [2], [3], [5].
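The repository selection described in the Query step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Repo` record, the `select_repositories` helper, and both constant sets are hypothetical names introduced here. The language set contains only the three languages named in the text, and the license identifiers are assumed examples of non-permissive (copyleft) licenses, not the list used for the dataset.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumption: only the three languages named in the text; the full list is in Table I.
TARGET_LANGUAGES = {"Mathematica", "Emacs-Lisp", "Coq"}

# Assumption: example copyleft license identifiers, not the dataset's actual list.
NON_PERMISSIVE_LICENSES = {"GPL-3.0", "AGPL-3.0", "LGPL-3.0"}


@dataclass(frozen=True)
class Repo:
    """Hypothetical repository metadata record."""
    name: str
    main_language: Optional[str]
    license_id: Optional[str]


def select_repositories(repos: Iterable[Repo]) -> List[Repo]:
    """Keep repositories whose main language is targeted and whose
    license is in the non-permissive set."""
    return [
        repo
        for repo in repos
        if repo.main_language in TARGET_LANGUAGES
        and repo.license_id in NON_PERMISSIVE_LICENSES
    ]


if __name__ == "__main__":
    candidates = [
        Repo("proofs", "Coq", "GPL-3.0"),        # kept: targeted language, copyleft
        Repo("dotfiles", "Emacs-Lisp", "MIT"),   # dropped: permissive license
        Repo("webapp", "JavaScript", "GPL-3.0"), # dropped: language not in this sketch's set
        Repo("scratch", "Coq", None),            # dropped: unlicensed
    ]
    for repo in select_repositories(candidates):
        print(repo.name)
```

Both conditions are applied to repository-level metadata, so unlicensed and permissively licensed repositories are excluded before any file is inspected.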
