The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models
Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
arXiv.org Artificial Intelligence
The data-intensive training process of Large Language Models (LLMs) has driven the release of numerous large-scale datasets, particularly for code, to facilitate the development of new models. This rapid increase in the amount of training data used to pre-train LLMs has resulted in extensive datasets covering almost all publicly available code [1]-[3]. To assess the success of such LLMs in downstream tasks, fresh data not seen during training is needed. Otherwise such evaluations are contaminated, possibly resulting in overly optimistic results. Unfortunately, obtaining such non-contaminated data is increasingly difficult. In fact, a recent study establishes ...

To cover more specific use cases, we also include domain-specific languages such as Mathematica, Emacs-Lisp, and Coq. A complete list of all languages included in the dataset is presented in Table I.

B. Query

We focus on repositories that have one of the targeted languages as the main language of the repository. We further select only repositories that are licensed under non-permissive licenses. We choose non-permissive licenses as an initial filter for repositories, as many large-scale datasets focus on exclusively unlicensed or permissively licensed code [2], [3], [5].
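The repository selection described in the Query step can be illustrated with a minimal sketch. It assumes repository metadata is already available as simple records. The `Repo` class, its field names, the language set, and the license identifiers below are hypothetical placeholders chosen for illustration, not the authors' actual query or license list.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: illustrative subsets only; the paper's full language list is in its Table I.
TARGET_LANGUAGES = {"Mathematica", "Emacs-Lisp", "Coq"}
# Assumption: a few well-known copyleft licenses stand in for "non-permissive".
NON_PERMISSIVE_LICENSES = {"gpl-2.0", "gpl-3.0", "agpl-3.0", "lgpl-3.0"}


@dataclass(frozen=True)
class Repo:
    """Hypothetical metadata record for one repository."""
    name: str
    main_language: Optional[str]
    license_key: Optional[str]  # None models an unlicensed repository


def select_repositories(repos: List[Repo]) -> List[Repo]:
    """Keep repositories whose main language is targeted and whose
    license is in the non-permissive set."""
    selected = []
    for repo in repos:
        if repo.main_language not in TARGET_LANGUAGES:
            continue  # main language is not one of the targeted languages
        if repo.license_key is None:
            continue  # unlicensed code is excluded
        if repo.license_key.lower() not in NON_PERMISSIVE_LICENSES:
            continue  # permissive or unknown license is excluded
        selected.append(repo)
    return selected


candidates = [
    Repo("proofs", "Coq", "GPL-3.0"),
    Repo("editor-config", "Emacs-Lisp", "mit"),
    Repo("notebooks", "Mathematica", None),
    Repo("webapp", "JavaScript", "agpl-3.0"),
]
selected_names = [repo.name for repo in select_repositories(candidates)]
```

In this sketch only `proofs` passes both filters: the other records fail on a permissive license, a missing license, or a non-targeted main language.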
Jan-16-2025
- Country:
- North America > United States
- New York > New York County > New York City (0.04)
- Europe
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Netherlands > South Holland
- Delft (0.06)
- Genre:
- Research Report (0.70)
- Technology: