SantaCoder: don't reach for the stars!

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra

arXiv.org Artificial Intelligence 

Corresponding authors can be contacted at contact@bigcode-project.org.

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B-parameter models on the Java, JavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and evaluate them on the MultiPL-E text-to-code benchmark (Cassano et al., 2022). We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being substantially smaller. All models are released under an OpenRAIL license at https://hf.co/bigcode.

Over the last two years, we have witnessed tremendous progress in the development of code-generating AI assistants (Chen et al., 2021; Chowdhery et al., 2022; Nijkamp et al., 2022; Fried et al., 2022; Li et al., 2022; Athiwaratkun et al., 2022). Machine learning models are now capable of assisting professional developers by synthesizing novel code snippets, not only from surrounding code fragments but also from natural language instructions. The models powering these code completion systems are usually referred to as Large Language Models for Code--or code LLMs--and are created by training large transformer neural networks (Vaswani et al., 2017) on large corpora of source code. However, with the exception of a few small-scale efforts (Xu et al., 2022b), there is generally a lack of transparency around the development of code LLMs, in part due to their commercial value and the legal uncertainty surrounding the distribution of training data and models. Some groups have released model weights (Fried et al., 2022; Nijkamp et al., 2022) or provided access to the model through a paid API service (Chen et al., 2021; Athiwaratkun et al., 2022), but these works did not release their full training data or the preprocessing methods used.
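Since the checkpoints are released on the Hugging Face Hub, the snippet below sketches how such a model might be loaded and used for both left-to-right generation and infilling with the transformers library. This is a minimal sketch, not the paper's evaluation setup: the model id bigcode/santacoder and the fill-in-the-middle sentinel tokens (<fim-prefix>, <fim-suffix>, <fim-middle>) are assumptions based on the public release and should be checked against the model card at https://hf.co/bigcode.

```python
# Minimal sketch: load the released checkpoint and run (a) left-to-right
# generation and (b) fill-in-the-middle infilling.
# Assumptions not stated in the text above: the checkpoint name
# "bigcode/santacoder" and the FIM sentinel tokens; verify them on the
# model card at https://hf.co/bigcode before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"  # assumed model id under https://hf.co/bigcode
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# trust_remote_code is needed if the checkpoint ships a custom architecture
# (e.g. multi-query attention) alongside the weights.
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# (a) Left-to-right generation: complete a function from its signature.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))

# (b) Infilling: ask the model to fill the span between a prefix and a suffix.
prefix = "def print_hello():\n    "
suffix = "\n    print('goodbye')\n"
fim_prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
inputs = tokenizer(fim_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```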
