Brief Review -- Codex: Evaluating Large Language Models Trained on Code

Apr-1-2023, 11:15:25 GMT–#artificialintelligence

The training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. Authors filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, the final dataset totaled 159 GB. The training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. Authors filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.

codex, language model, line length, (15 more...)

#artificialintelligence

Apr-1-2023, 11:15:25 GMT

News Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.95)
  - Natural Language > Large Language Model (0.77)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found