Text Quality-Based Pruning for Efficient Training of Language Models

Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Shang-Wen Li, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer

May-10-2024 – arXiv.org Artificial Intelligence

... attention in recent years due to their impressive performance in various natural language processing (NLP) tasks (Zhang et al., 2022; Penedo et al., 2023; Touvron et al., 2023; Zhou et al., 2023; Liu et al., 2019). However, their training process often relies on computationally intensive procedures that involve massive datasets and compute requirements which hinders training large scale LMs on noisy real-world or domain specific datasets. What's worse is that several of these datasets are uncurated and may contain harmful content which the LM model can potentially pick up during the training process (Deshpande et al., 2023; Schramowski et al., 2022; Kuchnik et al., 2023).

By leveraging this numerical text quality score, we demonstrate how it can be used to prune the original dataset, enabling the training of LMs using only a fraction of the data. Our approach aims to identify and eliminate low-quality text instances, thereby streamlining the training process and mitigating the burden of handling large-scale datasets. We also remove potentially harmful content from the data by ensuring that harmful content is rated poorly by our text quality score which can then be pruned. We observe an absolute improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% lesser data and training ...
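
The pruning step described in the excerpt (score every text instance, then train on only the best-scoring fraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not say how the quality score is computed, so `toy_score` is a purely hypothetical placeholder, and `prune_by_quality` and `keep_fraction` are names introduced here for illustration only. A `keep_fraction` of 0.6 corresponds to the "40% lesser data" setting mentioned above.

```python
from typing import Callable, List, Sequence


def prune_by_quality(
    texts: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the highest-scoring `keep_fraction` of texts, in original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not texts:
        return []
    scores = [score_fn(t) for t in texts]
    n_keep = max(1, round(len(texts) * keep_fraction))
    # Rank indices by descending score; ties fall back to original position.
    ranked = sorted(range(len(texts)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:n_keep])
    return [texts[i] for i in keep]


def toy_score(text: str) -> float:
    """Hypothetical placeholder score (NOT the paper's metric):
    the fraction of characters that are letters or whitespace."""
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "$$$ ### 1234 %%%",
        "Language models need data.",
        "<<<>>>",
        "Short but clean",
    ]
    # Keep 60% of the corpus, i.e. prune the lowest-scoring 40%.
    for line in prune_by_quality(corpus, toy_score, keep_fraction=0.6):
        print(line)
```

Any scoring function with the same signature can be swapped in for `toy_score`. Because harmful content would also need to receive a low score for the removal described above to work, the choice of scorer matters more than the selection logic itself.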

large language model, machine learning, natural language

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)
  - Natural Language
    - Chatbot (0.95)
    - Large Language Model (1.00)
