Task Oriented In-Domain Data Augmentation

Xiao Liang, Xinyu Hu, Simiao Zuo, Yeyun Gong, Qiang Lou, Yi Liu, Shao-Lun Huang, Jian Jiao

arXiv.org Artificial Intelligence 

Large Language Models (LLMs) have shown superior performance in various applications and fields. To achieve better performance on specialized domains such as law and advertisement, LLMs are often continually pre-trained on in-domain data. However, existing approaches suffer from two major issues. First, in-domain data are scarce compared to general domain-agnostic data. Second, data used for continual pre-training are not task-aware, so they may not be helpful to downstream applications. We propose TRAIT, a task-oriented in-domain data augmentation framework. The framework has two parts: in-domain data selection and task-oriented synthetic passage generation. The data selection strategy identifies and selects a large amount of in-domain data from general corpora, significantly enriching the domain knowledge in the continual pre-training data. The synthetic passages contain guidance on how to use domain knowledge to answer questions about downstream tasks. We adapt LLMs to two domains: advertisement and math. On average, TRAIT improves LLM performance by 8% in the advertisement domain and 7.5% in the math domain.

Large language models (LLMs) have achieved significant performance improvements in various applications such as language modeling (Brown et al., 2020; Touvron et al., 2023; Chowdhery et al., 2023) and visual understanding (Radford et al., 2021). They have also shown superior performance in fields such as finance (Xie et al., 2023b), e-commerce (Ma et al., 2023) and healthcare (Bakhshandeh, 2023). However, these models are usually trained on large amounts of general domain-agnostic data, such as web corpora. Because of this lack of domain-specific training, LLMs suffer from subpar performance when directly applied to certain domains such as advertisement. To adapt LLMs to a specific domain, continual pre-training methods (Gururangan et al., 2020) are commonly applied: the LLM is continually pre-trained on in-domain corpora so that it acquires domain knowledge and better adapts to downstream tasks.
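To make the two-stage pipeline concrete, the sketch below illustrates a TRAIT-style augmentation flow. It is a minimal sketch under stated assumptions: a lightweight in-domain scorer (e.g., a domain classifier run over general-corpus documents) and a generic LLM completion function are assumed; the threshold, prompt wording, and function names are illustrative, not the paper's exact implementation.

```python
# Illustrative sketch of the two stages described above:
# (1) select in-domain documents from a general corpus,
# (2) generate task-oriented synthetic passages from them.
# score_fn, generate, and the prompt template are assumptions.

from typing import Callable, Iterable


def select_in_domain(corpus: Iterable[str],
                     score_fn: Callable[[str], float],
                     threshold: float = 0.5) -> list[str]:
    """Stage 1: keep general-corpus documents that an in-domain
    scorer (e.g., a lightweight domain classifier) rates highly."""
    return [doc for doc in corpus if score_fn(doc) >= threshold]


def make_task_oriented_passage(doc: str,
                               task_question: str,
                               generate: Callable[[str], str]) -> str:
    """Stage 2: prompt an LLM to write a passage showing how the
    document's domain knowledge helps answer a downstream-task
    question, yielding task-aware continual pre-training data."""
    prompt = (
        "Domain document:\n" + doc + "\n\n"
        "Downstream task question:\n" + task_question + "\n\n"
        "Write a passage explaining how the domain knowledge above "
        "can be used to answer the question."
    )
    return generate(prompt)
```

The selected documents and the generated passages would then be mixed into the continual pre-training corpus, addressing scarcity (stage 1) and task-awareness (stage 2) respectively.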
