360Zhinao Technical Report

360Zhinao Team


For rapid development in pretraining, we establish a stable and sensitive ablation environment to evaluate and compare experiment runs at minimal model size. During alignment, we focus primarily on data, striving to balance quantity and quality through filtering and reformatting. With tailored data, 360Zhinao-7B's context window is readily extended to 32K and 360K. Reward models (RMs) and RLHF are trained after SFT and credibly applied to specific tasks. Together, these contributions give 360Zhinao-7B competitive performance among models of similar size.

In recent years, the field of natural language processing (NLP) has undergone a profound transformation, fueled by the advent of large language models (LLMs) (Bubeck et al., 2023; Touvron et al., 2023a; OpenAI, 2023), which have emerged as a cornerstone for revolutionizing the way we understand and generate human language. LLMs represent a new paradigm in artificial intelligence (AI) research, characterized by their immense scale, complexity, and versatility (Zhao et al., 2023). These models, typically built upon advanced neural network architectures such as Transformers, are trained on vast amounts of text data encompassing billions or even trillions of words. This extensive training endows LLMs with a deep understanding of linguistic structure, nuance, and context, enabling them to generate human-like text and perform a myriad of NLP tasks with unprecedented accuracy and fluency (Yang et al., 2024).

Despite these impressive capabilities, training an LLM from scratch still faces several challenges. The training journey can be divided into two stages: pretraining and alignment (Zhang et al., 2023). In the pretraining stage, the model learns from large-scale textual data to build its foundational knowledge and language comprehension. Two obstacles stand out here (Zhao et al., 2023). First, given the enormity of pretraining data, refining the training corpus to enhance the base model's performance is paramount. While extensive research has delved into data cleaning and sampling methodologies (Soldaini et al., 2024; Penedo et al., 2023; Wenzek et al., 2019; Gunasekar et al., 2023), the sheer scale and intricacy of pretraining datasets still leave ample room for improving informational density and efficiency. Second, establishing a stable and sensitive ablation environment for accurately assessing data strategies poses another challenge (Chang et al., 2024; Zhou et al., 2023).
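To make the corpus-refinement obstacle concrete, the sketch below shows the kind of heuristic document-level quality filter used in the cited cleaning pipelines (e.g., CCNet-style rules). The specific rules and thresholds are illustrative assumptions, not 360Zhinao's actual pipeline.

```python
import re

def heuristic_quality_filter(doc: str,
                             min_words: int = 50,
                             max_words: int = 100_000,
                             max_symbol_ratio: float = 0.10,
                             min_unique_ratio: float = 0.30) -> bool:
    """Return True if a document passes simple quality heuristics.

    All thresholds are illustrative assumptions, not the report's values.
    """
    words = doc.split()
    n = len(words)
    # Length bounds: drop fragments and pathologically long concatenations.
    if not (min_words <= n <= max_words):
        return False
    # Symbol-to-word ratio: high values usually indicate leftover markup.
    symbols = len(re.findall(r"[#<>{}\\|]", doc))
    if symbols / n > max_symbol_ratio:
        return False
    # Lexical diversity: boilerplate and repeated lines collapse this ratio.
    if len({w.lower() for w in words}) / n < min_unique_ratio:
        return False
    return True

# Usage: keep only documents that pass every heuristic.
corpus = ["a genuinely informative document " * 20, "<div># # # #</div>"]
cleaned = [doc for doc in corpus if heuristic_quality_filter(doc)]
```

Real pipelines layer many such rules per language and source, then tune thresholds against downstream ablations rather than fixing them a priori.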

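The abstract also reports extending 360Zhinao-7B's context window to 32K and 360K with tailored data. The exact mechanism is not given in this excerpt; a common recipe for such extension is continued training on long documents with an enlarged RoPE base frequency, sketched below with illustrative values only.

```python
import torch

def rope_frequencies(dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Per-pair inverse frequencies of rotary position embedding (RoPE)."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Hypothetical numbers: raising the RoPE base slows each dimension's
# rotation, so positions far beyond the original pretraining window
# remain distinguishable after continued training on long data.
head_dim = 128
short_ctx_freqs = rope_frequencies(head_dim, base=10_000.0)    # original window
long_ctx_freqs = rope_frequencies(head_dim, base=1_000_000.0)  # extended window

positions = torch.arange(32_768).float()
angles = torch.outer(positions, long_ctx_freqs)  # (seq_len, head_dim // 2)
cos, sin = angles.cos(), angles.sin()            # cached, then applied to Q/K
```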