e6c2e85db1f1039177c4495ccd399ac4-Supplemental-Conference.pdf

Apr-30-2026, 03:23:00 GMT–Neural Information Processing Systems

A.1 Preliminary Study

The basic GPT-2 model, which has 12 transformer blocks and 12 attention heads with 768 hidden dimensions, is trained from scratch on each corpus. The Huggingface transformers [4] and PyTorch [2] toolkits are used to train the GPT-2 model in a distributed manner on an A100 GPU server. The hyper-parameters used during training are shown in Table 1.

Hyper-parameter            Value
Optimization steps         100K
Test interval              10K
Dropout rate               0.1
Grad clipping              1.0
Learning rate              5e-5
Batch size                 128
Maximum sequence length    256
Warmup steps               10K
Learning scheduler         Linear decay
Random seed                0
Number of GPUs             4
Learning objective         Cross-Entropy Loss

Table 1: The hyper-parameters used during the GPT-2 training procedure.

Most of the hyper-parameters for our proposed method are the same as those in Table 1, for better variable control. The hyper-parameters specific to our proposed method are the length of the repetitive n-gram and its repetition dropout rate p, which are set to 2 and 0.6, respectively.
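To make the interaction of the learning rate, warmup steps, and linear-decay scheduler in Table 1 concrete, the following is a minimal sketch of one common way such a schedule is computed. It assumes linear warmup from zero to the peak rate followed by linear decay to zero at the final step; the text does not state the exact form used, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=10_000, total_steps=100_000):
    """Learning rate at a given optimization step.

    Assumption (not confirmed by the text): linear warmup from 0 to peak_lr
    over warmup_steps, then linear decay to 0 at total_steps.
    Default values are taken from Table 1.
    """
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the peak rate by the fraction of decay steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Print the rate at every test interval (10K steps, as in Table 1).
    for step in range(0, 100_001, 10_000):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 10K, exactly when warmup ends, and reaches zero at the last of the 100K optimization steps.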


