Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Yu, Yao-Ching, Chiang, Tsun-Han, Tsai, Cheng-Wei, Huang, Chien-Ming, Tsao, Wen-Kwang
– arXiv.org Artificial Intelligence
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets remain scarce, with a particular shortage of high-quality pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pretraining on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain on the security certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
Feb-16-2025