Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

Raff, Edward, Curtin, Ryan R., Everett, Derek, Joyce, Robert J., Holt, James

Nov-19-2025–arXiv.org Artificial Intelligence

A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (under 2 MB), speed (multiple GB/s), and latency (under 10 ms) for deployment in numerous malware detection scenarios. However, although we have consistently found that 6- to 8-grams achieve the best accuracy in our production deployments, we have been unable to deploy regularly updated models because of the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution models the distribution of n-grams well, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to a 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach selects the top-k items with little error, and we describe the interplay between theory and engineering required to achieve these results.
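
To make the underlying task concrete, the following is a minimal sketch of exact top-k byte n-gram counting, the expensive baseline problem the abstract refers to. It is not the Zipf-Gramming algorithm, whose details are not given on this page. The function name `top_k_byte_ngrams` and the choice to count every occurrence (rather than, say, once per file) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def top_k_byte_ngrams(files: Iterable[bytes], n: int, k: int) -> list[tuple[bytes, int]]:
    """Return the k most frequent byte n-grams across all inputs.

    Naive exact baseline (hypothetical helper, not the paper's method):
    every distinct n-gram gets its own counter, so memory grows with the
    number of distinct n-grams seen. That growth is what makes this
    approach costly on terabyte-scale corpora.
    """
    counts: Counter = Counter()
    for data in files:
        # Slide a window of n bytes over the file; inputs shorter than n
        # yield an empty range and contribute nothing.
        counts.update(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


if __name__ == "__main__":
    sample = [b"abcabc", b"abcd"]
    print(top_k_byte_ngrams(sample, n=3, k=1))
```

For 6- to 8-grams the space of possible keys is far too large to hold an exact counter per n-gram at corpus scale, which is presumably why an approximate extractor that exploits the Zipfian frequency distribution becomes attractive.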

data mining, machine learning, natural language

Country:
- North America > United States > New York (0.28)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology
  - Security & Privacy (1.00)
  - Data Science > Data Mining (0.94)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.67)
