Automatic Binary Dataset Construction for Machine Learning

Neural Information Processing Systems 

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities.