 malware


A Proof of Proposition 2.2: additive expansion proposition

Neural Information Processing Systems

We first define the restricted Cheeger constant in the link prediction task. Then, according to Proposition 2.1, we have: [...] Then, we can draw the same conclusion with Eq. 12 [...] Thus, Eq. 16 can be simplified to: [...] "sites" [...] Based on Eq. 15 and Eq. 17, we can rewrite L [...] The inequality holds due to the assumption. Knowledge discovery: In the 5 random experiments, we add 500 pseudo links in each iteration. The metadata of the nodes is all strongly relevant to "Linux". Both papers focus on "malware"/"phishing" under the topic "Computer security". The detailed result of the case study is shown in Table 6.


AI is already making online swindles easier. It could get much worse.

MIT Technology Review

Some cybersecurity researchers say it's too early to worry about AI-orchestrated cyberattacks. Others say it could already be happening. Anton Cherepanov is always on the lookout for something interesting. And in late August last year, he spotted just that.


149 Million Usernames and Passwords Exposed by Unsecured Database

WIRED

This "dream wish list for criminals" includes millions of Gmail, Facebook, banking logins, and more. The researcher who discovered it suspects they were collected using infostealing malware. A database containing 149 million account usernames and passwords (including 48 million for Gmail, 17 million for Facebook, and 420,000 for the cryptocurrency platform Binance) has been removed after a researcher reported the exposure to the hosting provider. The longtime security analyst who discovered the database, Jeremiah Fowler, could not find indications of who owned or operated it, so he worked to notify the host, which took down the trove because it violated a terms of service agreement. In addition to email and social media logins for a number of platforms, Fowler also observed credentials for government systems from multiple countries as well as consumer banking and credit card logins and media streaming platforms.


Comparative Analysis of Hash-based Malware Clustering via K-Means

Thein, Aink Acrie Soe, Pitropakis, Nikolaos, Papadopoulos, Pavlos, Grierson, Sam, Jan, Sana Ullah

arXiv.org Artificial Intelligence

With the adoption of multiple digital devices in everyday life, the cyber-attack surface has increased. Adversaries are continuously exploring new avenues to exploit them and deploy malware. On the other hand, detection approaches typically employ hashing-based algorithms such as SSDeep, TLSH, and IMPHash to capture structural and behavioural similarities among binaries. This work focuses on the analysis and evaluation of these techniques for clustering malware samples using the K-means algorithm. More specifically, we experimented with established malware families and traits and found that TLSH and IMPHash produce more distinct, semantically meaningful clusters, whereas SSDeep is more efficient for broader classification tasks. The findings of this work can guide the development of more robust threat-detection mechanisms and adaptive security mechanisms.
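As a rough illustration of the clustering step described above, the following sketch runs plain K-means (Lloyd's algorithm) over vectors derived from digest strings. It does not compute SSDeep, TLSH, or IMPHash values: the digests are made-up hexadecimal placeholders, and the `digest_to_vector` encoding (one number per hex character) is an assumption chosen for brevity, not the feature representation used in the paper.

```python
import random
from typing import List, Sequence


def digest_to_vector(hex_digest: str) -> List[float]:
    """Map each hex character to its value 0..15 (illustrative encoding, an assumption)."""
    return [float(int(ch, 16)) for ch in hex_digest]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Lloyd's algorithm; returns one cluster index per input point."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up placeholder digests: two near-identical pairs.
digests = ["0011", "0012", "ffee", "ffed"]
labels = kmeans([digest_to_vector(d) for d in digests], k=2)
```

With these placeholders, the two similar pairs end up in separate clusters. Real fuzzy hashes would normally be compared with their own similarity measures, so a production pipeline would likely need a different encoding or distance.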


Hackers tricked ChatGPT, Grok and Google into helping them install malware

Engadget

Using popular AI chatbots, attackers created search-friendly links that instructed a user to hack their own device. Ever since reporting earlier this year on how easy it is to trick an agentic browser, I've been following the intersections between modern AI and old-school scams. Now, there's a new convergence on the horizon: hackers are apparently using AI prompts to seed Google search results with dangerous commands. When executed by unknowing users, these commands prompt computers to give the hackers the access they need to install malware. The warning comes by way of a recent report from detection-and-response firm Huntress.


Clustering Malware at Scale: A First Full-Benchmark Study

Mocko, Martin, Ševcech, Jakub, Chudá, Daniela

arXiv.org Artificial Intelligence

Recent years have shown that malware attacks still happen with high frequency. Malware experts seek to categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One of the ways in which groups of malware samples can be identified is through malware clustering. Despite the efforts of the community, malware clustering which incorporates benign samples has been under-explored. Moreover, despite the availability of larger public benchmark malware datasets, malware clustering studies have avoided fully utilizing these datasets in their experiments, often resorting to small datasets with only a few families. Additionally, the current state-of-the-art solutions for malware clustering remain unclear. In our study, we evaluate malware clustering quality and establish the state-of-the-art on Bodmas and Ember - two large public benchmark malware datasets. Ours is the first study of malware clustering performed on whole malware benchmark datasets. Additionally, we extend the malware clustering task by incorporating benign samples. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. We find that there are differences in the quality of the created clusters between Ember and Bodmas, as well as a private industry dataset. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.


MASCOT: Analyzing Malware Evolution Through A Well-Curated Source Code Dataset

Li, Bojing, Zhong, Duo, Nadendla, Dharani, Terceros, Gabriel, Bhandar, Prajna, S, Raguvir, Nicholas, Charles

arXiv.org Artificial Intelligence

In recent years, the explosion of malware and extensive code reuse have formed complex evolutionary connections among malware specimens. The rapid pace of development makes it challenging for existing studies to characterize recent evolutionary trends. In addition, intuitive tools to untangle these intricate connections between malware specimens or categories are urgently needed. This paper introduces a manually-reviewed malware source code dataset containing 6032 specimens. Building on and extending current research from a software engineering perspective, we systematically evaluate the scale, development costs, code quality, as well as security and dependencies of modern malware. We further introduce a multi-view genealogy analysis to clarify malware connections: at an overall view, this analysis quantifies the strength and direction of connections among specimens and categories; at a detailed view, it traces the evolutionary histories of individual specimens. Experimental results indicate that, despite persistent shortcomings in code quality, malware specimens exhibit an increasing complexity and standardization, in step with the development of mainstream software engineering practices. Meanwhile, our genealogy analysis intuitively reveals lineage expansion and evolution driven by code reuse, providing new evidence and tools for understanding the formation and evolution of the malware ecosystem. With the rapid development of information technology and large language models, malware has experienced a surge in recent years, exhibiting strong connections among categories and specimens, as well as high code reuse rates [1]. In the past 12 months, more than 107 million new malicious or potentially unwanted applications were detected [2], [3]. Many of these malware specimens are variants of previously known malware, which indicates the prevalence of code reuse and family-oriented evolution.
However, the difficulty of collecting, reviewing, and labeling has resulted in a scarcity of source code datasets [4]. Existing datasets lack human curation, reliable labels, and timestamps.


Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

Bommarito, Michael J. II

arXiv.org Artificial Intelligence

Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k, providing an accessible resource for researchers, practitioners, and students alike.


Google issues warning on fake VPN apps

FOX News

Google warns Android users about fake VPN apps containing malware including info stealers, banking trojans and remote access tools designed to steal personal data.


Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

Raff, Edward, Curtin, Ryan R., Everett, Derek, Joyce, Robert J., Holt, James

arXiv.org Artificial Intelligence

A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to 35× faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error and the interplay between theory and engineering required to achieve these results.
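To make the underlying problem concrete, here is a minimal sketch of the naive baseline: exact top-k byte n-gram extraction by counting every n-gram. This is not the Zipf-Gramming algorithm, whose details are not given in the abstract; the function names and the tiny in-memory corpus are hypothetical. Memory grows with the number of distinct n-grams, which suggests why full counting becomes costly at terabyte scale.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def byte_ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every contiguous n-byte window of `data` (none if data is shorter than n)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k by full counting; one counter entry per distinct n-gram."""
    counts: Counter = Counter()
    for blob in files:
        counts.update(byte_ngrams(blob, n))
    return counts.most_common(k)


# Hypothetical stand-in for file contents; real input would be executable bytes.
corpus = [b"abcabcabc", b"abcxyz"]
print(top_k_ngrams(corpus, n=3, k=2))
```

In practice the input would be streamed from disk rather than held in a list, and an approximate or sampled counter would replace the exact `Counter` once the set of distinct n-grams no longer fits in memory.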