 file type


Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks

Nasr, Milad, Fratantonio, Yanick, Invernizzi, Luca, Albertini, Ange, Farah, Loua, Petit-Bianco, Alex, Terzis, Andreas, Thomas, Kurt, Bursztein, Elie, Carlini, Nicholas

arXiv.org Artificial Intelligence

As deep learning models become widely deployed as components within larger production systems, their individual shortcomings can create system-level vulnerabilities with real-world impact. This paper studies how adversarial attacks targeting an ML component can degrade or bypass an entire production-grade malware detection system, performing a case study analysis of Gmail's pipeline, where file-type identification relies on an ML model. The malware detection pipeline in use by Gmail contains a machine learning model that routes each potential malware sample to a specialized malware classifier to improve accuracy and performance. This model, called Magika, has been open sourced. By designing adversarial examples that fool Magika, we can cause the production malware service to incorrectly route malware to an unsuitable malware detector, thereby increasing our chance of evading detection. Specifically, by changing just 13 bytes of a malware sample, we can successfully evade Magika in 90% of cases, which allows us to send malware files over Gmail. We then turn our attention to defenses, and develop an approach to mitigate the severity of these types of attacks. For our defended production model, a highly resourced adversary requires 50 bytes to achieve just a 20% attack success rate. We implement this defense, and, thanks to a collaboration with Google engineers, it has already been deployed in production for the Gmail classifier.
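The routing weakness described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_detect_type`, the three scanners, and the byte patterns are hypothetical stand-ins, not Magika, not Gmail's pipeline, and not the paper's attack (which perturbs inputs to a learned model rather than overwriting magic bytes). The sketch only shows why a wrong file-type label can send a sample to a scanner that does not look for its payload.

```python
from typing import Callable, Dict, Tuple


def toy_detect_type(data: bytes) -> str:
    """Hypothetical file-type detector that only inspects leading bytes."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"MZ"):
        return "pe"
    return "unknown"


def scan_pe(data: bytes) -> bool:
    """Hypothetical executable scanner: flags a made-up payload marker."""
    return b"EVIL_PAYLOAD" in data


def scan_pdf(data: bytes) -> bool:
    """Hypothetical PDF scanner: only looks for embedded JavaScript."""
    return b"/JavaScript" in data


def scan_generic(data: bytes) -> bool:
    """Hypothetical fallback scanner that flags nothing."""
    return False


SCANNERS: Dict[str, Callable[[bytes], bool]] = {
    "pe": scan_pe,
    "pdf": scan_pdf,
    "unknown": scan_generic,
}


def route_and_scan(data: bytes) -> Tuple[str, bool]:
    """Route the sample by detected type, then run that type's scanner."""
    file_type = toy_detect_type(data)
    return file_type, SCANNERS[file_type](data)


# A made-up "malicious executable" and a copy with its first 4 bytes changed.
sample = b"MZ" + b"\x00" * 16 + b"EVIL_PAYLOAD"
perturbed = b"%PDF" + sample[4:]

original_result = route_and_scan(sample)
perturbed_result = route_and_scan(perturbed)
changed_bytes = sum(a != b for a, b in zip(sample, perturbed))
```

In this toy setup the unmodified sample is routed to the executable scanner and flagged, while the perturbed copy is labeled as a PDF and passes the PDF scanner, even though the payload bytes are unchanged.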


MMORE: Massive Multimodal Open RAG & Extraction

Sallinen, Alexandre, Krsteski, Stefan, Teiletche, Paul, Allard, Marc-Antoine, Lecoeur, Baptiste, Zhang, Michael, Nemo, Fabrice, Kalajdzic, David, Meyer, Matthias, Hartley, Mary-Anne

arXiv.org Artificial Intelligence

We introduce MMORE, an open-source pipeline for Massive Multimodal Open Retrieval-Augmented Generation and Extraction, designed to ingest, transform, and retrieve knowledge from heterogeneous document formats at scale. MMORE supports more than fifteen file types, including text, tables, images, emails, audio, and video, and processes them into a unified format to enable downstream applications for LLMs. The architecture offers modular, distributed processing, enabling scalable parallelization across CPUs and GPUs. On processing benchmarks, MMORE demonstrates a 3.8-fold speedup over single-node baselines and 40% higher accuracy than Docling on scanned PDFs. The pipeline integrates hybrid dense-sparse retrieval and supports both interactive APIs and batch RAG endpoints. Evaluated on PubMedQA, MMORE-augmented medical LLMs improve biomedical QA accuracy with increasing retrieval depth. MMORE provides a robust, extensible foundation for deploying task-agnostic RAG systems on diverse, real-world multimodal data. The codebase is available at https://github.com/swiss-ai/mmore.


Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables

Simmering, Paul F., Schulz, Benedikt, Tabino, Oliver, Wittenburg, Georg

arXiv.org Artificial Intelligence

As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also "read" by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.


Study of the influence of a biased database on the prediction of standard algorithms for selecting the best candidate for an interview

Wang, Shuyu, Saillet, Angélique, Gall, Philomène Le, Lacroux, Alain, Martin-Lacroux, Christelle, Brault, Vincent

arXiv.org Artificial Intelligence

Artificial Intelligence (AI) is extensively used across various stages of the recruitment process, from automated candidate sourcing on social media platforms to asynchronous video recruitment methods. A study of Human Resources (HR) professionals representing 500 mid-sized organisations from diverse industries across five countries revealed that 24% of businesses have already implemented AI for recruitment purposes, while 56% of hiring managers plan to adopt it within the next year [Sage, 2020]. AI is employed to augment human decision-making regarding job candidates (such as determining who should receive a job offer) and to support the actions of human decision-makers throughout the process (such as data collection and analysis; Gonzalez, Liu, Shirase, Tomczak, Lobbe, Justenhoven, and Martin [2022]). Some applications incorporating AI algorithms are widely accepted and relatively uncontroversial.


A Proposed Large Language Model-Based Smart Search for Archive System

Nguyen, Ha Dung, Nguyen, Thi-Hoang Anh, Nguyen, Thanh Binh

arXiv.org Artificial Intelligence

This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework enables the processing of natural language queries and the transformation of non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis, and the results show improved search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impacts of individual components. The results show significant improvements over conventional approaches and demonstrate the potential of AI-powered systems to transform modern archival practices.


Combining AI and AM - Improving Approximate Matching through Transformer Networks

Uhlig, Frieder, Struppek, Lukas, Hintersdorf, Dominik, Göbel, Thomas, Baier, Harald, Kersting, Kristian

arXiv.org Artificial Intelligence

Approximate matching (AM) is a concept in digital forensics to determine the similarity between digital artifacts. An important use case of AM is the reliable and efficient detection of case-relevant data structures on a blacklist when only fragments of the original are available. For instance, if only a cluster of indexed malware is still present during the digital forensic investigation, the AM algorithm should be able to assign the fragment to the blacklisted malware. However, traditional AM functions like TLSH and ssdeep fail to detect files based on their fragments if the presented piece is relatively small compared to the overall file size. A second well-known issue with traditional AM algorithms is their lack of scalability in the face of ever-growing lookup databases. We propose an improved matching algorithm based on transformer models from the field of natural language processing. We call our approach Deep Learning Approximate Matching (DLAM). As a concept from artificial intelligence (AI), DLAM learns characteristic blacklisted patterns during its training phase. DLAM is then able to detect these patterns in a typically much larger file; that is, DLAM focuses on the use case of fragment detection. We reveal that DLAM has three key advantages compared to the prominent conventional approaches TLSH and ssdeep. First, it makes obsolete the tedious extraction of known-bad parts, which until now has been necessary before any search for them with AM algorithms. This allows efficient classification of files on a much larger scale, which is important due to the exponentially increasing volume of data to be investigated. Second, depending on the use case, DLAM achieves a similar or even significantly higher accuracy in recovering fragments of blacklisted files. Third, we show that DLAM enables the detection of file correlations in the output of TLSH and ssdeep even for small fragment sizes.
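The fragment-detection use case can be made concrete with a small sketch. This is not DLAM, TLSH, or ssdeep: it swaps in plain byte n-gram set overlap, and the functions `ngrams`, `jaccard`, and `containment`, the n-gram length, and the random test data are all assumptions chosen for illustration. It only shows that a symmetric similarity score shrinks when a fragment is small relative to the file, whereas an asymmetric containment score does not; it makes no claim about how the real algorithms behave internally.

```python
import random


def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of all byte n-grams in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Symmetric similarity: shared n-grams divided by all n-grams."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def containment(fragment: bytes, reference: bytes, n: int = 4) -> float:
    """Asymmetric score: share of the fragment's n-grams found in reference."""
    frag_grams = ngrams(fragment, n)
    if not frag_grams:
        return 0.0
    return len(frag_grams & ngrams(reference, n)) / len(frag_grams)


# Deterministic pseudo-random bytes stand in for a "blacklisted" file.
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))
fragment = reference[1000:1064]  # a 64-byte piece of that file
unrelated = bytes(rng.getrandbits(8) for _ in range(64))

frag_jaccard = jaccard(fragment, reference)
frag_containment = containment(fragment, reference)
unrelated_containment = containment(unrelated, reference)
```

Here the 64-byte fragment is fully contained in the reference, so its containment score is 1.0, while its Jaccard score against the 4096-byte file is small simply because the file holds far more n-grams than the fragment.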


Lightroom adds AI denoise to make old photos look like new

PCWorld

Modern smartphone cameras have become better and better at dealing with low light, and their multi-megapixel image sensors routinely take sharp shots in full sun. But it's likely that you have hundreds of grainy older photos from older cameras--and that's where a new AI feature from Adobe Lightroom may prove useful. Adobe Lightroom Classic, as well as the modern versions for Windows and Mac and Adobe Camera Raw, are adding a Denoise feature. Professional photographers may benefit from how Adobe sees it--taking high ISO shots in low light. Doing so introduces the "speckles" where an image sensor struggles to present the image clearly, due to issues like pixel density, the size of the sensor, and shutter speed.


Adversarial Networks and Machine Learning for File Classification

Germain, Ken St., Angichiodo, Josh

arXiv.org Artificial Intelligence

Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, spreadsheet, etc. In cases where a system owner might desire to keep their files inaccessible or their file type concealed, we propose using an adversarially-trained machine learning neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier, especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is available on GitHub (https://ksaintg.github.io/SGAN-File-Classier).


Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection

Bridges, Robert A., Oesch, Sean, Verma, Miki E., Iannacone, Michael D., Huffer, Kelly M. T., Jewell, Brian, Nichols, Jeff A., Weber, Brian, Beaver, Justin M., Smith, Jared M., Scofield, Daniel, Miles, Craig, Plummer, Thomas, Daniell, Mark, Tall, Anne M.

arXiv.org Artificial Intelligence

Attackers use malicious software, known as malware, to steal sensitive data, damage network infrastructure, and hold information for ransom. One of the top priorities for computer security tools is to detect malware and prevent or minimize its impact on both corporate and personal networks. Traditionally, signature-based methods have been used to detect files previously identified as malicious with near perfect precision, but potentially miss newer malware samples. With the advent of self-modifying malware and the rapid increase in novel threats, signature-based methods are insufficient on their own. By generalizing patterns of known benign/malicious training examples, machine learning (ML) exhibits the capability to quickly and accurately classify novel file samples in many research studies [19]. Moreover, ML-based malware research has made the transition from the subject of myriad research efforts to a current mainstay of commercial-off-the-shelf (COTS) malware detectors. Yet, few practical evaluations of COTS ML-based technologies have been conducted. Turning from the academic literature to market reports from commercial companies can provide (for a fee) useful information, specifically, end-user feedback, itemization of all technologies in the antivirus/endpoint detection and response marketplace [17], and even statistics showing the efficacy of the detectors on malware tests [4, 40].


Translate All: Automating multiple file type batch translation with AWS CloudFormation

#artificialintelligence

This is a guest post by Cyrus Wong, an AWS Machine Learning Hero. You can learn more about and connect with AWS Machine Learning Heroes at the community page. On July 29, 2020, AWS announced that Amazon Translate now supports Microsoft Office documents, including .docx, The world is full of bilingual countries and cities like Hong Kong. I find myself always needing to prepare Office documents and presentation slides in both English and Chinese.