AITopics | file type

file type

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Evaluating the Robustness of a Production Malware Detection System to Transferable Adversarial Attacks

Nasr, Milad, Fratantonio, Yanick, Invernizzi, Luca, Albertini, Ange, Farah, Loua, Petit-Bianco, Alex, Terzis, Andreas, Thomas, Kurt, Bursztein, Elie, Carlini, Nicholas

arXiv.org Artificial Intelligence, Oct-3-2025

As deep learning models become widely deployed as components within larger production systems, their individual shortcomings can create system-level vulnerabilities with real-world impact. This paper studies how adversarial attacks targeting an ML component can degrade or bypass an entire production-grade malware detection system, through a case study of Gmail's pipeline, where file-type identification relies on an ML model. The malware detection pipeline used by Gmail contains a machine learning model that routes each potential malware sample to a specialized malware classifier to improve accuracy and performance. This model, called Magika, has been open sourced. By designing adversarial examples that fool Magika, we can cause the production malware service to route malware to an unsuitable malware detector, thereby increasing our chance of evading detection. Specifically, by changing just 13 bytes of a malware sample, we can evade Magika in 90% of cases, which allows us to send malware files over Gmail. We then turn our attention to defenses and develop an approach to mitigate the severity of these types of attacks. For our defended production model, a highly resourced adversary requires 50 bytes to achieve just a 20% attack success rate. We implement this defense, and, thanks to a collaboration with Google engineers, it has already been deployed in production for the Gmail classifier.
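
The routing design described in this abstract can be illustrated with a toy sketch. Everything in it is hypothetical: `identify_type`, the scanner functions, and the byte markers are illustrative stand-ins, not Magika's API and not Gmail's pipeline. It only shows why a wrong file-type label can send a sample to a scanner that never looks for the relevant pattern.

```python
from typing import Callable, Dict


def identify_type(data: bytes) -> str:
    """Hypothetical stand-in for a learned file-type model (not Magika)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"MZ"):
        return "pe"
    return "unknown"


def scan_pdf(data: bytes) -> bool:
    """Toy PDF scanner: flags embedded JavaScript actions."""
    return b"/JavaScript" in data


def scan_pe(data: bytes) -> bool:
    """Toy executable scanner: flags a made-up marker string."""
    return b"EVIL_PAYLOAD" in data


def scan_generic(data: bytes) -> bool:
    """Fallback scanner with no type-specific knowledge."""
    return False


# Each file type is routed to its own specialized scanner.
SCANNERS: Dict[str, Callable[[bytes], bool]] = {"pdf": scan_pdf, "pe": scan_pe}


def is_flagged(data: bytes,
               type_model: Callable[[bytes], str] = identify_type) -> bool:
    """Route the sample by predicted type, then run the chosen scanner."""
    scanner = SCANNERS.get(type_model(data), scan_generic)
    return scanner(data)


sample = b"MZ\x90\x00" + b"EVIL_PAYLOAD"

# Correct routing: the executable scanner sees the marker.
correct = is_flagged(sample)

# Simulated misclassification: the same bytes are routed to the PDF scanner,
# which does not look for the executable marker.
misrouted = is_flagged(sample, type_model=lambda _data: "pdf")
```

In this sketch `correct` is `True` and `misrouted` is `False`, even though the sample bytes are identical in both calls; only the routing decision changed.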

adversarial example, artificial intelligence, machine learning, (16 more...)

2510.01676

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MMORE: Massive Multimodal Open RAG & Extraction

Sallinen, Alexandre, Krsteski, Stefan, Teiletche, Paul, Allard, Marc-Antoine, Lecoeur, Baptiste, Zhang, Michael, Nemo, Fabrice, Kalajdzic, David, Meyer, Matthias, Hartley, Mary-Anne

arXiv.org Artificial Intelligence, Sep-16-2025

We introduce MMORE, an open-source pipeline for Massive Multimodal Open Retrieval-Augmented Generation and Extraction, designed to ingest, transform, and retrieve knowledge from heterogeneous document formats at scale. MMORE supports more than fifteen file types, including text, tables, images, emails, audio, and video, and processes them into a unified format to enable downstream applications for LLMs. The architecture offers modular, distributed processing, enabling scalable parallelization across CPUs and GPUs. On processing benchmarks, MMORE demonstrates a 3.8-fold speedup over single-node baselines and 40% higher accuracy than Docling on scanned PDFs. The pipeline integrates hybrid dense-sparse retrieval and supports both interactive APIs and batch RAG endpoints. Evaluated on PubMedQA, MMORE-augmented medical LLMs improve biomedical QA accuracy with increasing retrieval depth. MMORE provides a robust, extensible foundation for deploying task-agnostic RAG systems on diverse, real-world multimodal data. The codebase is available at https://github.com/swiss-ai/mmore.
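
The abstract does not say how MMORE combines its dense and sparse results. As a rough illustration of what hybrid dense-sparse retrieval can look like, the sketch below uses reciprocal rank fusion, a common generic technique that is assumed here and is not taken from the MMORE codebase. The function name, the constant `k = 60`, and the document IDs are all illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are returned by descending total score, ties broken by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense (embedding) and a sparse (keyword) retriever.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense, sparse])
```

Rank-based fusion avoids having to calibrate embedding similarities against keyword scores, which live on different scales; documents ranked well by both retrievers rise to the top.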

large language model, machine learning, natural language, (16 more...)

2509.11937

Country: Europe (0.28)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables

Simmering, Paul F., Schulz, Benedikt, Tabino, Oliver, Wittenburg, Georg

arXiv.org Artificial Intelligence, Aug-25-2025

As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also "read" by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.

large language model, layout element, machine learning, (22 more...)

2508.15817

Genre: Research Report > New Finding (0.48)

Industry: Marketing (0.62)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Study of the influence of a biased database on the prediction of standard algorithms for selecting the best candidate for an interview

Wang, Shuyu, Saillet, Angélique, Le Gall, Philomène, Lacroux, Alain, Martin-Lacroux, Christelle, Brault, Vincent

arXiv.org Artificial Intelligence, May-6-2025

Artificial Intelligence (AI) is extensively used across various stages of the recruitment process, from automated candidate sourcing on social media platforms to asynchronous video recruitment methods. A study of Human Resources (HR) professionals representing 500 mid-sized organisations from diverse industries across five countries revealed that 24% of businesses have already implemented AI for recruitment purposes, while 56% of hiring managers plan to adopt it within the next year [Sage, 2020]. AI is employed to augment human decision-making regarding job candidates (such as determining who should receive a job offer) and to support the actions of human decision-makers throughout the process (such as data collection and analysis; Gonzalez, Liu, Shirase, Tomczak, Lobbe, Justenhoven, and Martin [2022]). Some applications incorporating AI algorithms are widely accepted and relatively uncontroversial.

artificial intelligence, classification, machine learning, (16 more...)

2505.02609

Country:

Europe > France (0.29)
Europe > Austria (0.28)

Genre: Research Report (1.00)

Industry: Law > Civil Rights & Constitutional Law (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.48)

A Proposed Large Language Model-Based Smart Search for Archive System

Nguyen, Ha Dung, Nguyen, Thi-Hoang Anh, Nguyen, Thanh Binh

arXiv.org Artificial Intelligence, Jan-12-2025

This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework enables the processing of natural language queries and the transformation of non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis, which together improved search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impacts of individual components. The results show significant improvements over conventional approaches and demonstrate the potential of AI-powered systems to transform modern archival practices.

large language model, machine learning, natural language, (17 more...)

2501.07024

Country:

North America (0.28)
Asia > Vietnam (0.14)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.35)

Combining AI and AM - Improving Approximate Matching through Transformer Networks

Uhlig, Frieder, Struppek, Lukas, Hintersdorf, Dominik, Göbel, Thomas, Baier, Harald, Kersting, Kristian

arXiv.org Artificial Intelligence, Apr-27-2023

Approximate matching (AM) is a concept in digital forensics to determine the similarity between digital artifacts. An important use case of AM is the reliable and efficient detection of case-relevant data structures on a blacklist when only fragments of the original are available. For instance, if only a cluster of indexed malware is still present during the digital forensic investigation, the AM algorithm should be able to assign the fragment to the blacklisted malware. However, traditional AM functions like TLSH and ssdeep fail to detect files based on their fragments if the presented piece is relatively small compared to the overall file size. A second well-known issue with traditional AM algorithms is the lack of scaling due to the ever-increasing lookup databases. We propose an improved matching algorithm based on transformer models from the field of natural language processing. We call our approach Deep Learning Approximate Matching (DLAM). As a concept from artificial intelligence (AI), DLAM learns characteristic blacklisted patterns during its training phase. DLAM is then able to detect the patterns in a typically much larger file; that is, DLAM focuses on the use case of fragment detection. We reveal that DLAM has three key advantages compared to the prominent conventional approaches TLSH and ssdeep. First, it makes obsolete the tedious extraction of known-to-be-bad parts, which until now has been necessary before any search for them with AM algorithms. This allows efficient classification of files on a much larger scale, which is important due to the exponentially increasing amount of data to be investigated. Second, depending on the use case, DLAM achieves a similar or even significantly higher accuracy in recovering fragments of blacklisted files. Third, we show that DLAM enables the detection of file correlations in the output of TLSH and ssdeep even for small fragment sizes.

artificial intelligence, machine learning, natural language, (19 more...)

2208.11367

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Lightroom adds AI denoise to make old photos look like new

PCWorld, Apr-18-2023, 16:42:08 GMT

Modern smartphone cameras have become better and better at dealing with low light, and their multi-megapixel image sensors routinely take sharp shots in full sun. But it's likely that you have hundreds of grainy older photos from older cameras--and that's where a new AI feature from Adobe Lightroom may prove useful. Adobe Lightroom Classic, as well as the modern versions for Windows and Mac and Adobe Camera Raw, are adding a Denoise feature. Professional photographers may benefit from how Adobe sees it--taking high ISO shots in low light. Doing so introduces the "speckles" where an image sensor struggles to present the image clearly, due to issues like pixel density, the size of the sensor, and shutter speed.

adobe, lightroom, make old photo look, (10 more...)

Industry: Media > Photography (0.97)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Communications > Mobile (0.37)

Adversarial Networks and Machine Learning for File Classification

St. Germain, Ken, Angichiodo, Josh

arXiv.org Artificial Intelligence, Feb-2-2023

Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, or spreadsheet. For cases where a system owner might want to keep their files inaccessible or the file type concealed, we propose using an adversarially-trained neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier, especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is available on GitHub (https://ksaintg.github.io/SGAN-File-Classier).
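
For context on the problem this abstract addresses, the sketch below shows a simple signature-based baseline, which is not the authors' SGAN. It identifies a file from its leading "magic" bytes rather than its extension, and it shows that this baseline fails once the header is overwritten, which is the situation a learned classifier is meant to handle. The function name `sniff_type` and the small signature table are illustrative choices.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Return a type name if the data starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# A PNG-like payload: the real 8-byte PNG signature plus placeholder bytes.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# Renaming the file does not matter, because only the content is inspected.
renamed_guess = sniff_type(png_bytes)

# Overwriting the 8-byte header defeats signature matching entirely.
obfuscated = b"\x00" * 8 + png_bytes[8:]
obfuscated_guess = sniff_type(obfuscated)
```

Here `renamed_guess` is `"png"` while `obfuscated_guess` is `None`: header signatures are robust to a changed extension but not to a tampered header.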

artificial intelligence, file type, machine learning, (16 more...)

2301.11964

Country:

North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland > Anne Arundel County > Annapolis (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.93)
Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection

Bridges, Robert A., Oesch, Sean, Verma, Miki E., Iannacone, Michael D., Huffer, Kelly M. T., Jewell, Brian, Nichols, Jeff A., Weber, Brian, Beaver, Justin M., Smith, Jared M., Scofield, Daniel, Miles, Craig, Plummer, Thomas, Daniell, Mark, Tall, Anne M.

arXiv.org Artificial Intelligence, Aug-17-2022

Attackers use malicious software, known as malware, to steal sensitive data, damage network infrastructure, and hold information for ransom. One of the top priorities for computer security tools is to detect malware and prevent or minimize its impact on both corporate and personal networks. Traditionally, signature-based methods have been used to detect files previously identified as malicious with near perfect precision, but potentially miss newer malware samples. With the advent of self-modifying malware and the rapid increase in novel threats, signature-based methods are insufficient on their own. By generalizing patterns of known benign/malicious training examples, machine learning (ML) exhibits the capability to quickly and accurately classify novel file samples in many research studies [19]. Moreover, ML-based malware research has made the transition from the subject of myriad research efforts to a current mainstay of commercial-off-the-shelf (COTS) malware detectors. Yet, few practical evaluations of COTS ML-based technologies have been conducted. Turning from the academic literature to market reports from commercial companies can provide (for a fee) useful information, specifically, end-user feedback, itemization of all technologies in the antivirus/endpoint detection and response marketplace [17], and even statistics showing the efficacy of the detectors on malware tests [4, 40].

artificial intelligence, detector, machine learning, (19 more...)

doi: 10.1145/3567432

2012.09214

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.14)
North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
North America > United States > Colorado (0.04)

Genre:

Research Report > New Finding (1.00)
Overview (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.68)

The Best Data Collection Tools for Machine Learning | Lionbridge AI

#artificialintelligence, Feb-9-2020, 23:39:17 GMT

Data collection is the single most important step in solving any machine learning problem. As such, teams that dive head first into projects without considering the right data collection process often don't get the results they want. Fortunately, there are many data collection tools to help prepare training datasets quickly and at scale. The best data collection tools are easy to use, support a range of functionalities and file types, and preserve the overall integrity of data. In this article, we outline the best data collection tools for machine learning projects.

data collection tool, machine learning lionbridge ai, platform, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
