Goto

Collaborating Authors

 ponse


Towards a rigorous evaluation of RAG systems: the challenge of due diligence

Martinon, Grégoire, de Brionne, Alexandra Lorenzo, Bohard, Jérôme, Lojou, Antoine, Hervault, Damien, Brunel, Nicolas J-B.

arXiv.org Artificial Intelligence

The rise of generative AI, has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, combining language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used in due diligence for an investment fund. We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures, like hallucinations, off-topic, failed citations, and abstentions. Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees. We provide a comprehensive dataset for further analysis. Our contributions aim to enhance the reliability and scalability of RAG systems evaluation protocols in industrial applications.


O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine

Rousseau, Ismaël, Perroux, Claire, Adam, Pierre, Girault, Thomas, Delphin-Poulat, Lionel, Veyret, Morgan, Lecorvé, Gwénolé, Damnati, Géraldine

arXiv.org Artificial Intelligence

This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the \texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. -- Ce document présente les travaux réalisés par l'équipe O_FT conjointe à Orange et Ouest-France sur l'adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l'adaptation du modèle \texttt{Mistral-7B-Instruct-v0.3} avec des techniques classiques de poursuite du pré-entraînement et d'affinage sur instructions. L'essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l'évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures de connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux démontrent ainsi la viabilité de l'adaptation à un domaine de modèles relativement petits.


\'Evaluation des capacit\'es de r\'eponse de larges mod\`eles de langage (LLM) pour des questions d'historiens

Chartier, Mathieu, Dakkoune, Nabil, Bourgeois, Guillaume, Jean, Stéphane

arXiv.org Artificial Intelligence

Large Language Models (LLMs) like ChatGPT or Bard have revolutionized information retrieval and captivated the audience with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.


Modelisation de l'incertitude et de l'imprecision de donnees de crowdsourcing : MONITOR

Thierry, Constance, Dubois, Jean-Christophe, Gall, Yolande Le, Martin, Arnaud

arXiv.org Artificial Intelligence

Crowdsourcing is defined as the outsourcing of tasks to a crowd of contributors. The crowd is very diverse on these platforms and includes malicious contributors attracted by the remuneration of tasks and not conscientiously performing them. It is essential to identify these contributors in order to avoid considering their responses. As not all contributors have the same aptitude for a task, it seems appropriate to give weight to their answers according to their qualifications. This paper, published at the ICTAI 2019 conference, proposes a method, MONITOR, for estimating the profile of the contributor and aggregating the responses using belief function theory.


Contributors profile modelization in crowdsourcing platforms

Thierry, Constance, Dubois, Jean-Christophe, Gall, Yolande Le, Martin, Arnaud

arXiv.org Artificial Intelligence

The crowdsourcing consists in the externalisation of tasks to a crowd of people remunerated to execute this ones. The crowd, usually diversified, can include users without qualification and/or motivation for the tasks. In this paper we will introduce a new method of user expertise modelization in the crowdsourcing platforms based on the theory of belief functions in order to identify serious and qualificated users.


Outrepasser les limites des techniques classiques de Prise d'Empreintes grace aux Reseaux de Neurones

Burroni, Javier, Sarraute, Carlos

arXiv.org Artificial Intelligence

We present an application of Artificial Intelligence techniques to the field of Information Security. The problem of remote Operating System (OS) Detection, also called OS Fingerprinting, is a crucial step of the penetration testing process, since the attacker (hacker or security professional) needs to know the OS of the target host in order to choose the exploits that he will use. OS Detection is accomplished by passively sniffing network packets and actively sending test packets to the target host, to study specific variations in the host responses revealing information about its operating system. The first fingerprinting implementations were based on the analysis of differences between TCP/IP stack implementations. The next generation focused the analysis on application layer data such as the DCE RPC endpoint information. Even though more information was analyzed, some variation of the "best fit" algorithm was still used to interpret this new information. Our new approach involves an analysis of the composition of the information collected during the OS identification process to identify key elements and their relations. To implement this approach, we have developed tools using Neural Networks and techniques from the field of Statistics. These tools have been successfully integrated in a commercial software (Core Impact).