EnterpriseEM: Fine-tuned Embeddings for Enterprise Semantic Search
Kamalkumar Rathinasamy, Jayarama Nettar, Amit Kumar, Vishal Manchanda, Arun Vijayakumar, Ayush Kataria, Venkateshprasanna Manjunath, Chidambaram GS, Jaskirat Singh Sodhi, Shoeb Shaikh, Wasim Akhtar Khan, Prashant Singh, Tanishq Dattatray Ige, Vipin Tiwari, Rajab Ali Mondal, Harshini K, S Reka, Chetana Amancharla, Faiz ur Rahman, Harikrishnan P A, Indraneel Saha, Bhavya Tiwary, Navin Shankar Patel, Pradeep T S, Balaji A J, Priyapravas, Mohammed Rafee Tarafdar
In the context of enterprises accumulating proprietary unstructured data, AI-driven information retrieval solutions have emerged as vital tools for extracting relevant answers to employee queries. Traditional approaches to building such solutions typically involve choosing between Retrieval Augmented Generation (RAG) and fine-tuned Large Language Models (LLMs). However, fine-tuned LLMs, comprising only generative models, offer no guarantee of factual accuracy, whereas RAG, which pairs an embedding model with a generative model, assures factual precision (Lewis et al., 2020 [1]). Despite their generally superior performance, RAG-based solutions often rely on pre-trained models, potentially leading to suboptimal alignment with enterprise-specific data. Addressing this challenge entails exploring two potential avenues. First, recent studies such as RAFT (Zhang et al., 2024 [2]) integrate fine-tuned generative models within a RAG pipeline to enhance accuracy, albeit requiring substantial domain-specific data to fine-tune the generative models. Alternatively, leveraging domain-specific embedding models within a RAG pipeline to enhance accuracy remains an underexplored area. Earlier efforts, such as BioBERT (Lee et al., 2019 [3]), SciBERT (Beltagy et al., 2019 [4]), and LEGAL-BERT (Chalkidis et al., 2020 [5]), have demonstrated the efficacy of domain-specific embeddings in information retrieval tasks. These endeavors primarily investigated two methodologies: (a) extending the pre-training of BERT and (b) pre-training BERT from scratch, both employing domain-specific corpora. Despite yielding commendable results, these methodologies required substantial domain-specific corpora, with figures as large as 21.3B words for BioBERT, 3.17B tokens for SciBERT, and 11.5GB of text data for LEGAL-BERT, thereby posing significant challenges, particularly in low-resource domains such as enterprises.
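To make the retrieval step of such a RAG pipeline concrete, the sketch below shows how a domain-specific embedding model could serve as the retriever that grounds a downstream generative model. It is a minimal illustration, assuming a sentence-transformers-compatible model; the model path, documents, and query are hypothetical placeholders, not artifacts released with this work.

```python
# Minimal sketch of embedding-based retrieval for a RAG pipeline.
# "path/to/enterprise-finetuned-embedder" is a placeholder for a
# domain-specific (fine-tuned) embedding model, not a real artifact.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("path/to/enterprise-finetuned-embedder")

# Toy enterprise corpus; in practice these would be chunks of proprietary documents.
documents = [
    "Employees may carry forward up to 10 unused leave days per year.",
    "VPN access requires a hardware token issued by corporate IT.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

# Embed the employee query and retrieve the closest passage by cosine similarity.
query = "How many leave days can I carry forward?"
query_embedding = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=1)

# In a full RAG pipeline, this passage would be passed to a generative
# model as grounding context for the final answer.
best_passage = documents[hits[0][0]["corpus_id"]]
print(best_passage)
```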
arXiv.org Artificial Intelligence
May-18-2024
- Country:
- Asia > China (0.14)
- North America > United States (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (0.58)
- Law (0.87)