
Collaborating Authors

 Mondal, Rajab Ali


Narrow Transformer: Starcoder-Based Java-LM For Desktop

arXiv.org Artificial Intelligence

State-of-the-art code models, capable of understanding and generating code in numerous programming languages, are revolutionizing the way enterprises approach software development and offer a significant boost in productivity. However, the one-size-fits-all approach of these generic multilingual code models often falls short of meeting the nuanced requirements of project-level coding tasks in an enterprise, which tend to be language-specific. This has led to the development of Narrow Transformers (NTs): specialized models further trained on a particular programming language, offering a more efficient solution for enterprises. NTs are designed to optimize performance for a specific programming language, balancing the trade-offs among model size, inferencing cost, and operational throughput. As demand for tailored solutions grows, we can expect a surge in NT development, providing the precision and efficiency required by enterprise projects. In practice, however, the substantial economic cost of training and fine-tuning large code models renders language-model experiments prohibitively expensive for most researchers and organizations.
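For a concrete sense of what "further training on a particular programming language" looks like in practice, the sketch below shows continued pretraining of a StarCoder-family base checkpoint on a Java-only corpus with Hugging Face transformers. The checkpoint name, the corpus file, and all hyperparameters are illustrative assumptions and do not reproduce the paper's training recipe.

    # Hedged sketch: "narrowing" a multilingual code LM by continued pretraining on Java only.
    # Model name, dataset file, and hyperparameters are assumptions, not the paper's setup.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    BASE_MODEL = "bigcode/starcoderbase-1b"   # assumed base checkpoint (gated on the Hub)
    JAVA_CORPUS = "java_corpus.jsonl"         # hypothetical JSONL with one Java file per line in a "content" field

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # needed so the collator can pad batches
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    raw = load_dataset("json", data_files=JAVA_CORPUS, split="train")

    def tokenize(batch):
        # Truncate each Java file to a fixed context length.
        return tokenizer(batch["content"], truncation=True, max_length=2048)

    tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

    # Causal-LM collator: copies input_ids into labels (no masked-LM objective).
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="nt-java-sketch",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,           # assumes Ampere-class or newer hardware
        logging_steps=50,
        save_steps=500,
    )

    Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()

The objective here is ordinary next-token prediction; the "narrowing" comes entirely from restricting the data mix to a single language rather than from any change to the architecture.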


EnterpriseEM: Fine-tuned Embeddings for Enterprise Semantic Search

arXiv.org Artificial Intelligence

In the context of enterprises accumulating proprietary unstructured data, AI-driven information retrieval solutions have emerged as vital tools for extracting relevant answers to employee queries. Traditional methods for developing such solutions often involve choosing between Retrieval Augmented Generation (RAG) and fine-tuned Large Language Models (LLMs). However, fine-tuned LLMs, comprising only generative models, lack a guarantee of factual accuracy, while RAG, comprising an embedding model and a generative model, assures factual precision (Lewis et al., 2020 [1]). Despite their superior performance in general, RAG-based solutions often rely on pre-trained models, potentially leading to suboptimal alignment with enterprise-specific data. Addressing this challenge entails exploring two potential avenues. First, recent studies such as RAFT (Zhang et al., 2024 [2]) integrate fine-tuned generative models within a RAG pipeline to enhance accuracy, albeit requiring substantial domain-specific data to fine-tune the generative models. Alternatively, leveraging domain-specific embedding models within a RAG pipeline remains an underexplored area. Earlier efforts, such as BioBERT (Lee et al., 2019 [3]), SciBERT (Beltagy et al., 2019 [4]), and LEGAL-BERT (Chalkidis et al., 2020 [5]), have effectively demonstrated the efficacy of domain-specific embeddings in information retrieval tasks. These endeavors primarily investigated two methodologies: (a) extending the pre-training of BERT and (b) pre-training BERT from scratch, both employing domain-specific corpora. Despite yielding commendable results, these methodologies necessitated substantial domain-specific corpora, with figures as staggering as 21.3B words for BioBERT, 3.17B tokens for SciBERT, and 11.5GB of text data for LEGAL-BERT, thereby posing significant challenges, particularly in low-resource domains like enterprises.
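To illustrate the avenue the abstract calls underexplored, fine-tuning the embedding model rather than the generator, the sketch below adapts a pre-trained embedding model on hypothetical enterprise (query, passage) pairs with sentence-transformers and then uses it for semantic search. The base model name, the example pairs, and the hyperparameters are assumptions for illustration, not EnterpriseEM's actual training setup.

    # Hedged sketch: adapting a pre-trained embedding model to in-domain enterprise data,
    # then retrieving with it (the generative step of a RAG pipeline would follow).
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed base model

    # Hypothetical in-domain pairs mined from enterprise documents or query logs.
    pairs = [
        ("How do I reset my VPN token?", "To reset a VPN token, open the IT portal and raise a reset request."),
        ("What is the travel reimbursement limit?", "Employees may claim travel expenses up to the policy limit per trip."),
    ]
    train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
    loader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # In-batch negatives: every other passage in the batch serves as a negative example.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)

    # Semantic search over an enterprise corpus with the adapted embeddings.
    corpus = [passage for _, passage in pairs]
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode("vpn token reset steps", convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)
    print(corpus[int(scores.argmax())])

MultipleNegativesRankingLoss only needs positive (query, passage) pairs and treats the rest of the batch as negatives, which keeps the labeling burden low, a practical consideration in the low-resource enterprise setting the abstract describes.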