Evaluation Methodology for Large Language Models for Multilingual Document Question and Answer
Kahana, Adar, Mathew, Jaya Susan, Bleik, Said, Reynolds, Jeremy, Elisha, Oren
–arXiv.org Artificial Intelligence
With the publication of the paper 'Attention Is All You Need' [1], the transformer architecture and its attention mechanism paved the way for a plethora of Large Language Models (LLMs). More recently, with the launch of ChatGPT (Chat Generative Pre-trained Transformer) [2], there has been growing interest among the general public as well as large businesses in using these LLMs to improve efficiency [3] in common scenarios such as summarizing a document, answering a question, solving a mathematics problem, and even writing code. The majority of these LLMs are pre-trained predominantly on datasets in English and some high-resource languages [4] [5]; hence they tend to perform best in English and these high-resource languages, but their performance tends to degrade in other, especially low-resource, languages such as some of the languages spoken in Asia and Africa [6]. However, these high-resource languages do not necessarily account for the majority of the global population. To enable widespread adoption of these LLMs around the world, we need to ensure that these models can support multiple languages beyond the populations who understand and can converse in English or these high-resource languages [7] [8]. In addition, businesses and organizations are looking to use these models on a global scale to cater to their consumers around the world in the language of their choice [9].
To address this issue and enhance language support for these LLMs, there is ongoing research on whether the underlying model needs to be trained from scratch using multilingual data, whether fine-tuning an existing model with sample multilingual data will suffice, whether some simple yet effective prompt engineering techniques are sufficient, or whether we need to translate documents into a high-resource language to enable multilingual support [10] [11] [12] [13] [14] [15] [16]. There are parallel ongoing efforts to collect and label data in multiple languages, including low-resource languages, to improve the training corpus. Evaluating multilingual model performance is also an area of active research, since most of the popular model performance benchmarks are predominantly for the English language [17] [18].

Figure 1: Admin uploading files for a Question-Answering module that can be translated either to or from English.
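The last strategy mentioned above, translating into a high-resource pivot language before answering, can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: `translate` and `answer` are hypothetical placeholders (here backed by a toy lookup table and a first-sentence heuristic so the sketch runs end to end), standing in for a real machine-translation system and an LLM.

```python
# Sketch of the "translate to a pivot language" strategy for multilingual
# document QA. All components are toy stand-ins, not real model calls.

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder translator: a real system would call an MT model here."""
    toy = {("fr", "en"): {"Quelle est la capitale?": "What is the capital?"}}
    # Unknown language pairs or sentences pass through unchanged.
    return toy.get((src, tgt), {}).get(text, text)

def answer(question: str, document: str) -> str:
    """Placeholder QA step: a real system would prompt an LLM here."""
    return document.split(".")[0]  # trivially return the first sentence

def multilingual_qa(question: str, document: str,
                    lang: str, pivot: str = "en") -> str:
    # 1) Translate the user's question into the high-resource pivot language.
    q_pivot = translate(question, src=lang, tgt=pivot)
    # 2) Answer over the document in the pivot language.
    a_pivot = answer(q_pivot, document)
    # 3) Translate the answer back into the user's language.
    return translate(a_pivot, src=pivot, tgt=lang)
```

With real MT and LLM backends in place of the stubs, the same three-step structure lets an English-centric model serve questions in any language the translator covers.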
Feb-1-2024