Dousti, Mohammad Javad
$D^2LoRA$: Data-Driven LoRA Initialization for Low Resource Tasks
Seraj, Javad, Mohajeri, Mohammad Mahdi, Dousti, Mohammad Javad
Tuning large language models is essential for optimizing their performance across diverse applications, particularly in scenarios with limited data availability. This is especially important given that the convergence speed of the LoRA method is lower than that of full fine-tuning. In this paper, we present an analysis of post-training methods, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO), within the context of task-specific learning using the LoRA method. We then introduce $D^2LoRA$, a data-driven approach for initializing LoRA matrices that enhances training efficiency, especially in limited-data settings. Our experiments compare $D^2LoRA$ with vanilla LoRA in terms of performance and catastrophic forgetting under extremely data-constrained conditions. The results demonstrate that $D^2LoRA$ achieves a 1% improvement on the GSM8K benchmark and a 2-point improvement in ROUGE score on title generation tasks. $D^2LoRA$ facilitates the adaptation of LLMs to multiple tasks even when task-specific data is scarce, thereby reducing both training and data costs.
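As a rough illustration of what a data-driven warm start for LoRA could look like (the abstract does not spell out the $D^2LoRA$ procedure, so the SVD-based initialization below is an illustrative assumption, not the paper's method), here is a minimal PyTorch sketch contrasting it with vanilla LoRA initialization:

```python
# Minimal sketch: vanilla LoRA initialization vs. a hypothetical data-driven one.
# The SVD-based warm start is an assumption for illustration only.
import torch

d_out, d_in, r = 512, 512, 8
W = torch.randn(d_out, d_in) * 0.02           # frozen pre-trained weight

# Vanilla LoRA: A is Gaussian, B is zero, so the initial update B @ A is zero.
A_vanilla = torch.randn(r, d_in) * 0.01
B_vanilla = torch.zeros(d_out, r)

# Hypothetical data-driven init: derive a rank-r warm start from a matrix that
# depends on task data (M is a stand-in for such a data-dependent statistic).
M = torch.randn(d_out, d_in)                  # placeholder, not real task statistics
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
B_data = U[:, :r] * S[:r].sqrt()              # (d_out, r)
A_data = S[:r].sqrt().unsqueeze(1) * Vh[:r]   # (r, d_in)

def lora_forward(x, W, A, B, alpha=16, r=8):
    """y = x W^T + (alpha / r) * x A^T B^T  -- standard LoRA forward pass."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = torch.randn(4, d_in)
print(lora_forward(x, W, A_vanilla, B_vanilla).shape)  # torch.Size([4, 512])
print(lora_forward(x, W, A_data, B_data).shape)
```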
Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning
Ghanizadeh, Mohammad Amin, Dousti, Mohammad Javad
In this work, we explain the approach we employed in the BabyLM Challenge, which uses various methods of training language models (LMs) with significantly less data than traditional large language models (LLMs), inspired by how human children learn. While human children are exposed to far less linguistic input than an LLM, they still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. The initial 10M-word dataset of the 2024 BabyLM Challenge is filtered to 8.5M words and then supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that, similar to children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. Using curriculum learning, our model matches the baseline on certain benchmarks while surpassing it on others. Additionally, we find that incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.
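A minimal sketch of how the reduced 32,000-token vocabulary could be produced with the Hugging Face `tokenizers` library follows; the tokenizer type, special tokens, and the tiny in-memory corpus are assumptions standing in for the filtered BabyLM transcripts and TV dialogues:

```python
# Minimal sketch: training a 32,000-token BPE tokenizer on a small corpus.
# The corpus lines and special tokens are placeholders, not the paper's exact setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                      # matches the reduced vocabulary described above
    special_tokens=["<unk>", "<pad>", "<s>", "</s>"],
)

# Placeholder corpus; real training would stream the filtered child-directed
# transcripts plus the TV dialogue subset.
corpus = [
    "look at the big red ball .",
    "do you want some more juice ?",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("child_inspired_tokenizer.json")
```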
Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation
Seraj, Javad, Mohajeri, Mohammad Mahdi, Dousti, Mohammad Javad, Ahmadabadi, Majid Nili
Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, they often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited-data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique for selecting a more effective sample from limited data in order to align an open LLM with human preference. Our approach achieves approximately a 7% improvement in Pearson correlation with a reference judge over the baseline and a 30% improvement over the base model (Llama3.1-8B-Instruct) on the mathematical reasoning evaluation task, demonstrating that selecting more effective preference data enables our approach to surpass baseline methods.
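The alignment signal reported above is a Pearson correlation between the tuned judge's scores and a reference judge's scores; a minimal sketch of that measurement is shown below, with made-up placeholder scores rather than the paper's data:

```python
# Minimal sketch of the evaluation signal: Pearson correlation between the aligned
# open-LLM judge's scores and a reference judge's scores. Scores are placeholders.
from scipy.stats import pearsonr

reference_scores = [4.0, 2.5, 5.0, 1.0, 3.5, 4.5]   # reference judge
model_scores     = [3.8, 3.0, 4.7, 1.5, 3.2, 4.9]   # open LLM judge after alignment

r, p_value = pearsonr(reference_scores, model_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```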
CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt
Mohajeri, Mohammad Mahdi, Dousti, Mohammad Javad, Ahmadabadi, Majid Nili
Text classification is a fundamental task in natural language processing (NLP), and large language models (LLMs) have demonstrated their capability to perform this task across various domains. However, the performance of LLMs heavily depends on the quality of their input prompts. Recent studies have also shown that LLMs exhibit remarkable results in code-related tasks. To leverage the capabilities of LLMs in text classification, we propose the Code Completion Prompt (CoCoP) method, which transforms the text classification problem into a code completion task. CoCoP significantly improves text classification performance across diverse datasets by utilizing LLMs' code-completion capability. For instance, CoCoP improves accuracy on the SST2 dataset by more than 20%. Moreover, when CoCoP is integrated with LLMs specifically designed for code-related tasks (code models), such as CodeLLaMA, it demonstrates better or comparable performance to few-shot learning techniques while using only one-tenth of the model size. The source code of our proposed method will be made publicly available upon the acceptance of the paper.
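To make the "classification as code completion" idea concrete, here is a hedged Python sketch of what such a prompt might look like for SST2; the template and label wording are assumptions, not the exact prompt used in the paper:

```python
# Illustrative sketch of casting SST2 sentiment classification as a code-completion
# prompt, in the spirit of CoCoP. The template below is an assumed example.
def build_cocop_prompt(review: str) -> str:
    return f'''\
# Task: binary sentiment classification (SST2).
# Complete the value of `label` with either "positive" or "negative".
review = "{review}"
label = "'''

prompt = build_cocop_prompt("A heartfelt and beautifully shot film.")
print(prompt)
# A code model (e.g., CodeLLaMA) is then asked to complete the string literal,
# and the generated token is mapped back to the class label.
```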
CULL-MT: Compression Using Language and Layer pruning for Machine Translation
Rostami, Pedram, Dousti, Mohammad Javad
Multilingual machine translation models often outperform traditional bilingual models by leveraging translation knowledge transfer. Recent advancements have led to these models supporting hundreds of languages and achieving state-of-the-art results across various translation directions. However, as these models grow larger, their inference operations become increasingly costly. In many use cases, there is no need to support such a wide range of language pairs, as translation is typically needed in only a few selected directions. In this paper, we present CULL-MT, a compression method for machine translation models based on structural layer pruning and selected language directions. Our approach identifies and prunes unimportant layers using a greedy strategy, then mitigates the impact by applying knowledge distillation from the original model along with parameter-efficient fine-tuning. We apply CULL-MT to the NLLB-3.3B and LLaMA3.1-8B-Instruct models. In a multi-way translation scenario (Persian, French, and German to English), we find the NLLB-3.3B model to be robust, allowing 25% of layers to be pruned with only a 0.9 spBLEU drop. However, LLaMA3.1-8B-Instruct is more sensitive, with a 2.0 spBLEU drop after pruning 5 layers.
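The greedy pruning loop described above can be sketched as follows; the `evaluate` function is a stub standing in for spBLEU evaluation of the pruned model on the selected translation directions, and the layer count and budget are placeholders:

```python
# Minimal sketch of greedy structural layer pruning: repeatedly drop the layer
# whose removal hurts a validation score the least. `evaluate` is a placeholder
# for translating a dev set with only `kept_layers` active and returning spBLEU.
import random

def evaluate(kept_layers):
    random.seed(len(kept_layers) * 7919 + sum(kept_layers))   # deterministic stub
    return 30.0 - 0.1 * (24 - len(kept_layers)) + random.uniform(-0.5, 0.5)

layers = list(range(24))          # indices of prunable layers (placeholder depth)
num_to_prune = 6                  # e.g., roughly 25% of the layers

for _ in range(num_to_prune):
    best_layer, best_score = None, float("-inf")
    for layer in layers:
        candidate = [l for l in layers if l != layer]
        score = evaluate(candidate)
        if score > best_score:
            best_layer, best_score = layer, score
    layers.remove(best_layer)
    print(f"pruned layer {best_layer}, score after pruning: {best_score:.2f}")

# The pruned model would then be healed with knowledge distillation from the
# original model plus parameter-efficient fine-tuning, as described above.
```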
PersianMind: A Cross-Lingual Persian-English Large Language Model
Rostami, Pedram, Salemi, Ali, Dousti, Mohammad Javad
Large language models demonstrate remarkable proficiency in various linguistic tasks and have extensive knowledge across various domains. Although they perform best in English, their ability in other languages is notable too. In contrast, open-source models, such as LLaMA, are primarily trained on English datasets, resulting in poor performance in non-English languages. In this paper, we introduce PersianMind, an open-source bilingual large language model that demonstrates comparable performance to the closed-source GPT-3.5-turbo in the Persian language. By expanding LLaMA2's vocabulary with 10,000 Persian tokens and training it on a dataset comprising nearly 2 billion Persian tokens, we show that our approach preserves the model's English knowledge and employs transfer learning to excel at transferring task knowledge from one language to another.
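A minimal sketch of the vocabulary-expansion step in the spirit described above is given below, using the Hugging Face Transformers API; the model id and the tiny token list are placeholders, not the actual 10,000 added tokens or training recipe:

```python
# Minimal sketch: add new Persian tokens to an existing tokenizer and resize the
# embedding matrix. Model id and token list are placeholders for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"            # assumed base model; gated access required
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

persian_tokens = ["سلام", "کتاب", "دانشگاه"]      # placeholder subset of new tokens
num_added = tokenizer.add_tokens(persian_tokens)

# Grow the input/output embeddings to cover the enlarged vocabulary; the new rows
# would then be trained on Persian text while reusing the original English weights.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size: {len(tokenizer)}")
```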
Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages
Azadi, Fatemeh, Faili, Heshaam, Dousti, Mohammad Javad
Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, although employing it directly raises two issues: first, untranslated tokens lead to unexpectedly high translation scores, and second, the greedy matching in XLMRScore causes mismatching errors between source and hypothesis tokens. To mitigate these issues, we suggest replacing untranslated words with the unknown token and cross-lingually aligning the pre-trained model so that aligned words are represented closer to each other, respectively. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English-Farsi test dataset introduced in this paper. Experiments show that our method obtains results comparable to the supervised baseline in two zero-shot scenarios, i.e., with less than a 0.01 difference in Pearson correlation, while outperforming unsupervised rivals on all the low-resource language pairs by more than 8% on average.
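For intuition, the greedy-matching score underlying XLMRScore can be sketched as below: embed source and hypothesis with XLM-RoBERTa and match each hypothesis token to its most similar source token. This simplified sketch omits the paper's refinements (unknown-token replacement and cross-lingual alignment of the encoder):

```python
# Simplified sketch of a greedy-matching, BERTScore-style similarity using XLM-R.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def token_embeddings(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def greedy_match_score(source: str, hypothesis: str) -> float:
    src, hyp = token_embeddings(source), token_embeddings(hypothesis)
    sim = hyp @ src.T                                         # cosine similarities
    return sim.max(dim=1).values.mean().item()                # precision-style score

print(greedy_match_score("The cat sat on the mat.", "Die Katze saß auf der Matte."))
```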
TEASEL: A Transformer-Based Speech-Prefixed Language Model
Arjmand, Mehdi, Dousti, Mohammad Javad, Moradi, Hadi
Multimodal language analysis is a burgeoning field of NLP that aims to simultaneously model a speaker's words, acoustic annotations, and facial expressions. In this area, lexicon features usually outperform other modalities because they are pre-trained on large corpora via Transformer-based models. Despite their strong performance, training a new self-supervised learning (SSL) Transformer on any single modality is not usually attainable due to insufficient data, which is the case in multimodal language learning. This work proposes a Transformer-Based Speech-Prefixed Language Model, called TEASEL, to approach the mentioned constraints without training a complete Transformer model. Unlike a conventional language model, TEASEL includes the speech modality as a dynamic prefix alongside the textual modality. This method exploits a conventional pre-trained language model as a cross-modal Transformer model. We evaluated TEASEL on the multimodal sentiment analysis task defined by the CMU-MOSI dataset. Extensive experiments show that our model outperforms unimodal baseline language models by 4% and outperforms the current multimodal state-of-the-art (SoTA) model by 1% in F1-score. Additionally, our proposed method is 72% smaller than the SoTA model.
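To illustrate the speech-prefix idea in general terms, the sketch below projects acoustic features into a language model's embedding space and prepends them to the token embeddings. GPT-2 is used only as a convenient stand-in backbone, and the 74-dimensional feature size is an assumption; this is not the actual TEASEL architecture:

```python
# Minimal sketch: prepend projected speech features as a prefix to token embeddings
# and run them through a pre-trained language model via `inputs_embeds`.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2Model.from_pretrained("gpt2")           # stand-in backbone, not TEASEL's
embed = lm.get_input_embeddings()                # token embedding table

speech_proj = nn.Linear(74, lm.config.n_embd)    # assumed 74-dim acoustic features

speech_feats = torch.randn(1, 20, 74)            # (batch, speech frames, feature dim)
text_ids = tokenizer("this movie was great", return_tensors="pt").input_ids

speech_prefix = speech_proj(speech_feats)        # (1, 20, n_embd)
text_embeds = embed(text_ids)                    # (1, text_len, n_embd)
inputs_embeds = torch.cat([speech_prefix, text_embeds], dim=1)

outputs = lm(inputs_embeds=inputs_embeds)
print(outputs.last_hidden_state.shape)           # (1, 20 + text_len, n_embd)
```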