Abdine, Hadi
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Shang, Guokan, Abdine, Hadi, Khoubrane, Yousef, Mohamed, Amr, Abbahaddou, Yassine, Ennadir, Sofiane, Momayiz, Imane, Ren, Xuguang, Moulines, Eric, Nakov, Preslav, Vazirgiannis, Michalis, Xing, Eric
We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs such as LLaMA, Jais, and AceGPT; for example, our 9B model achieves a 13% performance gain over a larger 13B model on DarijaMMLU, part of our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers a comprehensive design methodology for instruction tuning in low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.
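As a minimal sketch of how such an instruction-tuned Darija chat model could be queried, the snippet below uses the Hugging Face transformers API with the tokenizer's chat template. The checkpoint path and the example prompt are illustrative assumptions, not the authors' released artifacts.

```python
# Minimal sketch of querying an instruction-tuned Darija chat model with the
# Hugging Face transformers API. The checkpoint path is a placeholder: swap in
# an actual Atlas-Chat release (or any local fine-tuned model) before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/atlas-chat-9b"  # hypothetical local path, not an official identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

# Instructions are formatted with the tokenizer's chat template, mirroring the
# single-turn instruction/response structure typical of instruction datasets.
messages = [{"role": "user", "content": "شنو هي العاصمة ديال المغرب؟"}]  # "What is the capital of Morocco?" in Darija
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```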
Graph Linearization Methods for Reasoning on Graphs with Large Language Models
Xypolopoulos, Christos, Shang, Guokan, Fei, Xiao, Nikolentzos, Giannis, Abdine, Hadi, Evdaimon, Iakovos, Chatzianastasis, Michail, Stamou, Giorgos, Vazirgiannis, Michalis
Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph machine learning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term graph linearization, so that LLMs can handle graphs naturally. We argue that graphs should be linearized meaningfully, reflecting certain properties of natural language text such as local dependency and global alignment, in order to help contemporary LLMs, trained on trillions of textual tokens, understand graphs better. To achieve this, we developed several graph linearization methods based on graph centrality, degeneracy, and node relabeling schemes. We then investigated their effect on LLM performance in graph reasoning tasks. Experimental results on synthetic graphs demonstrate the effectiveness of our methods compared to random linearization baselines. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multi-modal processing using a unified transformer model.
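The sketch below illustrates the general idea of centrality-based linearization: rank nodes by a centrality measure, relabel them by that rank, and serialize the edge list as text an LLM can read. The degree-centrality choice and the textual template are assumptions for illustration; the paper also explores degeneracy-based orderings and other relabeling schemes.

```python
# Minimal sketch of centrality-based graph linearization: order nodes by degree
# centrality, relabel them by that rank, and emit the edge list as a token
# sequence. The textual template here is illustrative, not the paper's exact one.
import networkx as nx

def linearize_by_centrality(graph: nx.Graph) -> str:
    centrality = nx.degree_centrality(graph)
    # Rank nodes from most to least central and relabel them 0..n-1 by rank.
    ranked = sorted(graph.nodes, key=lambda n: centrality[n], reverse=True)
    relabel = {node: rank for rank, node in enumerate(ranked)}
    # Emit each edge with its more central endpoint (lower rank) first,
    # sorted so that the sequence reads from central to peripheral regions.
    edges = sorted(
        (min(relabel[u], relabel[v]), max(relabel[u], relabel[v]))
        for u, v in graph.edges
    )
    return " ".join(f"({u},{v})" for u, v in edges)

if __name__ == "__main__":
    g = nx.karate_club_graph()
    print(linearize_by_centrality(g))
```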
Neural Graph Generator: Feature-Conditioned Graph Generation using Latent Diffusion Models
Evdaimon, Iakovos, Nikolentzos, Giannis, Chatzianastasis, Michail, Abdine, Hadi, Vazirgiannis, Michalis
In recent years, the field of machine learning on graphs has witnessed extensive growth, mainly due to the availability of large amounts of data represented as graphs. Indeed, graphs arise naturally in several application domains such as social networks, chemo-informatics, and bio-informatics. One of the most challenging tasks of machine learning on graphs is that of graph generation [Zhu et al., 2022]. Graph generation has recently attracted a lot of attention, and its main objective is to create novel and realistic graphs. For instance, in chemo-informatics, graph generative models are employed to generate novel, realistic molecular graphs which also exhibit desired properties (e.g., high drug-likeness) [Jin et al., 2018, Zang and Wang, 2020]. Recently, there has been a surge of interest in developing new graph generative models, and most of the proposed models fall into one of the following five families: (1) Auto-Regressive models; (2) Variational Autoencoders; (3) Generative Adversarial Networks; (4) Normalizing Flows; and (5) Diffusion models. These models can capture the complex structural and semantic information of graphs, but focus mainly on specific types of graphs such as molecules [Hoogeboom et al., 2022], proteins [Ingraham et al., 2019], computer programs [Brockschmidt et al., 2019], or patient trajectories [Nikolentzos et al., 2023]. Across different application domains, there is traditionally a need to generate graphs that exhibit specific properties (e.g., degree distribution, node triangle participation, community structure, etc.).
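To make the notion of feature-conditioned generation concrete, the sketch below computes a small vector of graph properties of the kind mentioned above (size, degree statistics, triangle count, clustering) that a conditional generator could take as input. The exact feature set used by the Neural Graph Generator may differ; this is an illustrative assumption.

```python
# Minimal sketch of building a property vector to condition a graph generator on,
# e.g. size, density, degree statistics, triangles, and clustering. The feature
# set is illustrative and not necessarily the one used in the paper.
import networkx as nx
import numpy as np

def property_vector(graph: nx.Graph) -> np.ndarray:
    degrees = np.array([d for _, d in graph.degree()])
    n_triangles = sum(nx.triangles(graph).values()) // 3  # each triangle is counted at 3 nodes
    return np.array([
        graph.number_of_nodes(),
        graph.number_of_edges(),
        degrees.mean(),
        degrees.max(),
        n_triangles,
        nx.average_clustering(graph),
    ], dtype=np.float32)

if __name__ == "__main__":
    g = nx.erdos_renyi_graph(n=50, p=0.1, seed=0)
    print(property_vector(g))  # conditioning input for a feature-conditioned generative model
```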
Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers
Abdine, Hadi, Chatzianastasis, Michail, Bouyioukos, Costas, Vazirgiannis, Michalis
The complexity of large biological systems has led some scientists to regard their full understanding as a nearly inconceivable mission. Challenges at different levels complicate this task, one of which is the prediction of a protein's function. In recent years, significant progress has been made in this field through the development of various machine learning approaches. However, most existing methods formulate the task as a multi-class classification problem, i.e., assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free-text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types, including protein sequences, structures, and textual annotations. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt and empirically demonstrate the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate prediction of protein function. The code, the models, and a demo will be publicly released.
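The following is a minimal sketch of the encoder-decoder idea described above: a graph encoder summarizes a protein structure into an embedding that conditions an autoregressive text decoder. The layer sizes, the simple mean-aggregation GNN, and the toy vocabulary are assumptions for illustration; the actual model combines pretrained GNN and language-model components.

```python
# Minimal sketch of a graph-to-text encoder-decoder: a graph encoder produces an
# embedding of the protein graph, which serves as memory for a transformer decoder
# that generates description tokens. Dimensions and components are illustrative.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """One round of mean-aggregation message passing followed by graph pooling."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim]; adj: dense adjacency with self-loops [num_nodes, num_nodes]
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.lin(adj @ x / deg))   # aggregate neighbours, then transform
        return h.mean(dim=0, keepdim=True)        # pool to a single graph embedding

class Graph2Text(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = GraphEncoder(in_dim, hid_dim)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        layer = nn.TransformerDecoderLayer(d_model=hid_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, x, adj, token_ids):
        memory = self.encoder(x, adj).unsqueeze(0)  # [1, 1, hid_dim] used as decoder memory
        tgt = self.embed(token_ids)                 # [1, seq_len, hid_dim]
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

if __name__ == "__main__":
    num_nodes, in_dim = 8, 16
    x = torch.randn(num_nodes, in_dim)
    adj = (torch.rand(num_nodes, num_nodes) > 0.7).float()
    adj = (adj + adj.T + torch.eye(num_nodes)).clamp(max=1.0)  # symmetrize, add self-loops
    tokens = torch.randint(0, 100, (1, 12))
    logits = Graph2Text(in_dim, 64, vocab_size=100)(x, adj, tokens)
    print(logits.shape)  # torch.Size([1, 12, 100]): per-token vocabulary logits
```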
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Evdaimon, Iakovos, Abdine, Hadi, Xypolopoulos, Christos, Outsios, Stamatis, Vazirgiannis, Michalis, Stamou, Giorgos
The era of transfer learning has revolutionized the fields of Computer Vision and Natural Language Processing, bringing powerful pretrained models with exceptional performance across a variety of tasks. Specifically, Natural Language Processing tasks have been dominated by transformer-based language models. In Natural Language Inference and Natural Language Generation tasks, the BERT model and its variants, as well as the GPT model and its successors, have demonstrated exemplary performance. However, the majority of these models are pretrained and assessed primarily for the English language or on a multilingual corpus. In this paper, we introduce GreekBART, the first Seq2Seq model based on the BART-base architecture and pretrained on a large-scale Greek corpus. We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks. In addition, we examine its performance on two NLG tasks from GreekSUM, a newly introduced summarization dataset for the Greek language. The model, the code, and the new summarization dataset will be publicly available.
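As a minimal sketch, abstractive summarization with a BART-style seq2seq checkpoint could be run through the Hugging Face transformers API as below. The checkpoint path stands in for a GreekBART model fine-tuned on GreekSUM and is a placeholder, not an official identifier; the Greek input is a truncated illustrative example.

```python
# Minimal sketch of abstractive summarization with a BART-style seq2seq checkpoint
# via the Hugging Face transformers API. The checkpoint path is a placeholder for
# a GreekBART model fine-tuned for summarization, not an official identifier.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_PATH = "path/to/greekbart-summarization"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)

article = "Η κυβέρνηση ανακοίνωσε σήμερα νέα μέτρα..."  # "The government announced new measures today..." (truncated)
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_new_tokens=80, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```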