Bucharest
CBM: Curriculum by Masking
Jarca, Andrei, Croitoru, Florinel-Alin, Ionescu, Radu Tudor
We propose Curriculum by Masking (CBM), a novel state-of-the-art curriculum learning strategy that effectively creates an easy-to-hard training schedule via patch (token) masking, offering significant accuracy improvements over the conventional training regime and previous curriculum learning (CL) methods. CBM leverages gradient magnitudes to prioritize the masking of salient image regions via a novel masking algorithm and a novel masking block. Our approach enables controlling sample difficulty via the patch masking ratio, generating an effective easy-to-hard curriculum by gradually introducing harder samples as training progresses. CBM operates with two easily configurable parameters, i.e. the number of patches and the curriculum schedule, making it a versatile curriculum learning approach for object recognition and detection. We conduct experiments with various neural architectures, ranging from convolutional networks to vision transformers, on five benchmark data sets (CIFAR-10, CIFAR-100, ImageNet, Food-101 and PASCAL VOC), to compare CBM with conventional as well as curriculum-based training regimes. Our results reveal the superiority of our strategy compared with the state-of-the-art curriculum learning regimes. We also observe improvements in transfer learning contexts, where CBM surpasses previous work by considerable margins in terms of accuracy. We release our code for free non-commercial use at https://github.com/CroitoruAlin/CBM.
EVA-Score: Evaluation of Long-form Summarization on Informativeness through Extraction and Validation
Fan, Yuchen, Zhong, Xin, Wang, Chengsi, Wu, Gaoche, Zhou, Bowen
Summarization is a fundamental task in natural language processing (NLP) and since large language models (LLMs), such as GPT-4 and Claude, come out, increasing attention has been paid to long-form summarization whose input sequences are much longer, indicating more information contained. The current evaluation metrics either use similarity-based metrics like ROUGE and BERTScore which rely on similarity and fail to consider informativeness or LLM-based metrics, lacking quantitative analysis of information richness and are rather subjective. In this paper, we propose a new evaluation metric called EVA-Score using Atomic Fact Chain Generation and Document-level Relation Extraction together to automatically calculate the informativeness and give a definite number as an information score. Experiment results show that our metric shows a state-of-the-art correlation with humans. We also re-evaluate the performance of LLMs on long-form summarization comprehensively from the information aspect, forecasting future ways to use LLMs for long-form summarization.
PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts
Rogoz, Ana-Cristina, Nechita, Maria Ilinca, Ionescu, Radu Tudor
We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.
LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation
Yin, Yongjing, Zeng, Jiali, Li, Yafu, Meng, Fandong, Zhang, Yue
The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data curation, the design of which is driven by the coverage of senses found in bilingual dictionaries. The construction process comprises data retrieval from an existing corpus and data augmentation that supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our approach outperforms the established baselines on the WMT2022 test sets and also exhibits remarkable performance in tasks related to word sense disambiguation and specialized terminology translation. These results underscore the effectiveness of LexMatcher in enhancing LLM-based machine translation. The code, data, and models are available at https://github.com/ARIES-LM/Lexmatcher-MT.git.
A Look Into Training Large Language Models on Next Generation Datacenters
Gherghescu, Alexandru M., Bădoiu, Vlad-Andrei, Agache, Alexandru, Dumitru, Mihai-Valentin, Vasilescu, Iuliu, Mantu, Radu, Raiciu, Costin
Is it still worth doing computer networking research? What are relevant problems in this space given the supremacy of hyperscalers in deployed large networks? We take an unconventional approach to finding relevant research directions, by starting from Microsoft's plans to build a $100 billion datacenter for ML. Our goal is to understand what models could be trained in such a datacenter, as well as the high-level challenges one may encounter in doing so. We first examine the constraints imposed by cooling and power requirements for our target datacenter and find that it is infeasible to build in a single location. We use LLM scaling laws to determine that we could train models of 50T or 100T. Finally, we examine how distributed training might work for these models, and what the networking requirements are. We conclude that building the datacenter and training such models is technically possible, but this requires a novel NIC-based multipath transport along with a redesign of the entire training stack, outlining a research agenda for our community in the near future.
"Vorbe\c{s}ti Rom\^ane\c{s}te?" A Recipe to Train Powerful Romanian LLMs with English Instructions
Masala, Mihai, Ilie-Ablachim, Denis C., Dima, Alexandru, Corlatescu, Dragos, Zavelca, Miruna, Olaru, Ovio, Terian, Simina, Terian, Andrei, Leordeanu, Marius, Velicu, Horia, Popescu, Marius, Dascalu, Mihai, Rebedea, Traian
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.
Prose-to-P4: Leveraging High Level Languages
Dumitru, Mihai-Valentin, Bădoiu, Vlad-Andrei, Raiciu, Costin
Languages such as P4 and NPL have enabled a wide and diverse range of networking applications that take advantage of programmable dataplanes. However, software development in these languages is difficult. To address this issue, high-level languages have been designed to offer programmers powerful abstractions that reduce the time, effort and domain-knowledge required for developing networking applications. These languages are then translated by a compiler into P4/NPL code. Inspired by the recent success of Large Language Models (LLMs) in the task of code generation, we propose to raise the level of abstraction even higher, employing LLMs to translate prose into high-level networking code. We analyze the problem, focusing on the motivation and opportunities, as well as the challenges involved and sketch out a roadmap for the development of a system that can generate high-level dataplane code from natural language instructions. We present some promising preliminary results on generating Lucid code from natural language.
MAiDE-up: Multilingual Deception Detection of GPT-generated Hotel Reviews
Ignat, Oana, Xu, Xiaomeng, Mihalcea, Rada
Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Maurya, Avinash, Underwood, Robert, Rafique, M. Mustafa, Cappello, Franck, Nicolae, Bogdan
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Thus, LLMs need to be checkpointed frequently so that they can be rolled back to a stable state and subsequently fine-tuned. However, given the large sizes of LLMs, a straightforward checkpointing solution that directly writes the model parameters and optimizer state to persistent storage (e.g., a parallel file system), incurs significant I/O overheads. To address this challenge, in this paper we study how to reduce the I/O overheads for enabling fast and scalable checkpointing for LLMs that can be applied at high frequency (up to the granularity of individual iterations) without significant impact on the training process. Specifically, we introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and optimizer state shards remain immutable for extended periods of time, which makes it possible to copy their content in the background with minimal interference during the training process. We evaluate our approach at scales of up to 180 GPUs using different model sizes, parallelism settings, and checkpointing frequencies. The results show up to 48$\times$ faster checkpointing and 2.2$\times$ faster end-to-end training runtime compared with the state-of-art checkpointing approaches.
Introducing Brain-like Concepts to Embodied Hand-crafted Dialog Management System
Joublin, Frank, Ceravola, Antonello, Sandu, Cristian
Along with the development of chatbot, language models and speech technologies, there is a growing possibility and interest of creating systems able to interface with humans seamlessly through natural language or directly via speech. In this paper, we want to demonstrate that placing the research on dialog system in the broader context of embodied intelligence allows to introduce concepts taken from neurobiology and neuropsychology to define behavior architecture that reconcile hand-crafted design and artificial neural network and open the gate to future new learning approaches like imitation or learning by instruction. To do so, this paper presents a neural behavior engine that allows creation of mixed initiative dialog and action generation based on hand-crafted models using a graphical language. A demonstration of the usability of such brain-like inspired architecture together with a graphical dialog model is described through a virtual receptionist application running on a semi-public space.