saggion
MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish
Bott, Stefan, Saggion, Horacio, Rojas, Nelson Peréz, Salazar, Martin Solis, Ramirez, Saul Calderon
Automatic lexical simplification is a task to substitute lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan. This dataset represents the first of its kind in Catalan and a substantial addition to the sparse data on automatic lexical simplification which is available for Spanish. Specifically, MultiLS-SP is the first dataset for Spanish which includes scalar ratings of the understanding difficulty of lexical items. In addition, we describe experiments with this dataset, which can serve as a baseline for future work on the same data.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Bulgaria > Varna Province > Varna (0.04)
- North America > United States > Maryland (0.04)
- (9 more...)
- Government (0.67)
- Education (0.46)
A Novel Dataset for Financial Education Text Simplification in Spanish
Perez-Rojas, Nelson, Calderon-Ramirez, Saul, Solis-Salazar, Martin, Romero-Sandoval, Mario, Arias-Monge, Monica, Saggion, Horacio
Text simplification, crucial in natural language processing, aims to make texts more comprehensible, particularly for specific groups like visually impaired Spanish speakers, a less-represented language in this field. In Spanish, there are few datasets that can be used to create text simplification systems. Our research has the primary objective to develop a Spanish financial text simplification dataset. We created a dataset with 5,314 complex and simplified sentence pairs using established simplification rules. We also compared our dataset with the simplifications generated from GPT-3, Tuner, and MT5, in order to evaluate the feasibility of data augmentation using these systems. In this manuscript we present the characteristics of our dataset and the findings of the comparisons with other systems. The dataset is available at Hugging face, saul1917/FEINA.
- North America > Costa Rica (0.05)
- Europe > Spain > Galicia > Madrid (0.04)
- South America > Ecuador > Guayas Province > Guayaquil (0.04)
- (4 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
Multilingual Controllable Transformer-Based Lexical Simplification
Sheang, Kim Cheng, Saggion, Horacio
Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Therefore, suggesting simpler alternatives for complex words without compromising meaning would help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system fined-tuned with the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. The evaluation results on three well-known LS datasets -- LexMTurk, BenchLS, and NNSEval -- show that our model outperforms the previous state-of-the-art models like LSBert and ConLS. Moreover, further evaluation of our approach on the part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively when compared with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Moreover, our model obtains performance gains also for Spanish and Portuguese.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.05)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- Asia > China > Hong Kong (0.04)
- (14 more...)
Controllable Lexical Simplification for English
Sheang, Kim Cheng, Ferrés, Daniel, Saggion, Horacio
Fine-tuning Transformer-based approaches have recently shown exciting results on sentence simplification task. However, so far, no research has applied similar approaches to the Lexical Simplification (LS) task. In this paper, we present ConLS, a Controllable Lexical Simplification system fine-tuned with T5 (a Transformer-based model pre-trained with a BERT-style approach and several other tasks). The evaluation results on three datasets (LexMTurk, BenchLS, and NNSeval) have shown that our model performs comparable to LSBert (the current state-of-the-art) and even outperforms it in some cases. We also conducted a detailed comparison on the effectiveness of control tokens to give a clear view of how each token contributes to the model.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > Maryland (0.04)
- (5 more...)
Explain to me like I am five -- Sentence Simplification Using Transformers
Sentence simplification aims at making the structure of text easier to read and understand while maintaining its original meaning. This can be helpful for people with disabilities, new language learners, or those with low literacy. Simplification often involves removing difficult words and rephrasing the sentence. Previous research have focused on tackling this task by either using external linguistic databases for simplification or by using control tokens for desired fine-tuning of sentences. However, in this paper we purely use pre-trained transformer models. We experiment with a combination of GPT-2 and BERT models, achieving the best SARI score of 46.80 on the Mechanical Turk dataset, which is significantly better than previous state-of-the-art results. The code can be found at https://github.com/amanbasu/sentence-simplification.
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- Europe > Germany > Berlin (0.04)
- Asia > China > Heilongjiang Province > Daqing (0.04)
ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
North, Kai, Zampieri, Marcos, Ranasinghe, Tharindu
Lexical simplification (LS) is the task of automatically replacing complex words for easier ones making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.
- South America > Brazil (0.05)
- North America > United States (0.04)
- Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)
- Europe > Spain (0.04)