augmented sentence
Data Augmentation Techniques for Process Extraction from Scientific Publications
We present data augmentation techniques for process extraction tasks in scientific publications. We cast the process extraction task as a sequence labeling task where we identify all the entities in a sentence and label them according to their process-specific roles. The proposed method attempts to create meaningful augmented sentences by utilizing (1) process-specific information from the original sentence, (2) role label similarity, and (3) sentence similarity. We demonstrate that the proposed methods substantially improve the performance of the process extraction model trained on chemistry domain datasets, up to 12.3 points improvement in performance accuracy (F-score). The proposed methods could potentially reduce overfitting as well, especially when training on small datasets or in a low-resource setting such as in chemistry and other scientific domains.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- (3 more...)
A Comprehensive Study on NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs
Jahan, Md Saroar, Oussalah, Mourad, Beddia, Djamila Romaissa, Mim, Jhuma kabir, Arhab, Nabil
The surge of interest in data augmentation within the realm of NLP has been driven by the need to address challenges posed by hate speech domains, the dynamic nature of social media vocabulary, and the demands for large-scale neural networks requiring extensive training data. However, the prevalent use of lexical substitution in data augmentation has raised concerns, as it may inadvertently alter the intended meaning, thereby impacting the efficacy of supervised machine learning models. In pursuit of suitable data augmentation methods, this study explores both established legacy approaches and contemporary practices such as Large Language Models (LLM), including GPT in Hate Speech detection. Additionally, we propose an optimized utilization of BERT-based encoder models with contextual cosine similarity filtration, exposing significant limitations in prior synonym substitution methods. Our comparative analysis encompasses five popular augmentation techniques: WordNet and Fast-Text synonym replacement, Back-translation, BERT-mask contextual augmentation, and LLM. Our analysis across five benchmarked datasets revealed that while traditional methods like back-translation show low label alteration rates (0.3-1.5%), and BERT-based contextual synonym replacement offers sentence diversity but at the cost of higher label alteration rates (over 6%). Our proposed BERT-based contextual cosine similarity filtration markedly reduced label alteration to just 0.05%, demonstrating its efficacy in 0.7% higher F1 performance. However, augmenting data with GPT-3 not only avoided overfitting with up to sevenfold data increase but also improved embedding space coverage by 15% and classification F1 score by 1.4% over traditional methods, and by 0.8% over our method.
- Europe > Finland > Northern Ostrobothnia > Oulu (0.04)
- Asia > India (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- (6 more...)
- Information Technology (0.47)
- Media > Film (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling
Kesgin, Himmet Toprak, Amasyali, Mehmet Fatih
Data augmentation is an effective technique for improving the performance of machine learning models. However, it has not been explored as extensively in natural language processing (NLP) as it has in computer vision. In this paper, we propose a novel text augmentation method that leverages the Fill-Mask feature of the transformer-based BERT model. Our method involves iteratively masking words in a sentence and replacing them with language model predictions. We have tested our proposed method on various NLP tasks and found it to be effective in many cases. Our results are presented along with a comparison to existing augmentation methods. Experimental results show that our proposed method significantly improves performance, especially on topic classification datasets.
- North America > United States > New York (0.05)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
GDA: Generative Data Augmentation Techniques for Relation Extraction Tasks
Hu, Xuming, Liu, Aiwei, Tan, Zeqi, Zhang, Xin, Zhang, Chenwei, King, Irwin, Yu, Philip S.
Relation extraction (RE) tasks show promising performance in extracting relations from two entities mentioned in sentences, given sufficient annotations available during training. Such annotations would be labor-intensive to obtain in practice. Existing work adopts data augmentation techniques to generate pseudo-annotated sentences beyond limited annotations. These techniques neither preserve the semantic consistency of the original sentences when rule-based augmentations are adopted, nor preserve the syntax structure of sentences when expressing relations using seq2seq models, resulting in less diverse augmentations. In this work, we propose a dedicated augmentation technique for relational texts, named GDA, which uses two complementary modules to preserve both semantic consistency and syntax structures. We adopt a generative formulation and design a multi-tasking solution to achieve synergies. Furthermore, GDA adopts entity hints as the prior knowledge of the generative model to augment diverse sentences. Experimental results in three datasets under a low-resource setting showed that GDA could bring {\em 2.0\%} F1 improvements compared with no augmentation technique. Source code and data are available.
- Asia > China > Hong Kong (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Greece > East Macedonia and Thrace > Komotini (0.04)
- (3 more...)
SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning
Mao, Zhenyu, Zhu, Dongsheng, Lu, Jinghui, Zhao, Rui, Tan, Fei
Contrastive learning methods achieve state-of-the-art results in unsupervised sentence representation learning. Although playing essential roles in contrastive learning, data augmentation methods applied on sentences have not been fully explored. Current SOTA method SimCSE utilizes a simple dropout mechanism as continuous augmentation which outperforms discrete augmentations such as cropping, word deletion and synonym replacement. To understand the underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data augmentation methods: balance of semantic consistency and expression diversity. Based on the hypothesis, we propose three simple yet effective discrete sentence augmentation methods, i.e., punctuation insertion, affirmative auxiliary and double negation. The punctuation marks, auxiliaries and negative words act as minimal noises in lexical level to produce diverse sentence expressions. Unlike traditional augmentation methods which randomly modify the sentence, our augmentation rules are well designed for generating semantically consistent and grammatically correct sentences. We conduct extensive experiments on both English and Chinese semantic textual similarity datasets. The results show the robustness and effectiveness of the proposed methods.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
- North America > Cuba (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification
Rastogi, Chetanya, Mofid, Nikka, Hsiao, Fang-I
This paper tackles one of the greatest limitations in Machine Learning: Data Scarcity. Specifically, we explore whether high accuracy classifiers can be built from small datasets, utilizing a combination of data augmentation techniques and machine learning algorithms. In this paper, we experiment with Easy Data Augmentation (EDA) and Backtranslation, as well as with three popular learning algorithms, Logistic Regression, Support Vector Machine (SVM), and Bidirectional Long Short-Term Memory Network (Bi-LSTM). For our experimentation, we utilize the Wikipedia Toxic Comments dataset so that in the process of exploring the benefits of data augmentation, we can develop a model to detect and classify toxic speech in comments to help fight back against cyberbullying and online harassment. Ultimately, we found that data augmentation techniques can be used to significantly boost the performance of classifiers and are an excellent strategy to combat lack of data in NLP problems.
CERT: Contrastive Self-supervised Learning for Language Understanding
Fang, Hongchao, Wang, Sicheng, Zhou, Meng, Ding, Jiayuan, Xie, Pengtao
Pretrained language models such as BERT, GPT have shown great effectiveness in language understanding. The auxiliary predictive tasks in existing pretraining approaches are mostly defined on tokens, thus may not be able to capture sentence-level semantics very well. To address this issue, we propose CERT: Contrastive self-supervised Encoder Representations from Transformers, which pretrains language representation models using contrastive self-supervised learning at the sentence level. CERT creates augmentations of original sentences using back-translation. Then it finetunes a pretrained language encoder (e.g., BERT) by predicting whether two augmented sentences originate from the same sentence. CERT is simple to use and can be flexibly plugged into any pretraining-finetuning NLP pipeline. We evaluate CERT on 11 natural language understanding tasks in the GLUE benchmark where CERT outperforms BERT on 7 tasks, achieves the same performance as BERT on 2 tasks, and performs worse than BERT on 2 tasks. On the averaged score of the 11 tasks, CERT outperforms BERT. The data and code are available at https://github.com/UCSD-AI4H/CERT