correct sentence
A Training-free LLM-based Approach to General Chinese Character Error Correction
Zhou, Houquan, Zhang, Bo, Li, Zhenghua, Yan, Ming, Zhang, Min
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction
Alrehili, Ahlam, Alhothali, Areej
Natural language processing (NLP) utilizes text data augmentation to overcome sample size constraints. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. In this study, we chose Arabic to increase the sample size and correct grammatical errors. Arabic is considered one of the languages with limited resources for grammatical error correction (GEC). Furthermore, QALB-14 and QALB-15 are the only datasets used in most Arabic grammatical error correction research, with approximately 20,500 parallel examples, which is considered low compared with other languages. Therefore, this study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT. ChatGPT is used as a data augmenter tool based on a pair of Arabic sentences containing grammatical errors matched with a sentence free of errors extracted from Arabic books, called guide sentences. Multiple steps were involved in establishing our corpus, including the collection and pre-processing of a pair of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus based on the text collected previously, as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured that they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Our corpus contained 49 of errors, including seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600 K tokens.
Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages
Huang, Kuan-Po, Yang, Chih-Kai, Fu, Yu-Kuan, Dunbar, Ewan, Lee, Hung-yi
We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech encoders, including Wav2vec 2.0, HuBERT, XLSR, etc. We examine the impact of pre-training languages and model size on benchmark performance. Notably, though our results demonstrate that speech encoders with multilingual pre-training, exemplified by XLSR, outperform monolingual variants (Wav2vec 2.0, HuBERT) in code-switching scenarios, there is still substantial room for improvement in their code-switching linguistic abilities.
CSCD-IME: Correcting Spelling Errors Generated by Pinyin IME
Hu, Yong, Meng, Fandong, Zhou, Jie
Chinese Spelling Correction (CSC) is a task to detect and correct spelling mistakes in texts. In fact, most of Chinese input is based on pinyin input method, so the study of spelling errors in this process is more practical and valuable. However, there is still no research dedicated to this essential scenario. In this paper, we first present a Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), including 40,000 annotated sentences from real posts of official media on Sina Weibo. Furthermore, we propose a novel method to automatically construct large-scale and high-quality pseudo data by simulating the input through pinyin IME. A series of analyses and experiments on CSCD-IME show that spelling errors produced by pinyin IME hold a particular distribution at pinyin level and semantic level and are challenging enough. Meanwhile, our proposed pseudo-data construction method can better fit this error distribution and improve the performance of CSC systems. Finally, we provide a useful guide to using pseudo data, including the data scale, the data source, and the training strategy.
Assistive Completion of Agrammatic Aphasic Sentences: A Transfer Learning Approach using Neurolinguistics-based Synthetic Dataset
Misra, Rohit, Mishra, Sapna S, Gandhi, Tapan K.
Damage to the inferior frontal gyrus (Broca's area) can cause agrammatic aphasia wherein patients, although able to comprehend, lack the ability to form complete sentences. This inability leads to communication gaps which cause difficulties in their daily lives. The usage of assistive devices can help in mitigating these issues and enable the patients to communicate effectively. However, due to lack of large scale studies of linguistic deficits in aphasia, research on such assistive technology is relatively limited. In this work, we present two contributions that aim to re-initiate research and development in this field. Firstly, we propose a model that uses linguistic features from small scale studies on aphasia patients and generates large scale datasets of synthetic aphasic utterances from grammatically correct datasets. We show that the mean length of utterance, the noun/verb ratio, and the simple/complex sentence ratio of our synthetic datasets correspond to the reported features of aphasic speech. Further, we demonstrate how the synthetic datasets may be utilized to develop assistive devices for aphasia patients. The pre-trained T5 transformer is fine-tuned using the generated dataset to suggest 5 corrected sentences given an aphasic utterance as input. We evaluate the efficacy of the T5 model using the BLEU and cosine semantic similarity scores. Affirming results with BLEU score of 0.827/1.00 and semantic similarity of 0.904/1.00 were obtained. These results provide a strong foundation for the concept that a synthetic dataset based on small scale studies on aphasia can be used to develop effective assistive technology.
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction
Ma, Shirong, Li, Yinghui, Sun, Rongyi, Zhou, Qingyu, Huang, Shulin, Zhang, Ding, Yangning, Li, Liu, Ruiyang, Li, Zhongli, Cao, Yunbo, Zheng, Haitao, Shen, Ying
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in human daily life. Recently, many data-driven approaches are proposed for the development of CGEC research. However, there are two major limitations in the CGEC field: First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between the CGEC models and the real application. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also reflect that our benchmark is an excellent resource for further development of the CGEC field.
Judge a Sentence by Its Content to Generate Grammatical Errors
Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning based two stage method for synthetic data generation for GEC that relaxes this constraint on sentences containing only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Caswell, Isaac, Kreutzer, Julia, Wang, Lisa, Wahab, Ahsan, van Esch, Daan, Ulzii-Orshikh, Nasanbayar, Tapo, Allahsera, Subramani, Nishant, Sokolov, Artem, Sikasote, Claytone, Setyawan, Monang, Sarin, Supheakmungkol, Samb, Sokhar, Sagot, Benoît, Rivera, Clara, Rios, Annette, Papadimitriou, Isabel, Osei, Salomey, Suárez, Pedro Javier Ortiz, Orife, Iroro, Ogueji, Kelechi, Niyongabo, Rubungo Andre, Nguyen, Toan Q., Müller, Mathias, Müller, André, Muhammad, Shamsuddeen Hassan, Muhammad, Nanda, Mnyakeni, Ayanda, Mirzakhalov, Jamshidbek, Matangira, Tapiwanashe, Leong, Colin, Lawson, Nze, Kudugunta, Sneha, Jernite, Yacine, Jenny, Mathias, Firat, Orhan, Dossou, Bonaventure F. P., Dlamini, Sakhile, de Silva, Nisansa, Ballı, Sakine Çabuk, Biderman, Stella, Battisti, Alessia, Baruwa, Ahmed, Bapna, Ankur, Baljekar, Pallavi, Azime, Israel Abebe, Awokoya, Ayodele, Ataman, Duygu, Ahia, Orevaoghene, Ahia, Oghenefego, Agrawal, Sweta, Adeyemi, Mofetoluwa
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.
How Controlled English can Improve Semantic Wikis
The motivation of semantic wikis is to make acquisition, maintenance, and mining of formal knowledge simpler, faster, and more flexible. However, most existing semantic wikis have a very technical interface and are restricted to a relatively low level of expressivity. In this paper, we explain how AceWiki uses controlled English -- concretely Attempto Controlled English (ACE) -- to provide a natural and intuitive interface while supporting a high degree of expressivity. We introduce recent improvements of the AceWiki system and user studies that indicate that AceWiki is usable and useful.
AceWiki: A Natural and Expressive Semantic Wiki
We present AceWiki, a prototype of a new kind of semantic wiki using the controlled natural language Attempto Controlled English (ACE) for representing its content. ACE is a subset of English with a restricted grammar and a formal semantics. The use of ACE has two important advantages over existing semantic wikis. First, we can improve the usability and achieve a shallow learning curve. Second, ACE is more expressive than the formal languages of existing semantic wikis. Our evaluation shows that people who are not familiar with the formal foundations of the Semantic Web are able to deal with AceWiki after a very short learning phase and without the help of an expert.