AITopics | correct sentence

Collaborating Authors

correct sentence

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Training-free LLM-based Approach to General Chinese Character Error Correction

Zhou, Houquan, Zhang, Bo, Li, Zhenghua, Yan, Ming, Zhang, Min

arXiv.org Artificial IntelligenceFeb-21-2025

Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.

dataset, wang, zhang, (16 more...)

arXiv.org Artificial Intelligence

2502.15266

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction

Alrehili, Ahlam, Alhothali, Areej

arXiv.org Artificial IntelligenceNov-7-2024

Natural language processing (NLP) utilizes text data augmentation to overcome sample size constraints. Increasing the sample size is a natural and widely used strategy for alleviating these challenges. In this study, we chose Arabic to increase the sample size and correct grammatical errors. Arabic is considered one of the languages with limited resources for grammatical error correction (GEC). Furthermore, QALB-14 and QALB-15 are the only datasets used in most Arabic grammatical error correction research, with approximately 20,500 parallel examples, which is considered low compared with other languages. Therefore, this study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT. ChatGPT is used as a data augmenter tool based on a pair of Arabic sentences containing grammatical errors matched with a sentence free of errors extracted from Arabic books, called guide sentences. Multiple steps were involved in establishing our corpus, including the collection and pre-processing of a pair of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus based on the text collected previously, as a guide for generating sentences with multiple types of errors. By engaging linguistic experts to review and validate the automatically generated sentences, we ensured that they were correct and error-free. The corpus was validated and refined iteratively based on feedback provided by linguistic experts to improve its accuracy. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the types of errors in the Tibyan corpus. Our corpus contained 49 of errors, including seven types: orthography, morphology, syntax, semantics, punctuation, merge, and split. The Tibyan corpus contains approximately 600 K tokens.

chatgpt, corpus, correction, (14 more...)

arXiv.org Artificial Intelligence

2411.04588

Country:

North America > United States > Arizona (0.04)
Asia > Middle East > Qatar (0.04)
North America > United States > Minnesota (0.04)
(3 more...)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

Huang, Kuan-Po, Yang, Chih-Kai, Fu, Yu-Kuan, Dunbar, Ewan, Lee, Hung-yi

arXiv.org Artificial IntelligenceDec-16-2023

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to demonstrate how the code-switching abilities of speech encoders can be assessed in a zero-resource manner. Our experiments encompass a variety of well-known speech encoders, including Wav2vec 2.0, HuBERT, XLSR, etc. We examine the impact of pre-training languages and model size on benchmark performance. Notably, though our results demonstrate that speech encoders with multilingual pre-training, exemplified by XLSR, outperform monolingual variants (Wav2vec 2.0, HuBERT) in code-switching scenarios, there is still substantial room for improvement in their code-switching linguistic abilities.

baseline system, language model, speech encoder, (12 more...)

arXiv.org Artificial Intelligence

2310.03018

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.50)

Add feedback

CSCD-IME: Correcting Spelling Errors Generated by Pinyin IME

Hu, Yong, Meng, Fandong, Zhou, Jie

arXiv.org Artificial IntelligenceNov-24-2022

Chinese Spelling Correction (CSC) is a task to detect and correct spelling mistakes in texts. In fact, most of Chinese input is based on pinyin input method, so the study of spelling errors in this process is more practical and valuable. However, there is still no research dedicated to this essential scenario. In this paper, we first present a Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), including 40,000 annotated sentences from real posts of official media on Sina Weibo. Furthermore, we propose a novel method to automatically construct large-scale and high-quality pseudo data by simulating the input through pinyin IME. A series of analyses and experiments on CSCD-IME show that spelling errors produced by pinyin IME hold a particular distribution at pinyin level and semantic level and are challenging enough. Meanwhile, our proposed pseudo-data construction method can better fit this error distribution and improve the performance of CSC systems. Finally, we provide a useful guide to using pseudo data, including the data scale, the data source, and the training strategy.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2211.08788

Country: Asia > China (0.04)

Genre:

Research Report (0.70)
Summary/Review (0.54)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback

Assistive Completion of Agrammatic Aphasic Sentences: A Transfer Learning Approach using Neurolinguistics-based Synthetic Dataset

Misra, Rohit, Mishra, Sapna S, Gandhi, Tapan K.

arXiv.org Artificial IntelligenceNov-10-2022

Damage to the inferior frontal gyrus (Broca's area) can cause agrammatic aphasia wherein patients, although able to comprehend, lack the ability to form complete sentences. This inability leads to communication gaps which cause difficulties in their daily lives. The usage of assistive devices can help in mitigating these issues and enable the patients to communicate effectively. However, due to lack of large scale studies of linguistic deficits in aphasia, research on such assistive technology is relatively limited. In this work, we present two contributions that aim to re-initiate research and development in this field. Firstly, we propose a model that uses linguistic features from small scale studies on aphasia patients and generates large scale datasets of synthetic aphasic utterances from grammatically correct datasets. We show that the mean length of utterance, the noun/verb ratio, and the simple/complex sentence ratio of our synthetic datasets correspond to the reported features of aphasic speech. Further, we demonstrate how the synthetic datasets may be utilized to develop assistive devices for aphasia patients. The pre-trained T5 transformer is fine-tuned using the generated dataset to suggest 5 corrected sentences given an aphasic utterance as input. We evaluate the efficacy of the T5 model using the BLEU and cosine semantic similarity scores. Affirming results with BLEU score of 0.827/1.00 and semantic similarity of 0.904/1.00 were obtained. These results provide a strong foundation for the concept that a synthetic dataset based on small scale studies on aphasia can be used to develop effective assistive technology.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2211.05557

Country:

Oceania > Australia (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.41)

Add feedback

Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction

Ma, Shirong, Li, Yinghui, Sun, Rongyi, Zhou, Qingyu, Huang, Shulin, Zhang, Ding, Yangning, Li, Liu, Ruiyang, Li, Zhongli, Cao, Yunbo, Zheng, Haitao, Shen, Ying

arXiv.org Artificial IntelligenceOct-19-2022

Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in human daily life. Recently, many data-driven approaches are proposed for the development of CGEC research. However, there are two major limitations in the CGEC field: First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between the CGEC models and the real application. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also reflect that our benchmark is an excellent resource for further development of the CGEC field.

artificial intelligence, grammatical error, natural language, (15 more...)

arXiv.org Artificial Intelligence

2210.10442

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Inner Mongolia > Hohhot (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)
(13 more...)

Genre: Research Report (0.82)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

Judge a Sentence by Its Content to Generate Grammatical Errors

Rahman, Chowdhury Rafeed

arXiv.org Artificial IntelligenceAug-20-2022

Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem, and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning based two stage method for synthetic data generation for GEC that relaxes this constraint on sentences containing only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.

computational linguistic, correct sentence, correction, (13 more...)

arXiv.org Artificial Intelligence

2208.09693

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.64)

Add feedback

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Caswell, Isaac, Kreutzer, Julia, Wang, Lisa, Wahab, Ahsan, van Esch, Daan, Ulzii-Orshikh, Nasanbayar, Tapo, Allahsera, Subramani, Nishant, Sokolov, Artem, Sikasote, Claytone, Setyawan, Monang, Sarin, Supheakmungkol, Samb, Sokhar, Sagot, Benoît, Rivera, Clara, Rios, Annette, Papadimitriou, Isabel, Osei, Salomey, Suárez, Pedro Javier Ortiz, Orife, Iroro, Ogueji, Kelechi, Niyongabo, Rubungo Andre, Nguyen, Toan Q., Müller, Mathias, Müller, André, Muhammad, Shamsuddeen Hassan, Muhammad, Nanda, Mnyakeni, Ayanda, Mirzakhalov, Jamshidbek, Matangira, Tapiwanashe, Leong, Colin, Lawson, Nze, Kudugunta, Sneha, Jernite, Yacine, Jenny, Mathias, Firat, Orhan, Dossou, Bonaventure F. P., Dlamini, Sakhile, de Silva, Nisansa, Ballı, Sakine Çabuk, Biderman, Stella, Battisti, Alessia, Baruwa, Ahmed, Bapna, Ankur, Baljekar, Pallavi, Azime, Israel Abebe, Awokoya, Ayodele, Ataman, Duygu, Ahia, Orevaoghene, Ahia, Oghenefego, Agrawal, Sweta, Adeyemi, Mofetoluwa

arXiv.org Artificial IntelligenceMar-22-2021

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.

computational linguistic, dataset, translation, (16 more...)

arXiv.org Artificial Intelligence

2103.12028

Country:

Africa > South Africa (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(25 more...)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (0.67)
Media (0.46)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

How Controlled English can Improve Semantic Wikis

Kuhn, Tobias

arXiv.org Artificial IntelligenceJul-7-2009

The motivation of semantic wikis is to make acquisition, maintenance, and mining of formal knowledge simpler, faster, and more flexible. However, most existing semantic wikis have a very technical interface and are restricted to a relatively low level of expressivity. In this paper, we explain how AceWiki uses controlled English -- concretely Attempto Controlled English (ACE) -- to provide a natural and intuitive interface while supporting a high degree of expressivity. We introduce recent improvements of the AceWiki system and user studies that indicate that AceWiki is usable and useful.

acewiki, artificial intelligence, natural language, (16 more...)

arXiv.org Artificial Intelligence

0907.1245

Country: Europe > Switzerland (0.14)

Genre:

Research Report (0.82)
Questionnaire & Opinion Survey (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Web > Semantic Web (0.70)

Add feedback

AceWiki: A Natural and Expressive Semantic Wiki

Kuhn, Tobias

arXiv.org Artificial IntelligenceJul-29-2008

We present AceWiki, a prototype of a new kind of semantic wiki using the controlled natural language Attempto Controlled English (ACE) for representing its content. ACE is a subset of English with a restricted grammar and a formal semantics. The use of ACE has two important advantages over existing semantic wikis. First, we can improve the usability and achieve a shallow learning curve. Second, ACE is more expressive than the formal languages of existing semantic wikis. Our evaluation shows that people who are not familiar with the formal foundations of the Semantic Web are able to deal with AceWiki after a very short learning phase and without the help of an expert.

artificial intelligence, knowledge management, natural language, (14 more...)

arXiv.org Artificial Intelligence

0807.4618

Country: Europe > Switzerland (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Knowledge Management (1.00)
Information Technology > Communications > Web > Semantic Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback