Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT

Raju, Rohit, Pati, Peeta Basa, Gandheesh, SA, Sannala, Gayatri Sanjana, KS, Suriya

Mar-25-2024, arXiv.org Artificial Intelligence

ORCID: 0000-0003-2376-4591

This is a pre-print version of the paper.

Abstract

Text remains a relevant form of representation for information. Text documents are created either on digital-native platforms or through conversion of other media such as images and speech. While digital-native text is invariably entered through physical or virtual keyboards, technologies such as OCR and speech recognition are used to transform images and speech signals into text. All of these mechanisms of text generation also introduce errors into the captured text. This project analyzes the different kinds of errors that occur in text documents. The work employs two advanced deep neural network based language models, namely BART and MarianMT, to rectify the anomalies present in text. Transfer learning of these models with an available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models are able to bring down the number of erroneous sentences by 20+%, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).

I. Introduction

Text is a natural representation of all the existing languages in the world. Texts help one express oneself and communicate with others. Handwritten texts have been part of history for ages, while digital texts have evolved to keep up with the rapidly growing technology of day-to-day life. It is through texts that one can extend one's knowledge and memory beyond the body into the surrounding environment [1]. Text is available in various forms, from handwritten manuscripts to [...] Texts can be used for personal purposes, such as diary entries and blogs, as well as for professional purposes like advertising and surveying.
From the newspaper one reads in the morning to the social media scrolling before going to bed, people are surrounded by text. It is human nature to categorize any kind of data one receives. With so much text available, people naturally inspect and review the text they need. Text analysis is this process of scanning textual data in order to derive meaning and store information. Most businesses rely on text analysis to extract valuable insights from various raw sources. The feedback received from sources such as emails, chat messages, social media posts, comments and statements, and survey responses helps them in their decision-making strategies.
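The abstract reports results as a reduction in erroneous sentences per error category. A minimal sketch of one way such a figure could be computed is given below. It assumes a sentence counts as erroneous when it does not exactly match its reference, and that the reduction is expressed in percentage points; the paper excerpt does not state either detail. The function names and the toy sentences are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def erroneous_rate(sentences, references):
    """Fraction of sentences that do not exactly match their reference."""
    wrong = sum(s.strip() != r.strip() for s, r in zip(sentences, references))
    return wrong / len(references)


def reduction_by_category(examples):
    """Percentage-point drop in erroneous sentences, per error category.

    `examples` is an iterable of (category, noisy, corrected, reference)
    tuples, where `corrected` is the output of some correction model.
    """
    grouped = defaultdict(list)
    for category, noisy, corrected, reference in examples:
        grouped[category].append((noisy, corrected, reference))

    report = {}
    for category, rows in grouped.items():
        noisy, corrected, refs = zip(*rows)
        before = erroneous_rate(noisy, refs)
        after = erroneous_rate(corrected, refs)
        report[category] = (before - after) * 100
    return report


# Hypothetical toy data, for illustration only (not results from the paper).
examples = [
    ("spelling", "I hav a book.", "I have a book.", "I have a book."),
    ("spelling", "She liks tea.", "She liks tea.", "She likes tea."),
    ("grammar", "He go home.", "He goes home.", "He goes home."),
    ("grammar", "They is late.", "They is late.", "They are late."),
    ("grammar", "We was here.", "We was here.", "We were here."),
    ("grammar", "It rain now.", "It rain now.", "It is raining now."),
]
report = reduction_by_category(examples)
print(report)
```

Exact-match scoring is strict: a corrected sentence that is valid but worded differently from the reference still counts as erroneous, so a real evaluation might prefer a softer metric.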

correction, input sentence, marianmt, (16 more...)

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
  - Massachusetts > Middlesex County
    - Cambridge (0.04)
  - Louisiana > Orleans Parish
    - New Orleans (0.04)
  - Georgia > Fulton County
    - Atlanta (0.04)
  - Colorado > Boulder County
    - Boulder (0.14)
- Europe
  - Poland (0.04)
  - Netherlands (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Finland > Uusimaa
    - Helsinki (0.04)
- Asia
  - China (0.04)
  - Bangladesh (0.04)
  - Sri Lanka > Western Province
    - Colombo > Colombo (0.04)
  - Pakistan > Sindh
    - Karachi Division > Karachi (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)
  - Japan > Honshū
    - Kansai > Kyoto Prefecture > Kyoto (0.04)
  - India > Karnataka
    - Bengaluru (0.04)

Genre:
- Research Report > New Finding (0.46)
- Instructional Material > Course Syllabus & Notes (0.34)

Industry:
- Media (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)