Optical Character Recognition: Overviews
Revisiting Noise in Natural Language Processing for Computational Social Science
Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.
LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR
Abdellaif, Osama Hosam, Nader, Abdelrahman, Hamdi, Ali
This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA allows the integration of Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures.Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results are that LMRPA achieves superior performance, cutting the processing times by up to 52\%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, where UiPath finished in 18.1 seconds and Automation Anywhere finished in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA outperformed other automation tools conducting the same process by completing tasks in 12.7 seconds, while competitors took over 20 seconds to do the same. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative solution to the existing state-of-the-art RPA models.
A comprehensive survey of oracle character recognition: challenges, benchmarks, and beyond
Li, Jing, Chi, Xueke, Wang, Qiufeng, Wang, Dahan, Huang, Kaizhu, Liu, Yongge, Liu, Cheng-lin
Oracle character recognition-an analysis of ancient Chinese inscriptions found on oracle bones-has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR still remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.
Scaling Automatic Extraction of Pseudocode
Toksoz, Levent, Tan, Gang, Giles, C. Lee
Pseudocode in a scholarly paper provides a concise way to express the algorithms implemented therein. Pseudocode can also be thought of as an intermediary representation that helps bridge the gap between programming languages and natural languages. Having access to a large collection of pseudocode can provide various benefits ranging from enhancing algorithmic understanding, facilitating further algorithmic design, to empowering NLP or computer vision based models for tasks such as automated code generation and optical character recognition (OCR). We have created a large pseudocode collection by extracting nearly 320,000 pseudocode examples from arXiv papers. This process involved scanning over $2.2$ million scholarly papers, with 1,000 of them being manually inspected and labeled. Our approach encompasses an extraction mechanism tailored to optimize the coverage and a validation mechanism based on random sampling to check its accuracy and reliability, given the inherent heterogeneity of the collection. In addition, we offer insights into common pseudocode structures, supported by clustering and statistical analyses. Notably, these analyses indicate an exponential-like growth in the usage of pseudocodes, highlighting their increasing significance.
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.
Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey
Kasem, Mahmoud SalahEldin, Mahmoud, Mohamed, Kang, Hyun-Soo
Optical character recognition (OCR) is a vital process that involves the extraction of handwritten or printed text from scanned or printed images, converting it into a format that can be understood and processed by machines. This enables further data processing activities such as searching and editing. The automatic extraction of text through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. To ensure a thorough evaluation, a meticulous keyword-search methodology is adopted, encompassing a comprehensive analysis of articles relevant to Arabic OCR, including both backward and forward citation reviews. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR. By highlighting these gaps, we shed light on potential areas for future exploration and development, thereby guiding researchers toward promising avenues in the field of Arabic OCR. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders involved in Arabic OCR, ultimately fostering advancements in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language.
MultiQG-TI: Towards Question Generation from Multi-modal Sources
Wang, Zichao, Baraniuk, Richard
We study the new problem of automatic question generation (QG) from multi-modal sources containing images and texts, significantly expanding the scope of most of the existing work that focuses exclusively on QG from only textual sources. We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input in addition to textual input. Specifically, we leverage an image-to-text model and an optical character recognition model to obtain the textual description of the image and extract any texts in the image, respectively, and then feed them together with the input texts to the question generator. We only fine-tune the question generator while keeping the other components fixed. On the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having hundred-times less trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices.
Modelling and Optimisation of Resource Usage in an IoT Enabled Smart Campus
University campuses are essentially a microcosm of a city. They comprise diverse facilities such as residences, sport centres, lecture theatres, parking spaces, and public transport stops. Universities are under constant pressure to improve efficiencies while offering a better experience to various stakeholders including students, staff, and visitors. Nonetheless, anecdotal evidence indicates that campus assets are not being utilised efficiently, often due to the lack of data collection and analysis, thereby limiting the ability to make informed decisions on the allocation and management of resources. Advances in the Internet of Things (IoT) technologies that can sense and communicate data from the physical world, coupled with data analytics and Artificial intelligence (AI) that can predict usage patterns, have opened up new opportunities for organisations to lower cost and improve user experience. This thesis explores this opportunity via theory and experimentation using UNSW Sydney as a living laboratory.
Bridging the Band Gap: What Device Physicists Need to Know About Machine Learning
This article surveys the landscape of semiconductor materials and devices research for the acceleration of machine learning (ML) algorithms. We observe a disconnect between the semiconductor and device physics and engineering communities, and the digital logic and computer hardware architecture communities. The article first provides an overview of the principles of computational complexity and fundamental physical limits to computing and their relation to physical systems. The article then provides an introduction to ML by presenting three key components of ML systems: representation, evaluation, and optimisation. The article then discusses and provides examples of the application of emerging technologies from the demiconductor and device physics domains as solutions to computational problems, alongside a brief overview of emerging devices for computing applications. The article then reviews the landscape of ML accelerators, comparing fixed-function and reprogrammable digital logic with novel devices such as memristors, resistive memories, magnetic memories, and probabilistic bits. We observe broadly lower performance of ML accelerators based on novel devices and materials when compared to those based on digital complimentary metal-oxide semiconductor (CMOS) technology, particularly in the MNIST optical character recognition task, a common ML benchmark, and also highlight the lack of a trend of progress in approaches based on novel materials and devices. Lastly, the article proposes figures of merit for meaningful evaluation and comparison of different ML implementations in the hope of fostering a dialogue between the materials science, device physics, digital logic, and computer architecture communities by providing a common frame of reference for their work.
What is Deep Learning and How Will it Change Text-to-Speech?
Text-to-speech technology has advanced greatly over the past two decades. Once defined by the robotic sounding voices that they produced, text-to-speech voices today can sound just as lifelike as an actual human. Today, making a natural sounding text-to-speech voice is labor intensive and expensive. The two most popular methods, HMM and USS, require hours of recordings from a voice actor. Then, computer programmers with an understanding of linguistics must break down all of that audio into the tiniest possible pieces, called phonemes, and appropriately tag them and define the rules for when each individual unit of speech should be used.