AITopics | Optical Character Recognition

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. Pattern Recognition by Machine. In Computers & Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

EraseNet: A Recurrent Residual Network for Supervised Document Cleaning

Shinde, Yashowardhan, Kulkarni, Kishore, Kuberkar, Sachin

arXiv.org Artificial Intelligence | Jul-4-2023

Document denoising is considered one of the most challenging tasks in computer vision. Millions of documents are still to be digitized, but document degradation due to natural and man-made factors makes this task very difficult. This paper introduces a supervised approach for cleaning dirty documents using a new fully convolutional auto-encoder architecture. It focuses on restoring documents with defects such as deformities caused by aging, creases left on pages that were xeroxed, random black patches, and lightly visible text, and on improving image quality for better optical character recognition (OCR) system performance. Removing noise from scanned documents is a very important step before the documents are passed to OCR, as this noise can severely affect the performance of an OCR system. The experiments in this paper show promising results: the model is able to learn a variety of ordinary as well as unusual noises and rectify them efficiently.

artificial intelligence, machine learning, recognition, (15 more...)

arXiv.org Artificial Intelligence

2210.00708

Country:

Asia > India > Maharashtra > Pune (0.05)
North America > United States > California > Orange County > Irvine (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Estimating Post-OCR Denoising Complexity on Numerical Texts

Hemmer, Arthur, Brachat, Jérôme, Coustaty, Mickaël, Ogier, Jean-Marc

arXiv.org Artificial Intelligence | Jul-3-2023

Post-OCR processing has significantly improved over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a method to estimate the denoising complexity of a text, evaluate it on several datasets of varying nature, and show that texts of numerical nature have a significant disadvantage. We evaluate the estimated complexity ranking with respect to the error rates of modern-day denoising approaches to show the validity of our estimator.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2307.0102

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > France (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Liu, Sen, Guo, Yiwei, Du, Chenpeng, Chen, Xie, Yu, Kai

arXiv.org Artificial Intelligence | Jun-25-2023

Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain the speaker timbres (i.e. speaker similarity) and eliminate the accents from their first language (i.e. nativeness). In this paper, we demonstrate that vector-quantized (VQ) acoustic features contain less speaker information than mel-spectrograms. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other one is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness.

artificial intelligence, machine learning, optical character recognition, (18 more...)

arXiv.org Artificial Intelligence

2306.14145

Country:

Asia > China > Shanghai > Shanghai (0.05)
North America > Canada > Quebec > Montreal (0.04)
Asia > China > Jiangsu Province (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Resume Information Extraction via Post-OCR Text Processing

Helli, Selahattin Serdar, Tanberk, Senem, Cavsak, Sena Nur

arXiv.org Artificial Intelligence | Jun-23-2023

Information extraction (IE), one of the main tasks of natural language processing (NLP), has recently become more important in the processing of resumes. In studies that extract information from the text of a CV, sentence classification has generally been performed using NLP models. This study aims to extract information by classifying all of the text groups in the resumes after pre-processing steps such as optical character recognition (OCR) and object recognition with the YOLOv8 model. The text dataset consists of 286 resumes collected for 5 different (education, experience, talent, personal and language) job descriptions in the IT industry. The dataset created for object recognition consists of 1198 resumes, which were collected from the open-source internet and labeled as sets of text. BERT, BERT-t, DistilBERT, RoBERTa and XLNet were used as models. F1 score variances were used to compare the model results. In addition, the YOLOv8 model is also evaluated comparatively on its own. In the comparison, DistilBERT showed better results despite having a lower number of parameters than the other models.

data mining, machine learning, pattern recognition, (17 more...)

arXiv.org Artificial Intelligence

2306.13775

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Macao (0.04)
(3 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.87)
Information Technology > Data Science > Data Mining > Text Mining (0.71)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.71)
(4 more...)

Chrome can soon convert PDFs into text it can read aloud

Engadget | Jun-22-2023, 15:44:28 GMT

Google will soon make it easier to interact with PDFs if you have low vision. The company is adding OCR (optical character recognition) technology to Chrome that can convert PDFs to text that makes them more accessible, particularly if you want a screen reader to read them aloud. The tool will also provide image descriptions. The feature will be available in the "coming months," Google says. The company also plans to expand the functionality beyond Chrome later this year, although it hasn't said which platforms might receive the upgrade.

chrome, google, read aloud, (2 more...)

Engadget

Country: North America > United States (0.08)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.61)
Information Technology > Artificial Intelligence > Machine Learning (0.41)

Visual-Aware Text-to-Speech

Zhou, Mohan, Bai, Yalong, Zhang, Wei, Yao, Ting, Zhao, Tiejun, Mei, Tao

arXiv.org Artificial Intelligence | Jun-21-2023

Dynamically synthesizing talking speech that actively responds to a listening head is critical during face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Different from traditional text-to-speech, VA-TTS highlights the impact of the visual modality. On this newly-minted task, we devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify our proposal for generating more natural audio with scenario-appropriate rhythm and prosody.

artificial intelligence, machine learning, optical character recognition, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP49357.2023.10095084

2306.1202

Country:

North America > Canada > Quebec > Montreal (0.05)
Asia > China > Heilongjiang Province > Harbin (0.05)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)

When Vision Fails: Text Attacks Against ViT and OCR

Boucher, Nicholas, Blessing, Jenny, Shumailov, Ilia, Anderson, Ross, Papernot, Nicolas

arXiv.org Artificial Intelligence | Jun-12-2023

While text-based machine learning models that operate on visual inputs of rendered text have become robust against a wide range of existing attacks, we show that they are still vulnerable to visual adversarial examples encoded as text. We use the Unicode functionality of combining diacritical marks to manipulate encoded text so that small visual perturbations appear when the text is rendered. We show how a genetic algorithm can be used to generate visual adversarial examples in a black-box setting, and conduct a user study to establish that the model-fooling adversarial examples do not affect human comprehension. We demonstrate the effectiveness of these attacks in the real world by creating adversarial examples against production models published by Facebook, Microsoft, IBM, and Google.
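The encoding primitive mentioned in this abstract, a Unicode combining diacritical mark appended to a base character, can be illustrated with a minimal sketch. This is not the authors' attack: the genetic search and the target models are omitted, and the helper names `add_combining_mark` and `strip_combining`, as well as the sample string, are hypothetical and chosen only for illustration.

```python
import unicodedata


def add_combining_mark(text: str, index: int, mark: str = "\u0307") -> str:
    """Insert a combining mark right after the character at `index`.

    The default mark is U+0307 (COMBINING DOT ABOVE); the choice is arbitrary.
    """
    # unicodedata.combining() returns a non-zero class for combining characters.
    if unicodedata.combining(mark) == 0:
        raise ValueError("mark must be a combining character")
    return text[: index + 1] + mark + text[index + 1 :]


def strip_combining(text: str) -> str:
    """Drop every combining character, keeping only the base characters."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


original = "invoice"  # hypothetical sample text
perturbed = add_combining_mark(original, 1)  # adds a dot above the "n"

# The encoded strings differ by one code point, although the rendered text
# changes only slightly.
print(len(original), len(perturbed))
print(strip_combining(perturbed) == original)
```

Filtering out combining characters, as `strip_combining` does, would also remove legitimate diacritics from text that was not normalized to a composed form, so it is shown here only to make the perturbation visible, not as a recommended defense.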

adversarial example, machine learning, pattern recognition, (20 more...)

arXiv.org Artificial Intelligence

2306.07033

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > Ontario > Toronto (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
(3 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.50)
(2 more...)

Generating Multilingual Gender-Ambiguous Text-to-Speech Voices

Markopoulos, Konstantinos, Maniati, Georgia, Vamvoukakis, Georgios, Ellinas, Nikolaos, Vardaxoglou, Georgios, Kakoulidis, Panos, Oh, Junkwang, Jho, Gunu, Hwang, Inchul, Chalamandaris, Aimilios, Tsiakoulis, Pirros, Raptis, Spyros

arXiv.org Artificial Intelligence | Jun-11-2023

The gender of any voice user interface is a key element of its perceived identity. Recently, there has been increasing interest in interfaces where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating novel gender-ambiguous TTS voices in a multi-speaker, multilingual setting. This is accomplished by efficiently sampling from a latent speaker embedding space using a proposed gender-aware method. Extensive objective and subjective evaluations clearly indicate that this method is able to efficiently generate a range of novel, diverse voices that are consistent and perceived as more gender-ambiguous than a baseline voice across all the languages examined. Interestingly, the gender perception is found to be robust across two demographic factors of the listeners: native language and gender. To our knowledge, this is the first systematic and validated approach that can reliably generate a variety of gender-ambiguous voices.

artificial intelligence, machine learning, optical character recognition, (17 more...)

arXiv.org Artificial Intelligence

2211.00375

Country: Europe > Greece (0.04)

Genre:

Research Report (0.64)
Questionnaire & Opinion Survey (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.52)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.51)

Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech

Wang, Shijun, Guðnason, Jón, Borth, Damian

arXiv.org Artificial Intelligence | Jun-9-2023

Effective speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks. However, emotional speech samples are more difficult and expensive to acquire compared with Neutral style speech, which causes one issue that most related works unfortunately neglect: imbalanced datasets. Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations. In this paper, we propose an Emotion Extractor to address this issue. We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets. Our empirical results show that (1) for the SER task, the proposed Emotion Extractor surpasses the state-of-the-art baseline on three imbalanced datasets; (2) the produced representations from our Emotion Extractor benefit the TTS model, and enable it to synthesize more expressive speech.

artificial intelligence, machine learning, optical character recognition, (16 more...)

arXiv.org Artificial Intelligence

2306.05709

Country:

Europe > Switzerland > St. Gallen > St. Gallen (0.04)
Europe > Iceland > Capital Region > Reykjavik (0.04)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.86)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
(2 more...)

The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech

Do, Phat, Coler, Matt, Dijkstra, Jelske, Klabbers, Esther

arXiv.org Artificial Intelligence | Jun-1-2023

We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that using articulatory features outperformed using phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model to convert grapheme-to-phone (G2P) in both training and synthesizing, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while performing generally worse, remains a viable option for LRLs less suitable for the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.

articulatory feature, lrl, phone label, (16 more...)

arXiv.org Artificial Intelligence

2306.00535

Country:

Europe > Netherlands (0.05)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
North America > Canada > Quebec > Montreal (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.63)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)
