AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

Gunduz, Ahmet, Yuksel, Kamer Ali, Darwish, Kareem, Javadi, Golara, Minazzi, Fabio, Sobieski, Nicola, Bratieres, Sebastien

arXiv.org Artificial IntelligenceFeb-26-2024

Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.

annotation process, annotator, dataset, (13 more...)

arXiv.org Artificial Intelligence

2402.1638

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > Rhode Island (0.04)
Europe > Greece (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.92)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.82)

Add feedback

Volume Regularization for Binary Classification

Neural Information Processing SystemsFeb-16-2024, 06:47:40 GMT

We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains simple'' weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

binary classification, regularization, weight vector

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.69)

Add feedback

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Łajszczak, Mateusz, Cámbara, Guillermo, Li, Yang, Beyhan, Fatih, van Korlaar, Arent, Yang, Fan, Joly, Arnaud, Martín-Cortinas, Álvaro, Abbas, Ammar, Michalski, Adam, Moinet, Alexis, Karlapati, Sri, Muszyńska, Ewa, Guo, Haohan, Putrycz, Bartosz, Gambino, Soledad López, Yoo, Kayeon, Sokolova, Elena, Drugman, Thomas

arXiv.org Artificial IntelligenceFeb-15-2024

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billionparameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-tospeech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

arxiv preprint arxiv, representation, speechcode, (13 more...)

arXiv.org Artificial Intelligence

2402.08093

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Jordan (0.04)
Asia > Maldives (0.04)
(19 more...)

Genre:

Personal (1.00)
Research Report > Experimental Study (0.92)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Segmentation-free Connectionist Temporal Classification loss based OCR Model for Text Captcha Classification

Khatavkar, Vaibhav, Velankar, Makarand, Petkar, Sneha

arXiv.org Artificial IntelligenceFeb-8-2024

Captcha are widely used to secure systems from automatic responses by distinguishing computer responses from human responses. Text, audio, video, picture picture-based Optical Character Recognition (OCR) are used for creating captcha. Text-based OCR captcha are the most often used captcha which faces issues namely, complex and distorted contents. There are attempts to build captcha detection and classification-based systems using machine learning and neural networks, which need to be tuned for accuracy. The existing systems face challenges in the recognition of distorted characters, handling variable-length captcha and finding sequential dependencies in captcha. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification loss technique. The proposed model is trained and tested on a publicly available captcha dataset. The proposed model gives 99.80\% character level accuracy, while 95\% word level accuracy. The accuracy of the proposed model is compared with the state-of-the-art models and proves to be effective. The variable length complex captcha can be thus processed with the segmentation-free connectionist temporal classification loss technique with dependencies which will be massively used in securing the software systems.

captcha, proceedings, sequence, (14 more...)

arXiv.org Artificial Intelligence

2402.05417

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Berlin (0.04)
(19 more...)

Genre: Research Report (0.70)

Industry: Information Technology > Security & Privacy (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Enhancement of Bengali OCR by Specialized Models and Advanced Techniques for Diverse Document Types

Rabby, AKM Shahariar Azad, Ali, Hasmot, Islam, Md. Majedul, Abujar, Sheikh, Rahman, Fuad

arXiv.org Artificial IntelligenceFeb-7-2024

This research paper presents a unique Bengali OCR system with some capabilities. The system excels in reconstructing document layouts while preserving structure, alignment, and images. It incorporates advanced image and signature detection for accurate extraction. Specialized models for word segmentation cater to diverse document types, including computer-composed, letterpress, typewriter, and handwritten documents. The system handles static and dynamic handwritten inputs, recognizing various writing styles. Furthermore, it has the ability to recognize compound characters in Bengali. Extensive data collection efforts provide a diverse corpus, while advanced technical components optimize character and word recognition. Additional contributions include image, logo, signature and table recognition, perspective correction, layout reconstruction, and a queuing module for efficient and scalable processing. The system demonstrates outstanding performance in efficient and accurate text extraction and analysis.

document type, ocr system, recognition, (14 more...)

arXiv.org Artificial Intelligence

2402.05158

Country:

Asia > Singapore (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > District of Columbia > Washington (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

BanglaNet: Bangla Handwritten Character Recognition using Ensembling of Convolutional Neural Network

Saha, Chandrika, Rahman, Md Mostafijur

arXiv.org Artificial IntelligenceFeb-4-2024

Handwritten character recognition is a crucial task because of its abundant applications. The recognition task of Bangla handwritten characters is especially challenging because of the cursive nature of Bangla characters and the presence of compound characters with more than one way of writing. In this paper, a classification model based on the ensembling of several Convolutional Neural Networks (CNN), namely, BanglaNet is proposed to classify Bangla basic characters, compound characters, numerals, and modifiers. Three different models based on the idea of state-of-the-art CNN models like Inception, ResNet, and DenseNet have been trained with both augmented and non-augmented inputs. Finally, all these models are averaged or ensembled to get the finishing model. Rigorous experimentation on three benchmark Bangla handwritten characters datasets, namely, CMATERdb, BanglaLekha-Isolated, and Ekush has exhibited significant recognition accuracies compared to some recent CNN-based research. The top-1 recognition accuracies obtained are 98.40%, 97.65%, and 97.32%, and the top-3 accuracies are 99.79%, 99.74%, and 99.56% for CMATERdb, BanglaLekha-Isolated, and Ekush datasets respectively.

compound character, dataset, recognition, (15 more...)

arXiv.org Artificial Intelligence

2401.08035

Country:

Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > Canada > Ontario > Middlesex County > London (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Natural language guidance of high-fidelity text-to-speech with synthetic annotations

Lyth, Dan, King, Simon

arXiv.org Artificial IntelligenceFeb-2-2024

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.

arxiv, dataset, speech, (16 more...)

arXiv.org Artificial Intelligence

2402.01912

Country: Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.86)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Add feedback

MunTTS: A Text-to-Speech System for Mundari

Gumma, Varun, Hada, Rishav, Yadavalli, Aditya, Gogoi, Pamir, Mondal, Ishani, Seshadri, Vivek, Bali, Kalika

arXiv.org Artificial IntelligenceJan-28-2024

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austo-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and train end-to-end speech models. We also delve into the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.

international conference, mundari, speech, (17 more...)

arXiv.org Artificial Intelligence

2401.15579

Country:

Asia > Indonesia > Bali (0.05)
Asia > India > Jharkhand (0.04)
North America > United States > Maryland (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Add feedback

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Guo, Yiwei, Du, Chenpeng, Ma, Ziyang, Chen, Xie, Yu, Kai

arXiv.org Artificial IntelligenceJan-16-2024

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.

proc, trajectory, voiceflow, (13 more...)

arXiv.org Artificial Intelligence

2309.05027

Country:

Asia > China > Shanghai > Shanghai (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Jiangsu Province (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Improving OCR Quality in 19th Century Historical Documents Using a Combined Machine Learning Based Approach

Fleischhacker, David, Goederle, Wolfgang, Kern, Roman

arXiv.org Artificial IntelligenceJan-15-2024

This paper addresses a major challenge to historical research on the 19th century. Large quantities of sources have become digitally available for the first time, while extraction techniques are lagging behind. Therefore, we researched machine learning (ML) models to recognise and extract complex data structures in a high-value historical primary source, the Schematismus. It records every single person in the Habsburg civil service above a certain hierarchical level between 1702 and 1918 and documents the genesis of the central administration over two centuries. Its complex and intricate structure as well as its enormous size have so far made any more comprehensive analysis of the administrative and social structure of the later Habsburg Empire on the basis of this source impossible. We pursued two central objectives: Primarily, the improvement of the OCR quality, for which we considered an improved structure recognition to be essential; in the further course, it turned out that this also made the extraction of the data structure possible. We chose Faster R-CNN as base for the ML architecture for structure recognition. In order to obtain the required amount of training data quickly and economically, we synthesised Hof- und Staatsschematismus-style data, which we used to train our model. The model was then fine-tuned with a smaller set of manually annotated historical source data. We then used Tesseract-OCR, which was further optimised for the style of our documents, to complete the combined structure extraction and OCR process. Results show a significant decrease in the two standard parameters of OCR-performance, WER and CER (where lower values are better). Combined structure detection and fine-tuned OCR improved CER and WER values by remarkable 71.98 percent (CER) respectively 52.49 percent (WER).

detection, padding, schematismus document, (17 more...)

arXiv.org Artificial Intelligence

2401.07787

Country:

Europe > Austria > Styria > Graz (0.05)
Europe > Central Europe (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(6 more...)

Genre: Research Report > New Finding (0.48)

Industry: Government (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)

Add feedback