Goto

Collaborating Authors

 Optical Character Recognition


Iterative Learning for Reliable Crowdsourcing Systems

Neural Information Processing Systems

Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker.


Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

arXiv.org Artificial Intelligence

These methods typically involve largescale Scene-Text Visual Question Answering (ST-VQA) aims to pretraining followed by fine-tuning to adapt the model for understand scene text in images and answer questions related question-answering tasks in text-rich scene images, often ignoring to the text content. Most existing methods heavily rely on the the inevitable OCR text recognition challenges. In practice, accuracy of Optical Character Recognition (OCR) systems, scene images may exhibit phenomena such as blurring, and aggressive fine-tuning based on limited spatial location distortion, skewness, or uneven lighting, leading to erroneous information and erroneous OCR text information often leads character recognition by OCR systems, especially in cases to inevitable overfitting. In this paper, we propose a multimodal of low-quality handwriting. Even when OCR systems correctly adversarial training architecture with spatial awareness identify characters, discrete and semantically irrelevant capabilities. Specifically, we introduce an Adversarial OCR recognition results may impact the comprehension of the OCR Enhancement (AOE) module, which leverages adversarial text semantics.


Calibrated Structured Prediction

Neural Information Processing Systems

In user-facing applications, displaying calibrated confidence measures-- probabilities that correspond to true frequency--can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.


How to save papers, photos, and analog music digitally

PCWorld

Do you, like me, have paper documents that have long since been scanned and processed, records or music cassettes that you would like to listen to on your mobile phone, and photo prints that are planned for a digital photo book? Then you will appreciate the two-step instructions in this article, with which you can convert analog media to digital and then process them further. Important insurance papers, contracts, invoices, or simply the page-long letter from your favorite aunt -- there are many paper documents that you want to scan in order to preserve them. If it's even a text that you want to search and edit, you can run OCR software over it after scanning, which recognizes the text so that you can search it and, if necessary, edit it with a standard word processor. With the freeware Not Another PDF Scanner 2 (Naps 2), you have plenty of options for editing and saving the scan after scanning a document.


LOCR: Location-Guided Transformer for Optical Character Recognition

arXiv.org Artificial Intelligence

Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents.To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.LOCR also reduces repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from human.


An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation

arXiv.org Artificial Intelligence

Data availability is crucial for advancing artificial intelligence applications, including voice-based technologies. As content creation, particularly in social media, experiences increasing demand, translation and text-to-speech (TTS) technologies have become essential tools. Notably, the performance of these TTS technologies is highly dependent on the quality of the training data, emphasizing the mutual dependence of data availability and technological progress. This paper introduces an end-to-end tool to generate high-quality datasets for text-to-speech (TTS) models to address this critical need for high-quality data. The contributions of this work are manifold and include: the integration of language-specific phoneme distribution into sample selection, automation of the recording process, automated and human-in-the-loop quality assurance of recordings, and processing of recordings to meet specified formats. The proposed application aims to streamline the dataset creation process for TTS models through these features, thereby facilitating advancements in voice-based technologies.


Volume Regularization for Binary Classification

Neural Information Processing Systems

We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks for a box of large volume that contains simple'' weight vectors which most of are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.


BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

arXiv.org Artificial Intelligence

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billionparameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-tospeech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.


Segmentation-free Connectionist Temporal Classification loss based OCR Model for Text Captcha Classification

arXiv.org Artificial Intelligence

Captcha are widely used to secure systems from automatic responses by distinguishing computer responses from human responses. Text, audio, video, picture picture-based Optical Character Recognition (OCR) are used for creating captcha. Text-based OCR captcha are the most often used captcha which faces issues namely, complex and distorted contents. There are attempts to build captcha detection and classification-based systems using machine learning and neural networks, which need to be tuned for accuracy. The existing systems face challenges in the recognition of distorted characters, handling variable-length captcha and finding sequential dependencies in captcha. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification loss technique. The proposed model is trained and tested on a publicly available captcha dataset. The proposed model gives 99.80\% character level accuracy, while 95\% word level accuracy. The accuracy of the proposed model is compared with the state-of-the-art models and proves to be effective. The variable length complex captcha can be thus processed with the segmentation-free connectionist temporal classification loss technique with dependencies which will be massively used in securing the software systems.


Enhancement of Bengali OCR by Specialized Models and Advanced Techniques for Diverse Document Types

arXiv.org Artificial Intelligence

This research paper presents a unique Bengali OCR system with some capabilities. The system excels in reconstructing document layouts while preserving structure, alignment, and images. It incorporates advanced image and signature detection for accurate extraction. Specialized models for word segmentation cater to diverse document types, including computer-composed, letterpress, typewriter, and handwritten documents. The system handles static and dynamic handwritten inputs, recognizing various writing styles. Furthermore, it has the ability to recognize compound characters in Bengali. Extensive data collection efforts provide a diverse corpus, while advanced technical components optimize character and word recognition. Additional contributions include image, logo, signature and table recognition, perspective correction, layout reconstruction, and a queuing module for efficient and scalable processing. The system demonstrates outstanding performance in efficient and accurate text extraction and analysis.