Optical Character Recognition
Self-paced learning to improve text row detection in historical documents with missing labels
Gaman, Mihaela, Ghadamiyan, Lida, Ionescu, Radu Tudor, Popescu, Marius
An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations at the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
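The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `train` and `predict` callbacks, the IoU threshold of 0.5, and the choice to refresh annotations for every page used in the next iteration are all assumptions made for the example. Ground-truth boxes are given an infinite score so that non-maximum suppression never discards them in favor of a pseudo-box.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scored = Tuple[Box, float]               # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(scored: List[Scored], iou_thr: float = 0.5) -> List[Box]:
    """Greedy non-maximum suppression: keep the highest-scoring boxes first."""
    kept: List[Box] = []
    for box, _ in sorted(scored, key=lambda s: s[1], reverse=True):
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept


def merge_annotations(gt: List[Box], pseudo: List[Scored], iou_thr: float = 0.5) -> List[Box]:
    """Combine ground-truth and pseudo-boxes; ground truth wins on overlap."""
    return nms([(b, float("inf")) for b in gt] + list(pseudo), iou_thr)


def make_batches(gt: Dict[str, List[Box]], k: int) -> List[List[str]]:
    """Sort pages by ground-truth box count (descending) and split into k batches."""
    order = sorted(gt, key=lambda page: len(gt[page]), reverse=True)
    n = len(order)
    return [order[i * n // k:(i + 1) * n // k] for i in range(k)]


def self_paced_train(
    gt: Dict[str, List[Box]],
    k: int,
    train: Callable[[Dict[str, List[Box]]], object],   # hypothetical trainer
    predict: Callable[[object, str], List[Scored]],    # hypothetical detector
):
    batches = make_batches(gt, k)
    labels = {page: list(boxes) for page, boxes in gt.items()}
    active: List[str] = []
    model = None
    for i, batch in enumerate(batches):
        active.extend(batch)  # add pages with fewer annotations
        model = train({page: labels[page] for page in active})
        # Assumption: refresh labels for all pages used at the next iteration.
        upcoming = active + (batches[i + 1] if i + 1 < len(batches) else [])
        for page in upcoming:
            labels[page] = merge_annotations(gt[page], predict(model, page))
    return model, labels
```

In practice `train` would wrap a detector such as YOLOv4 and `predict` would return its scored detections for a page; both are left abstract here.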
Level Up Your AI Skillset and Dive Into The Deep End Of TinyML
Machine learning (ML) is a growing field, gaining popularity in academia, industry, and among makers. We will take a look at some of the available tools to help make machine learning easier, but first, let's review some of the terms commonly used in machine learning. John McCarthy provides a definition of artificial intelligence (AI) in his 2007 Stanford paper, "What is Artificial Intelligence?" In it, he says AI "is the science and engineering of making intelligent machines, especially intelligent computer programs." This definition is extremely broad, as McCarthy defines intelligence as "the computational part of the ability to achieve goals in the world." As a result, any program that achieves some goal can easily be classified as artificial intelligence. In her article "Machine Learning on Microcontrollers" (Make: Vol.
Leaks - chessie rae
Official - OCR - Convert image to text (multi-language). Marks-Man submitted a new resource: [TheJavaSea] OCR - Convert image to text. The OCR application performs basic OCR (optical character recognition) in English and 100 other languages.
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Ren, Yi, Hu, Chenxu, Tan, Xu, Qin, Tao, Zhao, Sheng, Zhao, Zhou, Liu, Tie-Yan
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. Training the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.
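To make the idea of conditional inputs concrete, here is a small sketch of how a duration value might be used: ground-truth durations during training, predicted durations at inference, with each phoneme-level feature repeated for that many output frames. The function names and the plain-list representation are assumptions for illustration only; the actual model operates on tensors and also conditions on pitch and energy.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def choose_durations(
    ground_truth: Sequence[int], predicted: Sequence[int], training: bool
) -> Sequence[int]:
    """Use extracted (ground-truth) durations in training, predictions otherwise."""
    return ground_truth if training else predicted


def expand_by_duration(phoneme_feats: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level feature for its number of frames."""
    if len(phoneme_feats) != len(durations):
        raise ValueError("one duration is required per phoneme feature")
    frames: List[T] = []
    for feat, n_frames in zip(phoneme_feats, durations):
        frames.extend([feat] * n_frames)
    return frames
```

For example, `expand_by_duration(["h", "i"], choose_durations([2, 3], [1, 1], training=True))` produces five frames, two for the first phoneme and three for the second.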
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
Huang, Sung-Feng, Lin, Chyi-Jiunn, Liu, Da-Rong, Chen, Yi-Chen, Lee, Hung-yi
Personalizing a speech synthesis system is a highly desired application, where the system can generate speech in the user's voice from only a few enrolled recordings. There are two main approaches to building such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making them hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a good meta-initialization from which the model can quickly adapt to any few-shot speaker adaptation task. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation method baseline and a speaker encoding method baseline. The evaluation results show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with data from an extra 8371 speakers, Meta-TTS can still outperform the baseline on the LibriTTS dataset and achieve comparable results on the VCTK dataset.
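The MAML idea of learning an initialization that adapts in a few gradient steps can be shown on a toy problem. The sketch below is not a TTS model: it meta-trains a single weight `w` for tasks of the form y = a * x, where each task stands in for a "speaker". All names, data points, and learning rates are assumptions chosen for the example. Because the loss is quadratic, the derivative of the adapted weight with respect to the initial weight is available in closed form, so the full second-order MAML gradient is used.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs


def loss(w: float, data: Data) -> float:
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w: float, data: Data) -> float:
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)


def adapt(w: float, support: Data, alpha: float) -> float:
    """Inner loop: one gradient step on a task's support set."""
    return w - alpha * grad(w, support)


def make_task(slope: float) -> Tuple[Data, Data]:
    """Hypothetical task: fixed inputs, targets generated by y = slope * x."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (0.5, 1.5)]
    return support, query


def maml_train(
    tasks: List[Tuple[Data, Data]],
    w: float = 0.0,
    alpha: float = 0.05,   # inner-loop (adaptation) learning rate
    beta: float = 0.05,    # outer-loop (meta) learning rate
    meta_steps: int = 200,
) -> float:
    for _ in range(meta_steps):
        meta_grad = 0.0
        for support, query in tasks:
            w_adapted = adapt(w, support, alpha)
            # d(w_adapted)/dw = 1 - alpha * (second derivative of support loss)
            curvature = sum(2 * x * x for x, _ in support) / len(support)
            meta_grad += grad(w_adapted, query) * (1.0 - alpha * curvature)
        w -= beta * meta_grad / len(tasks)
    return w
```

After meta-training on slopes 1, 2, and 3, the learned initialization sits between the tasks, and a single call to `adapt` moves it toward the slope of a task it has not seen.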
EasyOCR: A Free Open-source OCR That Supports 80+ Languages
EasyOCR is a free, developer-friendly OCR (optical character recognition) library that supports more than 80 languages across scripts including Latin, Chinese, Arabic, and Cyrillic. EasyOCR is written in Python. It can be installed as a Python package and integrates well with Python frameworks such as Django and Flask. An online demo lets you upload images in different formats and test several languages. It comes with trainer models that can be used to train for new languages, dozens of example datasets for model training, and user-friendly instructions on how to train custom recognition models. It also supports vertical text and PIL images.
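As a rough illustration of how the package is typically used from Python, the sketch below wraps the library's `Reader` and `readtext` calls. The helper names, the confidence threshold, and the image path are hypothetical. It assumes `readtext` returns its default detailed output, a list of (bounding box, text, confidence) entries, and that the package has been installed with `pip install easyocr`.

```python
from typing import Any, List


def build_reader(languages: List[str]) -> Any:
    """Create an EasyOCR reader; the import is deferred so the helper below
    can be exercised without the package installed."""
    import easyocr  # requires: pip install easyocr

    # Recognition models are downloaded on first use.
    return easyocr.Reader(languages)


def extract_text(reader: Any, image_path: str, min_confidence: float = 0.5) -> List[str]:
    """Return recognized strings whose confidence meets the threshold."""
    results = reader.readtext(image_path)  # list of (bbox, text, confidence)
    return [text for _bbox, text, conf in results if conf >= min_confidence]
```

A typical call would be `extract_text(build_reader(["en"]), "scan.png")`, where `scan.png` is a placeholder for your own image file.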
Machine Learning is the Wrong Way to Extract Data From Most Documents
Documents have spent decades stubbornly guarding their contents against software. In the late 1960s, the first OCR (optical character recognition) techniques turned scanned documents into raw text. By indexing and searching the text from these digitized documents, software sped up formerly laborious legal discovery and research projects. Today, Google, Microsoft, and Amazon provide high-quality OCR as part of their cloud services offerings. But documents remain underused in software toolchains, and valuable data languish in trillions of PDFs.
Automation Driven by Artificial Intelligence Booms in Uncertain Economic Times
Veryfi, using artificial intelligence (AI) technology to transform documents into structured data in just seconds, has announced continued strong business momentum and growth in the second quarter. As economic concerns increase, many companies begin to reduce their staff to control costs; 88 percent of job loss in routine occupations occurs within 12 months of a recession. While economic uncertainty continues, Veryfi has emerged as a trusted, reliable partner for companies seeking greater efficiency and stronger customer relationships, continuing its strong annual recurring revenue (ARR) growth. In the second quarter, Veryfi added over a dozen new logos and major accounts including a top supplier of enterprise resource planning software and one of the world's largest CRM/Direct Marketing Network companies. "As companies seek new ways to increase efficiency and manage costs to position themselves for a challenging economy, Veryfi is leading the way, applying AI to automate routine data entry and streamline business processes," said Ernest Semerda, co-founder and CEO of Veryfi.
Amazon Mechanical Turk - Wikipedia
Amazon Mechanical Turk (MTurk) is a crowdsourcing website for businesses (known as Requesters) to hire remotely located "crowdworkers" to perform discrete on-demand tasks that computers are currently unable to do. It is operated under Amazon Web Services, and is owned by Amazon.[1] Employers post jobs known as Human Intelligence Tasks (HITs), such as identifying specific content in an image or video, writing product descriptions, or answering questions, among others. Workers, colloquially known as Turkers or crowdworkers, browse among existing jobs and complete them in exchange for a rate set by the employer. To place jobs, the requesting programs use an open application programming interface (API), or the more limited MTurk Requester site.[2] As of April 2019, Requesters could register from only 49 approved countries.[3]