Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound where continuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. Pattern Recognition by Machine. In Computers & Thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.
Text to speech (TTS) has attracted a lot of attention recently due to advancements in deep learning. Neural network-based TTS models (such as Tacotron 2, DeepVoice 3 and Transformer TTS) have outperformed conventional concatenative and statistical parametric approaches in terms of speech quality. Neural network-based TTS models usually first generate a mel-scale spectrogram (or mel-spectrogram) autoregressively from text input and then synthesize speech from the mel-spectrogram using a vocoder. (A spectrogram is a visual representation of frequencies measured over time.) To address the above problems, researchers from Microsoft and Zhejiang University propose FastSpeech, a novel feed-forward network that generates mel-spectrograms with fast generation speed, robustness, controllability, and high quality.
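To make the term "mel-scale" concrete, here is a minimal sketch of the frequency warping behind a mel-spectrogram. It assumes the widely used 2595 * log10(1 + f/700) variant of the mel formula; the excerpt does not say which variant these TTS models use, and the function names are illustrative rather than taken from any library.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to mels (assumed 2595*log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> list:
    """Return n_bands + 1 band edges in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


if __name__ == "__main__":
    # Bands are narrow at low frequencies and widen toward high frequencies.
    for edge in mel_band_edges(0.0, 8000.0, 8):
        print(f"{edge:8.1f} Hz")
```

A mel-spectrogram pools the energy of an ordinary spectrogram into bands like these, so low frequencies get finer resolution than high ones.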
We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains "simple" weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.
Analytic shrinkage is a statistical technique that offers a fast alternative to cross-validation for the regularization of covariance matrices and has appealing consistency properties. We show that the proof of consistency implies bounds on the growth rates of eigenvalues and their dispersion, which are often violated in data. We prove consistency under assumptions which do not restrict the covariance structure and therefore better match real-world data. In addition, we propose an extension of analytic shrinkage, orthogonal complement shrinkage, which adapts to the covariance structure. Finally, we demonstrate the superior performance of our novel approach on data from the domains of finance, spoken letter and optical character recognition, and neuroscience.
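As an illustration of what a closed-form shrinkage estimate looks like, the sketch below shrinks the sample covariance toward a scaled identity, with the shrinkage strength computed from the data in the style of Ledoit and Wolf. This is an assumption made for illustration: the abstract does not give its exact estimator, and the orthogonal complement extension is not shown. All function names are hypothetical.

```python
def sample_covariance(rows):
    """Return (covariance with 1/n normalization, centered rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(p)]
           for i in range(p)]
    return cov, centered


def scaled_frobenius_sq(a, b):
    """Squared Frobenius distance between a and b, divided by the dimension."""
    p = len(a)
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(p) for j in range(p)) / p


def shrink_covariance(rows):
    """Shrink the sample covariance toward mu * I; return (estimate, lambda)."""
    cov, centered = sample_covariance(rows)
    n, p = len(rows), len(cov)
    mu = sum(cov[i][i] for i in range(p)) / p  # average variance
    target = [[mu if i == j else 0.0 for j in range(p)] for i in range(p)]
    d2 = scaled_frobenius_sq(cov, target)  # distance of S from the target
    if d2 == 0.0:
        return cov, 0.0  # already equal to the target, nothing to shrink
    # Estimated sampling noise of S, from per-sample outer products.
    b2_bar = sum(
        scaled_frobenius_sq(
            [[c[i] * c[j] for j in range(p)] for i in range(p)], cov)
        for c in centered) / n ** 2
    lam = min(b2_bar, d2) / d2  # shrinkage strength in [0, 1]
    shrunk = [[lam * target[i][j] + (1.0 - lam) * cov[i][j]
               for j in range(p)] for i in range(p)]
    return shrunk, lam


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.5]]
    estimate, lam = shrink_covariance(data)
    print("lambda =", lam)
    print(estimate)
```

No cross-validation loop is needed: the strength lambda comes from a single pass over the data, which is the speed advantage the abstract refers to.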
Kaspersky researchers recently found malware in an app called CamScanner, a phone-based PDF creator that includes OCR (optical character recognition) and has more than 100 million downloads in Google Play. Various resources call the app by slightly different names such as CamScanner -- Phone PDF Creator and CamScanner-Scanner to scan PDFs. Official app stores such as Google Play are usually considered a safe haven for downloading software. Unfortunately, nothing is 100% safe, and from time to time malware distributors manage to sneak their apps into Google Play. The problem is that even such a powerful company as Google can't thoroughly check millions of apps.
In user-facing applications, displaying calibrated confidence measures---probabilities that correspond to true frequency---can be as important as obtaining high accuracy. We are interested in calibration for structured prediction problems such as speech recognition, optical character recognition, and medical diagnosis. Structured prediction presents new challenges for calibration: the output space is large, and users may issue many types of probability queries (e.g., marginals) on the structured output. We extend the notion of calibration so as to handle various subtleties pertaining to the structured setting, and then provide a simple recalibration method that trains a binary classifier to predict probabilities of interest. We explore a range of features appropriate for structured recalibration, and demonstrate their efficacy on three real-world datasets.
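The idea that calibrated probabilities "correspond to true frequency" can be checked with a simple binning procedure. The sketch below groups binary events by predicted confidence, compares each bin's mean confidence with its empirical frequency, and then recalibrates by histogram binning. Histogram binning is a stand-in chosen for brevity; it is not the feature-based binary classifier the abstract describes, and every name here is illustrative.

```python
def bin_index(p, n_bins):
    """Index of the equal-width bin on [0, 1] that contains probability p."""
    return min(int(p * n_bins), n_bins - 1)


def calibration_table(confidences, outcomes, n_bins=5):
    """Per bin: (mean confidence, empirical frequency, count), or None if empty."""
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        b = bins[bin_index(p, n_bins)]
        b[0] += p
        b[1] += y
        b[2] += 1
    return [(s / c, h / c, c) if c else None for s, h, c in bins]


def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Count-weighted average gap between confidence and observed frequency."""
    n = len(confidences)
    table = calibration_table(confidences, outcomes, n_bins)
    return sum(abs(conf - freq) * count / n
               for conf, freq, count in (row for row in table if row))


def fit_histogram_recalibrator(confidences, outcomes, n_bins=5):
    """Return a function mapping a raw confidence to its bin's observed frequency."""
    table = calibration_table(confidences, outcomes, n_bins)

    def recalibrate(p):
        row = table[bin_index(p, n_bins)]
        return row[1] if row else p  # leave p unchanged for unseen bins

    return recalibrate


if __name__ == "__main__":
    # An overconfident forecaster: says 0.9, but is right 6 times out of 10.
    raw = [0.9] * 10
    truth = [1] * 6 + [0] * 4
    print("ECE before:", expected_calibration_error(raw, truth))
    fix = fit_histogram_recalibrator(raw, truth)
    print("ECE after: ", expected_calibration_error([fix(p) for p in raw], truth))
```

In the structured setting, each event would be a query on the output, such as whether a particular marginal prediction is correct, and the recalibrator would be fit on held-out data rather than on the examples it is evaluated on.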
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.
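The length regulator described above can be sketched in a few lines: each phoneme's hidden state is repeated according to its predicted duration, so that the expanded sequence has one entry per mel-spectrogram frame. This is a minimal sketch under stated assumptions: strings stand in for hidden-state vectors, and the `alpha` scaling factor for speed control is an illustrative parameter that the excerpt itself does not specify.

```python
def length_regulate(phoneme_states, durations, alpha=1.0):
    """Repeat each phoneme state round(alpha * duration) times.

    phoneme_states: one hidden state per phoneme (any Python object here).
    durations: predicted number of mel frames for each phoneme.
    alpha: hypothetical scaling factor; values above 1 lengthen the output
           (slower speech), values below 1 shorten it.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    expanded = []
    for state, duration in zip(phoneme_states, durations):
        repeats = max(0, round(alpha * duration))
        expanded.extend([state] * repeats)
    return expanded


if __name__ == "__main__":
    states = ["h", "e", "l"]      # stand-ins for hidden vectors
    frames = [2, 1, 3]            # predicted frames per phoneme
    print(length_regulate(states, frames))
    print(len(length_regulate(states, frames, alpha=2.0)), "frames at half speed")
```

Because the expanded sequence already matches the target length, every mel frame can be generated in parallel instead of one step at a time.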
We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).
Organizations in all industries have a large number of physical documents. It can be difficult to extract text from a scanned document when it contains formats such as tables, forms, paragraphs, and check boxes. Organizations have been addressing these problems with Optical Character Recognition (OCR) technology, but it requires templates for form extraction and custom workflows. Extracting and analyzing text from images or PDFs is a classic machine learning (ML) and natural language processing (NLP) problem. When extracting the content from a document, you want to maintain the overall context and store the information in a readable and searchable format.
At BoTree Technologies, we leverage the power of AI to help companies make the most of Robotic Process Automation technology and enhance overall efficiency. With our years of experience, we have built tools and algorithms to automate several repetitive tasks. Here are the steps through which our Robotic Process Automation tasks are executed. At BoTree, the robotic automation process begins with extracting information from structured and unstructured sources such as PDFs, images, e-mails, Excel sheets, and several others. This step is followed by processing the data inputs through various methods, including Optical Character Recognition.