Collaborating Authors

Optical Character Recognition

Text to Speech Technology: How Voice Computing is Building a More Accessible World


In a world where new technology emerges at an exponential rate, and our daily lives are increasingly mediated by speakers and sound waves, text to speech technology is the latest force evolving the way we communicate. Text to speech refers to a field of computer science that enables the conversion of written text into audible speech. Also known as voice computing, text to speech (TTS) often involves building a database of recorded human speech to train a computer to produce sound waves that resemble the natural sound of a human speaking. This process is called speech synthesis. The technology is trailblazing, and major breakthroughs in the field occur regularly.
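Whatever model produces it, synthesized speech ultimately arrives as a sequence of audio samples. As a minimal, purely illustrative sketch (not a TTS system), the following stdlib-only Python writes a sine tone to a WAV file, showing the waveform-to-file step that sits at the end of any speech synthesis pipeline:

```python
import math
import struct
import wave

def synthesize_tone(path, freq_hz=220.0, duration_s=0.5, rate=16000):
    """Write a sine tone to a 16-bit mono WAV file.

    A real TTS model predicts far richer waveforms, but the output
    format is the same: a sequence of audio samples at a fixed rate.
    """
    n_samples = int(duration_s * rate)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / rate))
        for t in range(n_samples)
    ]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % n_samples, *samples))
    return n_samples
```

The function name and parameters are hypothetical; the point is only that "producing sound waves" reduces to emitting numbered samples at a chosen rate.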

r/MachineLearning - [2006.04558] FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech


Abstract: Advanced text-to-speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated, 2) the duration extracted from the teacher model is not accurate enough, and 3) the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, all of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from the speech waveform and directly take them as conditional inputs during training and use predicted values during inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of full end-to-end training and even faster inference than FastSpeech.
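The train/inference switch the abstract describes - ground-truth variance values (duration, pitch, energy) as conditional inputs during training, predicted values at inference - can be sketched in a toy form. This is an illustrative scalar-feature mock, not the FastSpeech 2 implementation; the function name and shapes are assumptions:

```python
def add_variance_conditioning(phoneme_feats, training, ground_truth=None, predictor=None):
    """Condition phoneme features on variance information (duration,
    pitch, energy), following the FastSpeech 2 recipe: ground-truth
    values extracted from the target speech during training,
    predicted values from learned variance predictors at inference.
    """
    if training:
        variances = ground_truth          # extracted from the waveform
    else:
        variances = predictor(phoneme_feats)  # learned predictor output
    # Each (toy, scalar) phoneme feature is augmented with its variance value.
    return [feat + var for feat, var in zip(phoneme_feats, variances)]
```

In the real model the features are embedding vectors and there is one predictor per variance type; the sketch only shows where the ground-truth/predicted swap happens.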

The AI Pinball Player That Could Beat Humans Within 4 Days


Developers have taught artificial intelligence how to play an arcade pinball machine, and it learned so quickly that it could beat human players within four days. Speaking at Microsoft's developer conference, Build, which is being held virtually this week, Jack Skinner described how he and a team of developers in Sydney used artificial intelligence to control an actual pinball machine. The team took a regular arcade machine and adapted it, using a Windows computer to run the AI software and a Raspberry Pi to control the flipper mechanism inside the pinball machine. Two webcams were mounted on the machine - one pointed at the scoreboard and one pointed down at the table - so that the AI could "see" the table like a human player would. Optical character recognition (OCR) software allowed the computer to read the current score from the pinball machine's electronic display.
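The article doesn't say which OCR software the team used, but reading a score off an electronic display is a constrained recognition problem. As a hedged toy illustration (assuming segment detection has already happened upstream), here is how lit segments of a seven-segment display map to digits:

```python
# Seven-segment display digits, with segments labelled a-g
# (a=top, b=top-right, c=bottom-right, d=bottom, e=bottom-left,
#  f=top-left, g=middle).
SEGMENT_PATTERNS = {
    frozenset("abcdef"): "0",
    frozenset("bc"): "1",
    frozenset("abdeg"): "2",
    frozenset("abcdg"): "3",
    frozenset("bcfg"): "4",
    frozenset("acdfg"): "5",
    frozenset("acdefg"): "6",
    frozenset("abc"): "7",
    frozenset("abcdefg"): "8",
    frozenset("abcdfg"): "9",
}

def read_score(lit_segments_per_digit):
    """Map each digit's set of lit segments to a character and join
    the characters into the numeric score."""
    return int("".join(SEGMENT_PATTERNS[frozenset(s)]
                       for s in lit_segments_per_digit))
```

A production system would instead run general OCR on the webcam frame; this sketch just shows the lookup step once the display has been segmented.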

AI-Powered Biotech Can Help Deploy a Vaccine In Record Time


The magnitude of the Covid-19 pandemic will largely depend on how quickly safe and effective vaccines and treatments can be developed and tested. Many assume a widely available vaccine is years away, if ever. Others believe that a 12- to 18-month development cycle is a given. Our best bet to reduce even that record-breaking timeline is by using artificial intelligence. The problem is twofold: discovering the right set of molecules among billions of possibilities, and then waiting for clinical trials. These processes ordinarily take several years, but AI holds the key to radically shortening both.

Judge Dismisses Lawsuit Over Mail Delivery

U.S. News

The apartment complexes near Western Kentucky University sued the United States Postal Service and a postmaster in January after the agency began delivering mail in bulk to property management offices instead of tenants' mailboxes. The change came after the Postal Service reclassified the residences as dormitories, according to the lawsuit.

Shape Context descriptor and fast characters recognition


Matching shapes can be a much harder task than matching images; consider, for example, recognition of handwritten text or fingerprints. Most of the shapes we try to match are heavily distorted: you will likely never write two identical letters in your entire life. Seen from the perspective of a person-identification algorithm based on handwriting matching, that variability is a nightmare. Of course, in the age of neural networks and RNNs the problem can also be solved in ways other than straight mathematics, but heavy, memory-hungry models like NNs are not always an option.
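The shape context descriptor handles this distortion by describing each point of a shape through a log-polar histogram of where the shape's other points lie relative to it. A minimal stdlib-only sketch of one such descriptor (bin counts and radii are illustrative choices, not the canonical parameters):

```python
import math

def shape_context(points, anchor, n_r=3, n_theta=4, r_max=10.0):
    """Log-polar histogram of where the other points of a shape lie
    relative to one anchor point (a single shape-context descriptor).

    n_r radial bins are spaced logarithmically out to r_max;
    n_theta angular bins split the full circle.
    """
    hist = [[0] * n_theta for _ in range(n_r)]
    for (x, y) in points:
        dx, dy = x - anchor[0], y - anchor[1]
        r = math.hypot(dx, dy)
        if r == 0 or r > r_max:
            continue  # skip the anchor itself and far outliers
        # log-spaced radial bin index
        r_bin = min(n_r - 1, int(n_r * math.log1p(r) / math.log1p(r_max)))
        theta = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(n_theta * theta / (2 * math.pi)))
        hist[r_bin][t_bin] += 1
    return hist
```

Two shapes are then compared by matching their points so that corresponding histograms are similar (typically via a chi-squared distance), which is what makes the descriptor tolerant of the small deformations handwriting always has.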

How to Use Optical Character Recognition for Security System Development


Applying machine learning techniques to security solutions is one of the current AI trends. This article will cover an approach to developing OCR-based software using deep learning algorithms. This software can be used to analyze and process identification documents such as a US driver's license as part of a security system for verifying identity. OCR (Optical Character Recognition) technology is already used by machine learning companies for business process automation and optimization, with use cases ranging from Dropbox using it to parse through pictures, to Google Street View identifying street signs, to searching through text messages and translating text in real time. In this particular case, OCR can be used as part of an automated biometric verification system.
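After the OCR step extracts raw text from a document image, a verification system still has to validate the recognized fields. As a hedged sketch of that downstream step - the field names and patterns here are hypothetical, since real license formats vary by state - extracted fields can be checked against expected formats and an expiry date:

```python
import re
from datetime import datetime

# Hypothetical field checks; real license layouts and number formats
# vary by issuing state, so these patterns are illustrative only.
FIELD_PATTERNS = {
    "dob": r"\d{2}/\d{2}/\d{4}",
    "license_no": r"[A-Z]\d{7}",   # e.g. one letter + seven digits
    "expiry": r"\d{2}/\d{2}/\d{4}",
}

def validate_ocr_fields(fields, today):
    """Check OCR-extracted ID fields against expected formats and
    confirm the document has not expired. Returns a list of problems
    (an empty list means the fields passed)."""
    problems = []
    for name, pattern in FIELD_PATTERNS.items():
        if not re.fullmatch(pattern, fields.get(name, "")):
            problems.append(f"bad format: {name}")
    if "bad format: expiry" not in problems:
        if datetime.strptime(fields["expiry"], "%m/%d/%Y") < today:
            problems.append("expired")
    return problems
```

In a full pipeline this validation sits after the deep-learning OCR model and before the biometric match against the document photo.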

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.
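The length regulator described above has a simple core idea: repeat each phoneme's hidden representation for as many mel frames as its predicted duration, so the expanded sequence lines up with the target spectrogram. A toy sketch of that expansion (operating on phoneme labels rather than hidden states for clarity):

```python
def length_regulator(phonemes, durations):
    """Expand each phoneme by its predicted duration (in mel frames)
    so the expanded sequence matches the length of the target
    mel-spectrogram sequence for parallel generation."""
    expanded = []
    for phoneme, duration in zip(phonemes, durations):
        expanded.extend([phoneme] * duration)
    return expanded
```

Scaling all durations up or down is also what gives FastSpeech its voice-speed control.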

Utopia Global Releases Cloud-Based Intelligent Data Capture and Control Software Platform; Delivers High-Quality Enriched Asset Master Data Leveraging Machine Learning


IDCC uniquely leverages optical character recognition, Utopia's advanced machine learning code, intelligent online web search, and document search. Beginning simply with only a photo of a manufacturer's nameplate, IDCC can produce complete and accurate material and asset information. Manufacturer and model data is organized according to the ISO 14224 standard and can be delivered via a variety of easy-to-integrate methods, including SAP Asset Intelligence Network. The cloud-based nature of IDCC enables cost-effective, rapid deployments by large and small organizations alike. IDCC can be deployed in pure cloud environments, such as SAP Intelligent Asset Management, or hybrid deployments using SAP Master Data Governance, enterprise asset management extension by Utopia.

Introducing The AI Reading Machine That Reconstructs Books As Illustrated Haikus


The dynamic design duo of Karen Ann Donnachie and Andy Simionato are set to receive the Tokyo Type Directors Club award for their AI reading machine project - a machine that essentially transforms books into short haikus accompanied by related images. It does this by using computer vision and optical character recognition to 'read' books. Then, with machine learning and natural language processing, it selects a poetic combination of words while erasing the rest to form an artsy-looking haiku. While doing this, the reading machine also uses Google to search for images that relate to those words. Donnachie and Simionato have released a series of books that we know and love with a slight twist.
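The "keep a few words, erase the rest" step can be mimicked in miniature. This sketch is an assumption about the general approach, not the duo's actual method: a crude vowel-group syllable estimator and a greedy pass that keeps in-order words until each haiku line's syllable budget is filled, discarding (erasing) the words that don't fit:

```python
import re

def estimate_syllables(word):
    """Very rough syllable count: number of vowel groups (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def haiku_lines(words, targets=(5, 7, 5)):
    """Greedily keep words in reading order, filling each line until
    its syllable budget is met exactly; words that would overshoot
    are erased, mimicking the reading machine's erasure step."""
    lines, it = [], iter(words)
    for target in targets:
        line, count = [], 0
        for word in it:
            s = estimate_syllables(word)
            if count + s <= target:
                line.append(word)
                count += s
                if count == target:
                    break
        lines.append(" ".join(line))
    return lines
```

Real syllable counting in English needs a pronunciation dictionary (vowel groups miscount words like "poem" or "time"), and the actual project adds NLP to pick *poetic* words rather than the first ones that fit.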