 Optical Character Recognition


Text To Speech Explained from basic

#artificialintelligence

As the title suggests, in this blog we are going to learn about text-to-speech (TTS) synthesis. What is the first thing that comes to mind when you hear "text to speech"? For me, it's Alexa, Google Home, Siri, and the many other conversational bots that are currently on an exponential rise. Advances in deep learning research have helped us generate human-like voices, so let's see how we can use that. I'll start with a few definitions, but if you want to understand them in more depth, read this blog first.


Microsoft is testing Xbox party chat accessibility features

Engadget

Microsoft has announced that speech transcription and text-to-speech synthesis are coming to Xbox party chat, starting today for Xbox Insiders. The new features will make it easier for players with hearing or speech difficulties to participate in party chat and are part of an Xbox initiative to improve accessibility. Both features can be found in the "ease of access" tab under "game and chat transcription." With speech-to-text transcription, words spoken in a party are converted into text displayed in an adjustable overlay, as shown above. With text-to-speech enabled, anything you type into party text chat will be read by a synthetic voice to the rest of the party, with a choice of several voices per language.


Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

arXiv.org Machine Learning

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder that produces mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters, and makes this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. We will make the code publicly available shortly.


This $4 Mac app extracts text from images and videos for you

Engadget

If you've ever gone through the painstaking process of transcribing text from a video, or begrudgingly typing up the copy from an image, you know the struggle. Not only is this a tedious activity, it's also prone to human error and a total time waster, to boot. Leave the manual work behind and join the thousands of Mac users who simplify their workflows with TextSniper, on sale now for just $4. TextSniper's optical character recognition (OCR) software works fast to detect any text from your screen, whether that's screenshots, images, videos, PDFs or digital documents. Instead of poring over, say, a video, you'll be able to instantly convert that speech into text. Then, you're a simple copy-and-paste away from dropping the content into your notes, messaging app and anywhere else you please.


Gartner says low-code, RPA, and AI driving growth in 'hyperautomation'

#artificialintelligence

Research firm Gartner estimates the market for hyperautomation-enabling technologies will reach $596 billion in 2022, up nearly 24% from the $481.6 billion in 2020. Gartner is expecting significant growth for technology that enables organizations to rapidly identify, vet, and automate as many processes as possible and says it will become a "condition of survival" for enterprises. Hyperautomation-enabling technologies include robotic process automation (RPA), low-code application platforms (LCAP), AI, and virtual assistants. As organizations look for ways to automate the digitization and structuring of data and content, technologies that automate content ingestion, such as signature verification tools, optical character recognition, document ingestion, conversational AI, and natural language technology (NLT), will be in high demand. For example, these tools could be used to automate the process of digitizing and sorting paper records.


10 Cool Cloud AI And ML Services You Need To Know About

#artificialintelligence

Google Cloud's machine learning-powered Document AI platform -- which already has been used to process tens of billions of pages of documents for government agencies and the lending and insurance industries among others -- became generally available last week, along with Lending DocAI and Procurement DocAI. The serverless Document AI platform is a unified console for document processing that allows users to quickly access Google Cloud's form, table and invoice parsers, tools and offerings -- including Procurement DocAI and Lending DocAI -- with a unified API. It uses artificial intelligence/machine learning (AI/ML) to classify, extract and enrich data from scanned and digital documents at scale, including structured data from unstructured documents, making it easier to understand and analyze. DocAI solutions feature Google technologies including computer vision, optical character recognition and natural language processing, which create pre-trained models for high-value and -volume documents, and Google Knowledge Graph to validate and enhance fields in documents. Research and advisory firm Gartner predicts AI will be the top category that determines IT infrastructure decisions by 2025, driving a tenfold growth in compute requirements. Half of all enterprises will have AI orchestration platforms by 2025 to operationalize AI, according to Gartner, up from less than 10 percent in 2020.


Bidding adieu to manual document processing

#artificialintelligence

Traditional document processing units required staff members to manually read and key in relevant information from purchase orders, quotes, invoices, remittances and other documents, every day, year after year. This process lowers both staff morale and productivity, and often leads to unwanted errors and increased costs. Intelligent document processing (IDP) is a next-generation approach that uses automation to quickly extract information from business documents. Here are 10 things you need to know about IDP and how it can enable end-to-end process automation for your organization. The first wave of IDP was driven by template-based optical character recognition (OCR) technology.


OCR: Give Eyes to Your Chatbot

#artificialintelligence

We live in a world where robots are increasingly common. There are chatbots on company websites and machines that build cars and other equipment by themselves. These agents can perform more and more tasks, and OCR is one of them. This article tells you what OCR is, its applications, and how your company's chatbots can use it. OCR stands for Optical Character Recognition.


Google Photos finally lets PC users copy text from an image

PCWorld

One of the handiest Google Photos features just landed on the desktop--via your browser--where it could be even more valuable. The mobile version of Photos supports a technology called Google Lens. In 2018, Lens introduced Optical Character Recognition (OCR) technology that can automatically copy any text found in an image, allowing you to paste it elsewhere for easy saving. As 9to5Google spotted over the weekend, that Lens OCR feature is now rolling out to desktop browsers, and that rocks. Enabling OCR in Google Photos makes it easy-peasy to take a picture of a document, book, or anything else on your phone, open it in your browser, and quickly copy its contents into an Office file.


Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

arXiv.org Artificial Intelligence

Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
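
To make the "28 times faster than real time" figure concrete, here is a minimal Python sketch of the arithmetic behind such a claim. The function names are hypothetical and are not taken from the Diff-TTS paper or its code; the only input from the abstract is the speedup of 28.

```python
def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of speech when
    synthesis runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float) -> float:
    """Real-time factor (processing time per second of audio).
    Values below 1.0 mean faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    speedup = 28.0  # figure reported in the abstract
    # Assumed example: one minute of speech (not a number from the paper).
    print(synthesis_time_seconds(60.0, speedup))
    print(real_time_factor(speedup))
```

Under this reading, a model with a speedup of 28 would need a little over two seconds of compute for one minute of audio, which corresponds to a real-time factor of about 0.036.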