 Optical Character Recognition


State of Optical Character Recognition in 2022, Part 1 (Artificial Intelligence)

#artificialintelligence

Abstract: Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision.


Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

arXiv.org Artificial Intelligence

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features that can be fine-tuned on individual samples within seconds. Our objective evaluation and human study show that it is possible to clone the voice of a speaker and the prosody of a spoken reference independently, without any degradation in quality and with high similarity to both the original voice and the original prosody. All of our code and trained models are available, alongside static and interactive demos.
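As a rough illustration of why utterance-level normalization helps separate prosody from speaker identity, the sketch below rescales a pitch contour by its own utterance mean. The abstract does not spell out the normalization scheme, so dividing by the mean of the voiced frames is an assumption made here, and the function name `normalize_contour` and the sample values are hypothetical.

```python
from statistics import fmean


def normalize_contour(values):
    """Scale a per-frame contour (e.g. pitch in Hz) by its utterance mean.

    Assumption: unvoiced frames are marked with 0.0 and are left at 0.0;
    only voiced frames contribute to the mean.
    """
    voiced = [v for v in values if v > 0.0]
    if not voiced:
        # Nothing voiced in this utterance: return an all-zero contour.
        return [0.0 for _ in values]
    mean = fmean(voiced)
    return [v / mean if v > 0.0 else 0.0 for v in values]


# Two hypothetical speakers producing the same melody at different registers.
low_voice = [100.0, 110.0, 0.0, 90.0]
high_voice = [200.0, 220.0, 0.0, 180.0]

low_norm = normalize_contour(low_voice)
high_norm = normalize_contour(high_voice)
```

After normalization both contours describe the same relative rise and fall, so the prosody can be transferred while the absolute register is left to the speaker embedding.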


AI Inference Software Fundamentals: Getting Started with Optical Character Recognition

#artificialintelligence

You can find the full source code to today's demo in a Kaggle notebook where it is formatted as a series of very short, numbered blocks. For the sake of brevity, this post will walk through only the most significant snippets of the notebook's code. But, of course, you can study the full notebook at your leisure by block number and learn how we trained a neural network from scratch to achieve a level of accuracy not possible a decade ago. In blocks 1 to 3, the notebook sets up the Python environment for TensorFlow. In blocks 4 to 14, the notebook loads the MNIST database, which we will use to train a neural network model that can recognize handwritten digits. The new and exciting part Intel offers today is how these models can be optimized to run more efficiently and quickly on Intel hardware.


This text-to-speech converter is on sale for 50% off

#artificialintelligence

TL;DR: A lifetime subscription to Micmonster AI Voiceovers is on sale for £53.31, saving you 50% on list price. Whether you're a content creator, website manager, YouTuber, or a web marketer, it could benefit you to learn how to do voiceover for videos. You may be able to record yourself, but there's likely a finite amount of voice variation you have at your disposal, and it's time-consuming. However, an AI tool has fewer limits. With Micmonster AI Voiceovers, you can hear your text read in 500 voices, and it's just £53.31 for a lifetime subscription -- the best price you'll find on the internet.


What Is Hyperautomation?

#artificialintelligence

Gartner has anointed "Hyperautomation" one of the top 10 trends for 2022. Is it a real trend, or just a collection of buzzwords? As a trend, it's not performing well on Google; it shows little long-term growth, if any, and gets nowhere near as many searches as terms like "Observability" and "Generative Adversarial Networks." And it's never bubbled up far enough into our consciousness to make it into our monthly Trends piece. However, that skeptical conclusion is too simplistic. Hyperautomation may just be another ploy in the game of buzzword bingo, but we need to look behind the game to discover what's important. There seems to be broad agreement that hyperautomation is the combination of Robotic Process Automation with AI. Natural language generation and natural language understanding are frequently mentioned, too, but they're subsumed under AI. So is optical character recognition (OCR), something that's old hat now, but is one of the first successful applications of AI. Using AI to discover tasks that can be automated also comes up frequently. While we don't find the multiplication of buzzwords endearing, it's hard to argue that adding AI to anything is uninteresting, and specifically adding AI to automation.


GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

arXiv.org Artificial Intelligence

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/



Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

arXiv.org Artificial Intelligence

Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large unlabeled speech dataset for pre-training. By leveraging wav2vec2.0 representations, unlabeled speech can greatly improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, thanks to pre-training on an unlabeled multi-speaker speech corpus.


More businesses need to use AI

#artificialintelligence

As a startup that has been operational for five years, specializing in conversational artificial intelligence (AI), Vbee is a pioneer in providing services such as artificial voice (vbee.vn). However, the path to bringing AI to reality is still tough. First, businesses must be persuaded to apply new technological solutions to improve productivity and reduce costs. Vbee has many solutions such as KYC (Know Your Customer), artificial switchboard, artificial voice, artificial MC, OCR (optical character recognition), voice biometrics, chatbot, call bot and artificial virtual assistant, packaged and ready to be used. But businesses are hesitant to use them.


Chandojnanam: A Sanskrit Meter Identification and Utilization System

arXiv.org Artificial Intelligence

We present Chandojñānam, a web-based Sanskrit meter (Chanda) identification and utilization system. In addition to the core functionality of identifying meters, it sports a friendly user interface to display the scansion, which is a graphical representation of the metrical pattern. The system supports identification of meters from uploaded images by using optical character recognition (OCR) engines in the backend. It is also able to process entire text files at a time. The text can be processed in two modes, either by treating it as a list of individual lines, or as a collection of verses. When a line or a verse does not correspond exactly to a known meter, Chandojñānam is capable of finding fuzzy (i.e., approximate and close) matches based on sequence matching. This opens up the scope of a meter-based correction of erroneous digital corpora. The system is available for use at https://sanskrit.iitk.ac.in/jnanasangraha/chanda/, and the source code in the form of a Python library is made available at https://github.com/hrishikeshrt/chanda/.
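The exact-then-fuzzy lookup described in the abstract can be sketched with Python's standard `difflib.SequenceMatcher`. This is a minimal illustration, not the Chandojñānam library's API: the function `identify_meter`, the `min_ratio` threshold, and the three-entry meter table are assumptions made for the example. Each pattern encodes syllable weights, with `G` for guru (heavy) and `L` for laghu (light).

```python
from difflib import SequenceMatcher

# Illustrative table only; a real system would hold far more meters.
# 'G' = guru (heavy syllable), 'L' = laghu (light syllable).
KNOWN_METERS = {
    "Indravajra": "GGLGGLLGLGG",
    "Upendravajra": "LGLGGLLGLGG",
    "Vasantatilaka": "GGLGLLLGLLGLGG",
}


def identify_meter(pattern, min_ratio=0.8):
    """Return (meter, similarity) pairs for a scanned line.

    An exact match is reported with similarity 1.0. Otherwise, meters whose
    SequenceMatcher ratio reaches min_ratio (an assumed threshold) are
    returned, best match first.
    """
    exact = [name for name, p in KNOWN_METERS.items() if p == pattern]
    if exact:
        return [(name, 1.0) for name in exact]

    scored = [
        (name, SequenceMatcher(None, pattern, p).ratio())
        for name, p in KNOWN_METERS.items()
    ]
    close = [(name, ratio) for name, ratio in scored if ratio >= min_ratio]
    return sorted(close, key=lambda item: item[1], reverse=True)


# A line missing its final syllable, as an OCR or typing error might produce.
candidates = identify_meter("GGLGGLLGLG")
```

Here the damaged line no longer matches any meter exactly, but the closest candidate still ranks first, which is the kind of signal a meter-based corpus correction pass could use.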