AITopics

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.63)
Information Technology > Artificial Intelligence > Assistive Technologies (0.63)

Hsieh, Cheng-Ping, Ghosh, Subhankar, Ginsburg, Boris

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

arXiv.org Artificial Intelligence, Nov-1-2022

Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges. Fine-tuning usually requires several hours of high-quality speech per speaker. There is also a risk that fine-tuning will degrade the quality of speech synthesis for previously learned speakers. In this paper, we propose an alternative approach for TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added to the original network. The original weights are frozen, and only the adapters are fine-tuned on speech from the new speaker. This parameter-efficient fine-tuning approach produces a new model with a high level of parameter sharing with the original model. Our experiments on the LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
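The freeze-the-backbone, train-only-the-adapters idea can be illustrated with a toy model. This is a minimal sketch and not the authors' architecture: the bottleneck-with-residual adapter shape, the zero initialization of the up-projection, and all names and sizes (`FrozenLayer`, `Adapter`, `dim=64`, `bottleneck=4`) are assumptions chosen for illustration, since the abstract does not specify them.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class FrozenLayer:
    """Stand-in for one pretrained backbone layer; its weights are never updated."""

    trainable = False

    def __init__(self, dim, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)

    def num_params(self):
        return len(self.W) * len(self.W[0])


class Adapter:
    """Hypothetical bottleneck adapter: x + W_up @ relu(W_down @ x)."""

    trainable = True

    def __init__(self, dim, bottleneck, rng):
        self.W_down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero-initialized up-projection: the adapter starts as an identity map,
        # so inserting it does not change the backbone's output before training.
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.W_down, x)]
        delta = matvec(self.W_up, hidden)
        return [a + b for a, b in zip(x, delta)]

    def num_params(self):
        return 2 * len(self.W_down) * len(self.W_down[0])


def build_model(dim=64, bottleneck=4, layers=3, seed=0):
    """Interleave frozen backbone layers with small trainable adapters."""
    rng = random.Random(seed)
    modules = []
    for _ in range(layers):
        modules.append(FrozenLayer(dim, rng))
        modules.append(Adapter(dim, bottleneck, rng))
    return modules


def forward(modules, x):
    for module in modules:
        x = module(x)
    return x


def trainable_fraction(modules):
    """Share of all parameters that would be updated for a new speaker."""
    total = sum(m.num_params() for m in modules)
    trainable = sum(m.num_params() for m in modules if m.trainable)
    return trainable / total
```

In this toy setup only the `Adapter` parameters would receive gradient updates; the backbone layers are shared unchanged across speakers, which is the source of the parameter sharing the abstract describes.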

artificial intelligence, machine learning, module

2211.00585

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.93)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

De Nardin, Axel, Zottin, Silvia, Paier, Matteo, Foresti, Gian Luca, Colombi, Emanuela, Piciarelli, Claudio

Efficient few-shot learning for pixel-precise handwritten document layout analysis

arXiv.org Artificial Intelligence, Oct-27-2022

Layout analysis is a task of utmost importance in ancient handwritten document analysis and represents a fundamental step toward simplifying subsequent tasks such as optical character recognition and automatic transcription. However, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm. While these systems achieve very good performance on this task, the drawback is that pixel-precise text labeling of the entire training set is a very time-consuming process, which makes this type of information rarely available in a real-world scenario. In the present paper, we address this problem by proposing an efficient few-shot learning framework that achieves performance comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset.

artificial intelligence, machine learning, pixel-precise handwritten document layout analysis

doi: 10.1109/WACV56688.2023.00367

2210.1557

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.53)

arXiv.org Artificial Intelligence, Oct-27-2022

Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation

Morioka, Nobuyuki, Zen, Heiga, Chen, Nanxin, Zhang, Yu, Ding, Yifan

Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-tuning most if not all of the parameters of a pretrained multi-speaker backbone model. However, serving hundreds of fine-tuned neural TTS models is expensive, as each of them requires a significant footprint and separate computational resources (e.g., accelerators, memory). To scale speaker-adapted neural TTS voices to hundreds of speakers while preserving naturalness and speaker similarity, this paper proposes a parameter-efficient few-shot speaker adaptation method, in which the backbone model is augmented with trainable lightweight modules called residual adapters. This architecture allows the backbone model to be shared across different target speakers. Experimental results show that the proposed approach can achieve competitive naturalness and speaker similarity compared to full fine-tuning approaches, while requiring only ~0.1% of the backbone model parameters for each speaker.
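The footprint argument can be made concrete with rough arithmetic. The sketch below compares the number of parameters to store when every speaker gets a fully fine-tuned copy against a shared backbone plus per-speaker adapters at roughly 0.1% of the backbone size. The backbone size and speaker count are hypothetical values picked for illustration; only the ~0.1% ratio comes from the abstract.

```python
def serving_parameter_counts(backbone_params, num_speakers, adapter_fraction=0.001):
    """Return (full_fine_tuning_total, shared_backbone_total) parameter counts.

    full_fine_tuning_total: one complete model copy per speaker.
    shared_backbone_total: one shared backbone plus a small adapter per speaker.
    """
    adapter_params = round(backbone_params * adapter_fraction)
    full_fine_tuning_total = backbone_params * num_speakers
    shared_backbone_total = backbone_params + adapter_params * num_speakers
    return full_fine_tuning_total, shared_backbone_total


# Hypothetical example: a 10M-parameter backbone serving 100 speakers.
full_total, shared_total = serving_parameter_counts(10_000_000, 100)
print(full_total)    # 1000000000
print(shared_total)  # 11000000
```

Under these assumed numbers the shared-backbone setup stores about 90 times fewer parameters, and the gap widens as the number of speakers grows.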

artificial intelligence, machine learning, natural language

2210.15868

Country:

North America > United States (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

#artificialintelligence, Oct-25-2022, 22:15:44 GMT

Image Analysis 4.0 with new API endpoint and OCR model in preview

Enterprises and hobbyists alike have been using Azure Computer Vision's Image Analysis API to garner various insights from their images. These insights help power scenarios such as digital asset management, search engine optimization (SEO), image content moderation, and alt text for accessibility, among others. We are thrilled to announce the preview release of Computer Vision Image Analysis 4.0, which combines existing and new visual features such as Read optical character recognition (OCR), captioning, image classification and tagging, object detection, people detection, and smart cropping into one API. One call is all it takes to run all these features on an image. The OCR feature integrates more deeply with the Computer Vision service and includes performance improvements optimized for image scenarios, which make OCR easier to use for user interfaces and near real-time experiences.

api endpoint and ocr model, image analysis 4, preview

Industry: Information Technology > Security & Privacy (0.33)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.59)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.57)

#artificialintelligence, Oct-22-2022, 07:05:14 GMT

State of Optical Character Recognition in 2022, Part 1 (Artificial Intelligence)

Abstract: Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision.

artificial intelligence, optical character recognition, text recognition

Genre: Research Report (0.54)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.57)

Lux, Florian, Koch, Julia, Vu, Ngoc Thang

Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

arXiv.org Artificial Intelligence, Oct-21-2022

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features that can be fine-tuned on individual samples within seconds. We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently, without any degradation in quality and with high similarity to both the original voice and prosody, as our objective evaluation and human study show. All of our code and trained models are available, alongside static and interactive demos.
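One way to picture utterance-level normalization is to rescale a prosodic contour by its own utterance average, so that the shape of the contour is kept while the speaker's overall level is removed. The sketch below is an assumption about what such a normalization could look like, not the authors' implementation: the function name, the choice of dividing by the mean, and the treatment of zero-valued (unvoiced) frames are all illustrative.

```python
def normalize_by_utterance_mean(values):
    """Divide every frame value by the mean of the non-zero frames.

    Zero frames (assumed here to mark unvoiced or silent frames) are excluded
    from the mean and stay zero after normalization.
    """
    voiced = [v for v in values if v != 0.0]
    if not voiced:
        # Nothing to normalize against; return the contour unchanged.
        return list(values)
    mean = sum(voiced) / len(voiced)
    return [v / mean for v in values]


# Two hypothetical pitch contours (Hz) with the same shape but different levels.
low_voice = [0.0, 100.0, 200.0, 0.0, 300.0]
high_voice = [0.0, 200.0, 400.0, 0.0, 600.0]

print(normalize_by_utterance_mean(low_voice))   # [0.0, 0.5, 1.0, 0.0, 1.5]
print(normalize_by_utterance_mean(high_voice))  # [0.0, 0.5, 1.0, 0.0, 1.5]
```

Because both contours map to the same normalized values, the prosodic pattern can be transferred independently of the speaker's absolute pitch range, which is left to the speaker embedding.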

machine learning, natural language, prosody

2206.12229

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

#artificialintelligence, Oct-19-2022, 23:05:36 GMT

AI Inference Software Fundamentals: Getting Started with Optical Character Recognition

You can find the full source code for today's demo in a Kaggle notebook, where it is formatted as a series of very short, numbered blocks. For the sake of brevity, this post will walk through only the most significant snippets of the notebook's code. But, of course, you can study the full notebook at your leisure by block number and learn how we trained a neural network from scratch to achieve a level of accuracy not possible a decade ago. In blocks 1 to 3, the notebook sets up the Python environment for TensorFlow. In blocks 4 to 14, the notebook loads the MNIST database, which we will use to train a neural network that can recognize handwritten digits. The new and exciting part Intel offers today is how these models can be optimized to run more efficiently and quickly on Intel hardware.

artificial intelligence, machine learning, optical character recognition

Country: North America > United States (0.15)

Industry: Law (0.51)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

#artificialintelligence, Oct-15-2022, 04:40:52 GMT

This text-to-speech converter is on sale for 50% off

TL;DR: A lifetime subscription to Micmonster AI Voiceovers is on sale for £53.31, saving you 50% on list price. Whether you're a content creator, website manager, YouTuber, or a web marketer, it could benefit you to learn how to do voiceover for videos. You may be able to record yourself, but there's likely a finite amount of voice variation you have at your disposal, and it's time-consuming. However, an AI tool has fewer limits. With Micmonster AI Voiceovers, you can hear your text read in 500 voices, and it's just £53.31 for a lifetime subscription -- the best price you'll find on the internet.

lifetime subscription, micmonster ai voiceover, text-to-speech converter

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.43)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.43)
Information Technology > Artificial Intelligence > Assistive Technologies (0.43)

#artificialintelligence, Oct-12-2022, 20:21:58 GMT

What Is Hyperautomation?

Gartner has anointed "Hyperautomation" one of the top 10 trends for 2022. Is it a real trend, or just a collection of buzzwords? As a trend, it's not performing well on Google; it shows little long-term growth, if any, and gets nowhere near as many searches as terms like "Observability" and "Generative Adversarial Networks." And it's never bubbled up far enough into our consciousness to make it into our monthly Trends piece. However, that skeptical conclusion is too simplistic. Hyperautomation may just be another ploy in the game of buzzword bingo, but we need to look behind the game to discover what's important. There seems to be broad agreement that hyperautomation is the combination of Robotic Process Automation with AI. Natural language generation and natural language understanding are frequently mentioned, too, but they're subsumed under AI. So is optical character recognition (OCR), something that's old hat now but was one of the first successful applications of AI. Using AI to discover tasks that can be automated also comes up frequently. While we don't find the multiplication of buzzwords endearing, it's hard to argue that adding AI to anything is uninteresting, and specifically adding AI to automation.

application, automation, hyperautomation

Industry:

Health & Medicine (0.94)
Banking & Finance (0.93)
Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.54)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)