AITopics | Optical Character Recognition

Collaborating Authors

Optical Character Recognition

Our second example deals with a more challenging problem: the recognition of hand-printed letters of the alphabet. The characters that people print in the ordinary course of filling out forms and questionnaires are surprisingly varied. Gaps abound wherecontinuous lines might be expected; curves and sharp angles appear interchangeably; there is almost every imaginable distortion of slant, shape and size. Even human readers cannot always identify such characters; their error rate is about 3 per cent on randomly selected letters and numbers, seen out of context.
– from Oliver G. Selfridge & Ulric Neisser. PATTERN RECOGNITION BY MACHINE . In Computers & thought, Edward A. Feigenbaum and Julian Feldman (Eds.). MIT Press, Cambridge, MA, USA, 1963. pp. 8-30.

News Overviews Instructional Materials AI-Alerts Classics

ChartKG: A Knowledge-Graph-Based Representation for Chart Images

Zhou, Zhiguang, Wang, Haoxuan, Zhao, Zhengqing, Zheng, Fengling, Wang, Yongheng, Chen, Wei, Wang, Yong

arXiv.org Artificial IntelligenceOct-13-2024

Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images into raw data and often ignore their visual encodings and semantic meanings, which can result in information loss for many downstream tasks. In this paper, we propose ChartKG, a novel knowledge graph (KG) based representation for chart images, which can model the visual elements in a chart image and semantic relations among them including visual encodings and visual insights in a unified manner. Further, we develop a general framework to convert chart images to the proposed KG-based representation. It integrates a series of image processing techniques to identify visual elements and relations, e.g., CNNs to classify charts, yolov5 and optical character recognition to parse charts, and rule-based methods to construct graphs. We present four cases to illustrate how our knowledge-graph-based representation can model the detailed visual elements and semantic relations in charts, and further demonstrate how our approach can benefit downstream applications such as semantic-aware chart retrieval and chart question answering. We also conduct quantitative evaluations to assess the two fundamental building blocks of our chart-to-KG framework, i.e., object recognition and optical character recognition. The results provide support for the usefulness and effectiveness of ChartKG.

chart image, machine learning, pattern recognition, (22 more...)

arXiv.org Artificial Intelligence

2410.09761

Country:

Asia > China > Zhejiang Province > Hangzhou (0.05)
Europe > United Kingdom > England (0.04)
Europe > Ukraine (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Education (0.68)
Health & Medicine (0.46)

Technology:

Information Technology > Visualization (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
(4 more...)

Add feedback

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Neural Information Processing SystemsOct-11-2024, 12:40:42 GMT

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

human-level text-to-speech, style diffusion and adversarial training, styletts 2, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.99)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.65)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.65)

Add feedback

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing SystemsOct-11-2024, 06:22:38 GMT

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.

artificial intelligence, machine learning, optical character recognition, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.77)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.64)
Information Technology > Artificial Intelligence > Assistive Technologies (0.64)

Add feedback

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Neural Information Processing SystemsOct-11-2024, 05:25:26 GMT

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel. After analyzing two kinds of generative NAR-TTS models (VAE and normalizing flow), we find that: VAE is good at capturing the long-range semantics features (e.g., prosody) even with small model size but suffers from blurry and unnatural results; and normalizing flow is good at reconstructing the frequency bin-wise details but performs poorly when the number of model parameters is limited. Inspired by these observations, to generate diverse speech with natural details and rich prosody using a lightweight architecture, we propose PortaSpeech, a portable and high-quality generative text-to-speech model. Specifically, 1) to model both the prosody and mel-spectrogram details accurately, we adopt a lightweight VAE with an enhanced prior followed by a flow-based post-net with strong conditional inputs as the main architecture. Experimental results show that PortaSpeech outperforms other TTS models in both voice quality and prosody modeling in terms of subjective and objective evaluation metrics, and shows only a slight performance degradation when reducing the model parameters to 6.7M (about 4x model size and 3x runtime memory compression ratio compared with FastSpeech 2).

model size, portable and high-quality generative text-to-speech, portaspeech, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.86)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.86)
Information Technology > Artificial Intelligence > Assistive Technologies (0.86)

Add feedback

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Neural Information Processing SystemsOct-11-2024, 00:07:20 GMT

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional efforts from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.

dict-tts, dictionary knowledge, text-to-speech, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.89)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.89)
Information Technology > Artificial Intelligence > Assistive Technologies (0.89)

Add feedback

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Neural Information Processing SystemsOct-10-2024, 21:38:43 GMT

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.

generalizable out-of-domain text-to-speech, generspeech, style transfer, (1 more...)

Neural Information Processing Systems

Genre: Play > Prospect > Charge (0.86)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.98)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.64)
Information Technology > Artificial Intelligence > Assistive Technologies (0.64)

Add feedback

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Neural Information Processing SystemsOct-10-2024, 07:12:02 GMT

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis.

generative flow, glow-tts, monotonic alignment search, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.66)
Information Technology > Artificial Intelligence > Assistive Technologies (0.66)

Add feedback

Speech-T: Transducer for Text to Speech and Beyond

Neural Information Processing SystemsOct-10-2024, 00:46:10 GMT

Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) due to its capabilities of efficiently modeling monotonic alignments between input and output sequences and naturally supporting streaming inputs. Considering that monotonic alignments are also critical to text to speech (TTS) synthesis and streaming TTS is also an important application scenario, in this work, we explore the possibility of applying Transducer to TTS and more. However, it is challenging because it is difficult to trade off the emission (continuous mel-spectrogram prediction) probability and transition (ASR Transducer predicts blank token to indicate transition to next input) probability when calculating the output probability lattice in Transducer, and it is not easy to learn the alignments between text and speech through the output probability lattice. We propose SpeechTransducer (Speech-T for short), a Transformer based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help the alignment learning; 2) supports both full-sentence or streaming TTS by adjusting the look-ahead context; and 3) further supports both TTS and ASR together for the first time, which enjoys several advantages including fewer parameters as well as streaming synthesis and recognition in a single model. Experiments on LJSpeech datasets demonstrate that Speech-T 1) is more robust than the attention based autoregressive TTS model due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of joint modeling TTS and ASR in a single network.

alignment, probability lattice, transducer, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.63)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.60)

Add feedback

Meta-Album: Multi-domain Meta-Dataset for Few-Shot Image Classification

Neural Information Processing SystemsOct-9-2024, 21:53:46 GMT

We introduce Meta-Album, an image classification meta-dataset designed to facilitate few-shot learning, transfer learning, meta-learning, among other tasks. It includes 40 open datasets, each having at least 20 classes with 40 examples per class, with verified licences. They stem from diverse domains, such as ecology (fauna and flora), manufacturing (textures, vehicles), human actions, and optical character recognition, featuring various image scales (microscopic, human scales, remote sensing). All datasets are preprocessed, annotated, and formatted uniformly, and come in 3 versions (Micro \subset Mini \subset Extended) to match users' computational resources. The other 10 will be released shortly after.

few-shot image classification, meta-album, multi-domain meta-dataset

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.64)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.62)

Add feedback

Reviews: Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Neural Information Processing SystemsOct-8-2024, 08:47:20 GMT

This paper presents a solid piece of work on the speaker-dependent neural TTS system, building on previous works of Deep Voice and Tacotron architecture. The key idea is to learn a speaker-dependent embedding vector jointly with the neural TTS model. The paper is clearly written, and the experiments are presented well. My comments are as follows. ASR researchers later find that using fixed speaker embeddings such i-vectors can work equally well (or even better).

deep voice 2, multi-speaker neural text-to-speech, speaker adaptation, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Add feedback