 Pattern Recognition


Mozilla's open source voice recognition tool nears human-like accuracy

#artificialintelligence

The free-software company also on Wednesday released a first set of crowdsourced recordings under its Common Voice project, designed to let anyone train and test machine learning algorithms to recognize speech. The dataset includes almost 400,000 downloadable samples, adding up to 500 hours of speech. More than 20,000 people from around the world have contributed to a call for recordings, which Mozilla hopes will help future voice-powered systems fluently understand a wide variety of accents and types of speech. "We at Mozilla believe technology should be open and accessible to all, and that includes voice," Mozilla Senior Vice President of Emerging Technologies Sean White wrote in a blog post. The speech recognition tool, called DeepSpeech, has an impressive per-word error rate of about 6.5%, ahead of the company's stated goal of 10%, but still shy of Microsoft's achievement this year of 5.5%.
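The per-word error rate quoted above is conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The article does not say exactly how Mozilla scored DeepSpeech, so the following is only a minimal sketch of that conventional definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which real evaluations usually refine (casing, punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 6.5% error rate corresponds to roughly one wrong, missing, or extra word for every 15 reference words.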


SEE: Towards Semi-Supervised End-to-End Scene Text Recognition

AAAI Conferences

Detecting and recognizing text in natural scene images is a challenging task that is not yet completely solved. In recent years, several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast, we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.


Towards Affordable Semantic Searching: Zero-Shot Retrieval via Dominant Attributes

AAAI Conferences

Instance-level retrieval has become an essential paradigm to index and retrieve images from large-scale databases. Conventional instance search requires at least one example of the query image to retrieve images that contain the same object instance. Existing semantic retrieval can only search semantically-related images, such as those sharing the same category or a set of tags, not the exact instances. Moreover, it rests on the unrealistic assumption that all categories or tags are known beforehand. Training models for these semantic concepts relies heavily on instance-level attributes or human captions, which are expensive to acquire. Given the above challenges, this paper studies the Zero-shot Retrieval problem, which aims for instance-level image search using only a few dominant attributes. The contributions are: 1) we utilise automatic word embedding to infer class-level attributes to circumvent expensive human labelling; 2) the inferred class-attributes can be extended into discriminative instance attributes through our proposed Latent Instance Attributes Discovery (LIAD) algorithm; 3) our method is not restricted to complete attribute signatures: queries consisting only of dominant attributes can also be handled. On two benchmarks, CUB and SUN, extensive experiments demonstrate that our method can achieve promising performance for the problem. Moreover, our approach can also benefit conventional zero-shot learning (ZSL) tasks.


SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network

AAAI Conferences

A new approach for real-time scene text recognition is proposed in this paper. It combines a novel binary convolutional encoder-decoder network (B-CEDNet) with a bidirectional recurrent neural network (Bi-RNN). The B-CEDNet is engaged as a visual front-end to provide elaborated character detection, and a back-end Bi-RNN performs character-level sequential correction and classification based on learned contextual knowledge. The front-end B-CEDNet can process multiple regions containing characters using a one-off forward operation, and is trained under binary constraints with significant compression. Hence it leads to both a remarkable inference run-time speedup and a reduction in memory usage. With the elaborated character detection, the back-end Bi-RNN merely processes a low-dimensional feature sequence with category and spatial information of extracted characters for sequence correction and classification. By training with over 1,000,000 synthetic scene text images, the B-CEDNet achieves a recall rate of 0.86, precision of 0.88 and F-score of 0.87 on ICDAR-03 and ICDAR-13. With the correction and classification by Bi-RNN, the proposed real-time scene text recognition achieves state-of-the-art accuracy while consuming less than 1 ms of inference run-time. The processing flow is realized on GPU with a small network size of 1.01 MB for B-CEDNet and 3.23 MB for Bi-RNN, which is much faster and smaller than the existing solutions.
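The reported recall, precision, and F-score can be cross-checked with a few lines of arithmetic. The abstract does not state which F-score variant is used; the sketch below assumes the balanced F1 score (the harmonic mean of precision and recall), and the helper name `f_score` and its `beta` parameter are illustrative, not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is the balanced F1 variant.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values quoted in the abstract: precision 0.88, recall 0.86.
reported = round(f_score(0.88, 0.86), 2)
```

With those inputs the harmonic mean rounds to 0.87, which is consistent with the figure in the abstract under the F1 assumption.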


Recurrent Attentional Reinforcement Learning for Multi-Label Image Recognition

AAAI Conferences

Recognizing multiple labels of images is a fundamental but challenging task in computer vision, and remarkable progress has been attained by localizing semantic-aware image regions and predicting their labels with deep convolutional neural networks. The step of hypothesis region (region proposal) localization in these existing multi-label image recognition pipelines, however, usually incurs redundant computation cost, e.g., generating hundreds of meaningless proposals with non-discriminative information and extracting their features, and the modeling of spatial contextual dependencies among the localized regions is often ignored or over-simplified. To resolve these issues, this paper proposes a recurrent attention reinforcement learning framework that iteratively discovers a sequence of attentional and informative regions related to different semantic objects and then predicts label scores conditioned on these regions. In addition, our method explicitly models long-term dependencies among these attentional regions, which helps to capture semantic label co-occurrence and thus facilitates multi-label recognition. Extensive experiments and comparisons on two large-scale benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model is superior to existing state-of-the-art methods in both performance and efficiency, and that it explicitly associates image-level semantic labels with specific object regions.


Dilated FCN for Multi-Agent 2D/3D Medical Image Registration

AAAI Conferences

2D/3D image registration to align a 3D volume and 2D X-ray images is a challenging problem due to its ill-posed nature and various artifacts present in 2D X-ray images. In this paper, we propose a multi-agent system with an auto attention mechanism for robust and efficient 2D/3D image registration. Specifically, an individual agent is trained with a dilated Fully Convolutional Network (FCN) to perform registration in a Markov Decision Process (MDP) by observing a local region, and the final action is then taken based on the proposals from multiple agents, weighted by their corresponding confidence levels. The contributions of this paper are threefold. First, we formulate 2D/3D registration as an MDP with observations, actions, and rewards properly defined with respect to X-ray imaging systems. Second, to handle various artifacts in 2D X-ray images, multiple local agents are employed efficiently via FCN-based structures, and an auto attention mechanism is proposed to favor the proposals from regions with more reliable visual cues. Third, a dilated FCN-based training mechanism is proposed to significantly reduce the degrees of freedom in the simulation of the registration environment, and to improve training efficiency by an order of magnitude compared to a standard CNN-based training method. We demonstrate that the proposed method achieves high robustness on both spine cone beam Computed Tomography data with a low signal-to-noise ratio and data from minimally invasive spine surgery where severe image artifacts and occlusions are present due to metal screws and guide wires, outperforming other state-of-the-art methods (single agent-based and optimization-based) by a large margin.
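The abstract says the final action is chosen from the proposals of multiple agents, weighted by their confidence levels, but it does not give the aggregation rule. One simple way to read that sentence is confidence-weighted voting over a discrete action set, sketched below. The function name `aggregate_proposals`, the string action labels, and the summation rule are all assumptions for illustration and are not the paper's method.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def aggregate_proposals(proposals: Iterable[Tuple[Hashable, float]]) -> Hashable:
    """Pick the action with the largest total confidence across agents.

    Each proposal is an (action, confidence) pair from one local agent.
    Hypothetical aggregation rule: sum the confidences per action, then
    take the argmax.
    """
    totals = defaultdict(float)
    for action, confidence in proposals:
        if confidence < 0:
            raise ValueError("confidence must be non-negative")
        totals[action] += confidence
    if not totals:
        raise ValueError("at least one proposal is required")
    return max(totals, key=totals.get)


# Four local agents voting on a translation step (labels are made up).
chosen = aggregate_proposals([("+tx", 0.9), ("-tx", 0.4), ("-tx", 0.3), ("+ty", 0.2)])
```

In this example the single high-confidence agent outweighs two low-confidence agents that agree with each other, which is the kind of behavior an attention mechanism favoring reliable regions would aim for.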


EMD Metric Learning

AAAI Conferences

Earth Mover's Distance (EMD), which measures many-to-many distances, has shown its superiority and has been widely applied in computer vision tasks such as object recognition, hyperspectral image classification and gesture recognition. However, little effort has so far been devoted to optimizing the EMD metric for better matching performance. To tackle this issue, we propose an EMD metric learning algorithm in this paper. In our method, the objective is to learn a discriminative distance metric for EMD ground distance matrix generation that can better measure the similarity between compared subjects. More specifically, given a group of labeled data from different categories, we first select a subset of training data and then optimize the metric for ground distance matrix generation. Here, the EMD metric and the EMD flow-network are alternately optimized until a steady EMD value is achieved. This method is able to generate a discriminative ground distance matrix, which can further improve the EMD distance measurement. We then apply our EMD metric learning method to two tasks, i.e., multi-view object classification and document classification. The experimental results show that our proposed EMD metric learning method performs better than the traditional EMD method and the state-of-the-art methods. We note that the proposed EMD metric learning method can also be used in other applications.


Interpreting CNN Knowledge via an Explanatory Graph

AAAI Conferences

This paper learns a graphical model, namely an explanatory graph, which reveals the knowledge hierarchy hidden inside a pre-trained CNN. Considering that each filter in a conv-layer of a pre-trained CNN usually represents a mixture of object parts, we propose a simple yet efficient method to automatically disentangle different part patterns from each filter and construct an explanatory graph. In the explanatory graph, each node represents a part pattern, and each edge encodes co-activation relationships and spatial relationships between patterns. More importantly, we learn the explanatory graph for a pre-trained CNN in an unsupervised manner, i.e., without needing to annotate object parts. Experiments show that each graph node consistently represents the same object part across different images. We transfer part patterns in the explanatory graph to the task of part localization, and our method significantly outperforms other approaches.


Strengthening the governance of AI

#artificialintelligence

There are opportunities across all functional areas for AI to augment and support human decision making, even in the boardroom. Back in 2014, Deep Knowledge Ventures (DKV) made headlines when it appointed the first artificial intelligence (AI) algorithm, aptly named Vital, to its board of directors. Now more than three years out, I was curious to see how Vital fared – did it retain its seat at the boardroom table or get relegated back to the lab due to a failed investment? To my surprise, Vital has not only retained its seat, but has been credited with keeping DKV out of bankruptcy. Through its pattern recognition algorithm, Vital assessed risk factors across more than 50 parameters and recommended investments not identified by its human counterparts.


ClassSim: Similarity between Classes Defined by Misclassification Ratios of Trained Classifiers

arXiv.org Machine Learning

Deep neural networks (DNNs) have achieved exceptional performance in many tasks, particularly in supervised classification tasks. However, achievements in supervised classification tasks are based on large datasets with well-separated classes. Typically, real-world applications involve wild datasets that include similar classes; thus, evaluating similarities between classes and understanding relations among classes are important. To address this issue, a similarity metric, ClassSim, based on the misclassification ratios of trained DNNs is proposed herein. We conducted image recognition experiments to demonstrate that the proposed method provides better similarities compared with existing methods and is useful for classification problems. Source code including all experimental results is available at https://github.com/karino2/ClassSim/.
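The abstract describes ClassSim only as a similarity based on the misclassification ratios of trained classifiers; the exact definition is in the paper and the linked repository. To illustrate the general idea, the sketch below derives a symmetric class-to-class score from an ordinary confusion matrix by averaging the two directed misclassification ratios. This averaging rule and the function name `misclassification_similarity` are assumptions for illustration, not the ClassSim definition.

```python
from typing import List, Sequence


def misclassification_similarity(confusion: Sequence[Sequence[int]]) -> List[List[float]]:
    """Symmetric similarity built from pairwise misclassification ratios.

    confusion[i][j] counts samples of true class i predicted as class j.
    Hypothetical rule: similarity(i, j) is the mean of the fraction of
    class-i samples predicted as j and the fraction of class-j samples
    predicted as i. The diagonal is fixed at 1.0.
    """
    n = len(confusion)
    row_totals = [sum(row) for row in confusion]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            # Guard against classes that have no samples at all.
            ratio_ij = confusion[i][j] / row_totals[i] if row_totals[i] else 0.0
            ratio_ji = confusion[j][i] / row_totals[j] if row_totals[j] else 0.0
            sim[i][j] = (ratio_ij + ratio_ji) / 2.0
    return sim


# Toy confusion matrix: classes 0 and 1 are often confused, class 2 never is.
toy_similarity = misclassification_similarity([[8, 2, 0], [4, 6, 0], [0, 0, 10]])
```

In the toy matrix, classes 0 and 1 receive a score of about 0.3 while class 2 scores 0.0 against both, matching the intuition that classes a classifier confuses more often are more similar.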