Google Brain Co-Founder Teams With Foxconn to Bring AI to Factories

#artificialintelligence

Consumers now experience AI mostly through image recognition that helps categorize digital photographs and speech recognition that powers digital voice assistants such as Apple Inc's Siri or Amazon.com Inc's Alexa. … But at a press briefing in San Francisco two days before Ng's Landing.ai announcement … In many factories, workers look over parts coming off an assembly line for defects. Ng showed a video in which a worker instead put a circuit board beneath a digital camera connected to a computer, and the computer identified a defect in the part. Ng said that while typical computer vision systems might require thousands of sample images to become "trained," Landing.ai's …


The IBM Speaker Recognition System: Recent Advances and Error Analysis

arXiv.org Machine Learning

We present recent advances along with an error analysis of the IBM speaker recognition system for conversational speech. Key advancements that contribute to our system include: a nearest-neighbor discriminant analysis (NDA) approach (as opposed to LDA) for intersession variability compensation in the i-vector space, the application of speaker- and channel-adapted features derived from an automatic speech recognition (ASR) system for speaker recognition, and the use of a DNN acoustic model with a very large number of output units (~10k senones) to compute the frame-level soft alignments required in the i-vector estimation process. We evaluate these techniques on the NIST 2010 SRE extended core conditions (C1-C9) as well as the 10sec-10sec condition. To our knowledge, the results achieved by our system represent the best performance published to date on these conditions. For example, on the extended tel-tel condition (C5) the system achieves an EER of 0.59%. To better understand the remaining errors on C5, we examine the recordings associated with the low-scoring target trials and identify various issues in the problematic recordings. Interestingly, we observe that correcting the pathological recordings improves the scores not only for the target trials but also for the nontarget trials.
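
The third ingredient above, DNN senone posteriors standing in for GMM-UBM component posteriors as the frame-level soft alignments, reduces at the statistics-accumulation step of i-vector estimation to a simple weighted sum. A minimal NumPy sketch, assuming hypothetical `features` and `posteriors` arrays computed elsewhere (the IBM front end itself is not described in the abstract):

    import numpy as np

    def baum_welch_stats(features, posteriors):
        # features:   (T, D) array of acoustic frames
        # posteriors: (T, C) frame-level soft alignments; here C would be
        #             the ~10k DNN senones rather than GMM components
        N = posteriors.sum(axis=0)   # (C,)   zeroth-order statistics
        F = posteriors.T @ features  # (C, D) first-order statistics
        return N, F

With ~10k senones these statistics are far higher-dimensional than with a typical 2048-component UBM, which is part of what makes this variant computationally heavy.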


Can We Use Speaker Recognition Technology to Attack Itself? Enhancing Mimicry Attacks Using Automatic Target Speaker Selection

arXiv.org Machine Learning

We consider technology-assisted mimicry attacks in the context of automatic speaker verification (ASV). We use ASV itself to select targeted speakers to be attacked by human-based mimicry. We recorded 6 naive mimics, for whom we selected target celebrities from the VoxCeleb1 and VoxCeleb2 corpora (7,365 potential targets) using an i-vector system. The attacker attempts to mimic the selected target, with the utterances subjected to ASV tests using an independently developed x-vector system. Our main finding is negative: even if some of the attacker scores against the target speakers were slightly increased, our mimics did not succeed in spoofing the x-vector system. Interestingly, however, the relative ordering of the selected targets (closest, furthest, median) is consistent between the systems, which suggests some level of transferability between them.
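
The automatic target-selection step can be pictured as ordinary embedding scoring: score the attacker's voice against every candidate in the pool and keep the closest, median, and furthest matches. A minimal sketch using cosine similarity (illustrative only; the paper's i-vector system may use a different scoring backend such as PLDA):

    import numpy as np

    def select_targets(attacker_vec, target_vecs):
        # attacker_vec: (D,)   speaker embedding of the mimic
        # target_vecs:  (N, D) embeddings of the candidate target pool
        a = attacker_vec / np.linalg.norm(attacker_vec)
        t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
        scores = t @ a                    # cosine similarity per target
        order = np.argsort(scores)[::-1]  # best-matching target first
        # closest, median, furthest: the three conditions studied above
        return order[0], order[len(order) // 2], order[-1]

The transferability observation then amounts to this ranking being broadly preserved when the i-vector scores are replaced by x-vector scores.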


Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

arXiv.org Machine Learning

Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions, according to the results obtained on early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the growing interest in virtual assistants (such as Amazon Alexa, Google Home, and Apple Siri), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks. This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time-Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. The obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both long and short utterances. We also investigate the impact of the speech activity detector, different scoring models, and adaptation and score normalization techniques. Experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.
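
Of the techniques listed, score normalization is the easiest to make concrete. A minimal sketch of symmetric S-norm, one common variant (the paper may instead use adaptive S-norm, which averages over only the top-scoring cohort subset):

    import numpy as np

    def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
        # raw_score:            score for the (enrollment, test) trial
        # enroll_cohort_scores: enrollment utterance vs. impostor cohort
        # test_cohort_scores:   test utterance vs. the same cohort
        z = (raw_score - enroll_cohort_scores.mean()) / enroll_cohort_scores.std()
        t = (raw_score - test_cohort_scores.mean()) / test_cohort_scores.std()
        return 0.5 * (z + t)  # symmetric average of Z-norm and T-norm

Normalizing each trial score against cohort statistics makes scores comparable across speakers and channels, which matters most in exactly the mismatched far-field conditions this paper targets.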