Microsoft continues to expand its growing stable of AI services with more Cognitive Services application programming interfaces (APIs). On July 8, Microsoft unveiled a new health-specific text analytics API and made a couple of previously announced Cognitive Services APIs generally available. The idea behind Azure Cognitive Services is to let developers add sight, speech, search and other AI capabilities to their apps through API calls, without first having to master machine learning techniques. Microsoft has also made a number of its Cognitive Services available in containers, which helps when building AI systems that run at scale, reliably and consistently, in a way that supports better data governance, officials have said. Text Analytics for Health, which is now in preview, is meant to enable health care providers, researchers and vendors to extract insights and relationships from unstructured medical data, Microsoft officials said.
With humankind increasingly dependent on artificial intelligence, here is a list of AI platforms that are pulling the strings in the industry. For those unaware, artificial intelligence refers to the simulation of human intelligence in machines, enabling them to think the way humans do. Abilities such as problem solving, learning and critical thinking are thus carried out by machines. Artificial intelligence brings colossal potential to the table and is shaping the future of technology. It is therefore no surprise that businesses are investing more and more in a technology that promises to change the world as we know it.
English is one of the most widely used languages worldwide, with approximately 1.2 billion speakers. To maximise the performance of speech-to-text systems, it is vital to build them so that they recognise different accents. Recently, spoken dialogue systems have been incorporated into various devices such as smartphones, call services and navigation systems. These intelligent agents can help users perform daily tasks such as booking tickets, setting up calendar items or finding restaurants through spoken interaction. In the future they could be used far more widely across a vast range of applications, especially in the education, government, healthcare and entertainment sectors.
In the latest in this series of posts, researchers from the EU-funded COMPRISE project write about privacy issues associated with voice assistants. They propose possible ways to keep users' data private whilst ensuring that manufacturers can still access the high-quality usage data vital for improving the functionality of their products.
"Tell me what you read and I tell you who you are." (Pierre de La Gorce)
Voice assistants such as Alexa, Siri and Google Assistant are becoming increasingly popular. Some users, however, worry that their vocal interactions with these devices are being stored in the cloud, together with a textual transcript of every spoken word. But is there an actual threat associated with the collection of these data?
Machine learning algorithms have come a long way since the basic AI found in 1990s video games. Over the past several years, AI-powered voice assistants such as Siri, Alexa and Google Assistant have improved significantly, shifting the way we interact with AI from on-screen input to voice control. It is hard to comprehend just how far research and knowledge in artificial intelligence have advanced. Because we are surrounded by artificial intelligence in our day-to-day lives, it is no surprise that we rarely recognise the many hurdles overcome in building complex machine learning algorithms in recent years. The image below shows the remarkable advances made in the field of machine learning.
Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the data-scarce languages. However, most state-of-the-art multilingual models require language information to be encoded and are therefore less flexible and scalable when expanding to new languages. Language-independent multilingual models help to address this issue, and are also better suited to multicultural societies where several languages are frequently used together (but often rendered with different writing systems). In this paper, we propose a new approach to building a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer. Thus, similar-sounding acoustics are mapped to a single, canonical target sequence of graphemes, effectively separating the modeling and rendering problems. We show with four Indic languages, namely Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
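The abstract reports its result as a relative reduction in Word Error Rate. As a rough illustration (our own, not code from the paper), WER is conventionally computed as the word-level edit distance between a reference transcript and the recogniser's hypothesis, divided by the number of reference words. A relative reduction then expresses the gap between two WERs as a fraction of the baseline. The example sentences and WER values below are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein (edit) distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match if cost is 0)
            )
    return dp[len(ref)][len(hyp)] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Improvement of new_wer over baseline_wer as a fraction of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# One deleted word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# A hypothetical drop from 20% to 18% WER is a 10% relative reduction.
print(relative_reduction(0.20, 0.18))
```

Note that a drop from 20% to 18% is only 2 points in absolute terms but 10% in relative terms. This is why relative and absolute WER figures should not be compared directly.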
I really like the expression "being bitten by the SOTA bug". In a nutshell, it means that if a large group of people focuses on pursuing a top result on some abstract metric, that metric loses its meaning (a classic manifestation of Goodhart's Law). The exact reason why this happens differs each time and may be very technical, but in ML it usually means that models overfit to some hidden intrinsic qualities of the datasets used to calculate the metrics. For example, in computer vision (CV) such patterns are usually clusters of visually similar images. A small, idealistic, under-the-radar community pursuing an academic or scientific goal is much less prone to falling victim to Goodhart's Law than a larger and more popular community. Once a certain degree of popularity is reached, the community starts chasing metrics or virtue signalling (showing off one's moral values for its own sake when no real effort is required), and real progress stops until some crisis arrives. This is what it means to be bitten by the SOTA bug. For example, in the field of Natural Language Processing this attitude has led to irrational over-investment in huge models optimized on public academic benchmarks, but the usefulness of such "progress" is very limited for a number of reasons:
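The toy simulation below is our own illustration and not something from the original post. It shows one simple mechanism by which a benchmark score loses its meaning: when many models are compared on the same fixed test set and only the top scorer is kept, that model's score overstates its real ability. Every "model" here guesses at random, so its true accuracy is 50%, and any apparent advantage on the public set is pure noise. The set sizes, model count and seed are arbitrary assumptions.

```python
import random


def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def simulate_leaderboard(n_models=200, n_public=1000, n_private=1000, seed=0):
    """Pick the best of n_models random guessers on a public benchmark and
    report its public score alongside its score on a fresh held-out set."""
    rng = random.Random(seed)
    # Binary labels for the shared "public" benchmark and for an unseen set.
    public_labels = [rng.randint(0, 1) for _ in range(n_public)]
    private_labels = [rng.randint(0, 1) for _ in range(n_private)]

    best_public, best_private = -1.0, None
    for _ in range(n_models):
        # Each model guesses at random on both sets.
        public_preds = [rng.randint(0, 1) for _ in range(n_public)]
        private_preds = [rng.randint(0, 1) for _ in range(n_private)]
        score = accuracy(public_preds, public_labels)
        if score > best_public:  # keep only the leaderboard winner
            best_public = score
            best_private = accuracy(private_preds, private_labels)
    return best_public, best_private


public_score, held_out_score = simulate_leaderboard()
print(f"public: {public_score:.3f}  held-out: {held_out_score:.3f}")
```

With these defaults, the selected model typically scores a few points above 50% on the public set while staying close to 50% on the held-out set. The gap comes entirely from selecting on the benchmark itself.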
We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.
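The abstract notes that lower perplexity did not translate into better ASR performance. Perplexity is conventionally defined as the exponential of the average negative log-probability that a language model assigns to each token of held-out text, so lower is better. The sketch below implements that standard definition. It is not the authors' evaluation code, and the input probabilities are invented for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p(w_i | history))) over N tokens.

    token_probs holds, in order, the probability the language model
    assigned to each token of a held-out text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# A model that spreads its probability evenly over 4 choices at every step
# has a perplexity of 4: it is as "confused" as a uniform 4-way guess.
print(perplexity([0.25] * 8))
```

Because perplexity only measures how well the model predicts text, it can fall without helping the recogniser choose between acoustically confusable hypotheses. This is one plausible reason the two measures can diverge.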
Edge intelligence refers to a set of connected systems and devices that use artificial intelligence for data collection, caching, processing and analysis in locations close to where the data is captured. The aim of edge intelligence is to improve the quality and speed of data processing and to protect the privacy and security of the data. Although the field emerged only recently, with work spanning the period from 2011 to now, it has grown explosively over the past five years. In this paper, we present a thorough survey of the literature on edge intelligence. We first identify four fundamental components of edge intelligence, namely edge caching, edge training, edge inference and edge offloading, based on theoretical and practical results from proposed and deployed systems. We then systematically classify the state-of-the-art solutions by examining research results and observations for each of the four components, and present a taxonomy that covers practical problems, adopted techniques and application goals. For each category, we elaborate on, compare and analyse the literature in terms of adopted techniques, objectives, performance, advantages and drawbacks. This survey article provides a comprehensive introduction to edge intelligence and its application areas. In addition, we summarise the development of this emerging research field and its current state of the art, and discuss important open issues and possible theoretical and technical solutions.
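To make "edge offloading" concrete, here is a minimal sketch of the textbook latency-based decision. The task runs locally unless uploading its input and executing it on a faster edge server would finish sooner. This simplified model is our own assumption and not a method from the survey. It ignores energy, queueing and the time to return results, and all numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    input_bits: float  # data that must be uploaded if the task is offloaded
    cpu_cycles: float  # computation the task requires


def local_latency(task: Task, device_hz: float) -> float:
    """Seconds to run the task on the device itself."""
    return task.cpu_cycles / device_hz


def offload_latency(task: Task, uplink_bps: float, edge_hz: float) -> float:
    """Upload time plus execution time on the edge server. Returning the
    result is ignored in this simplified model."""
    return task.input_bits / uplink_bps + task.cpu_cycles / edge_hz


def should_offload(task: Task, device_hz: float, uplink_bps: float, edge_hz: float) -> bool:
    """Offload only if doing so finishes sooner than local execution."""
    return offload_latency(task, uplink_bps, edge_hz) < local_latency(task, device_hz)


# Hypothetical task: 1 MB of input and 2 billion CPU cycles.
task = Task(input_bits=8e6, cpu_cycles=2e9)
print(should_offload(task, device_hz=1e9, uplink_bps=20e6, edge_hz=10e9))  # fast link
print(should_offload(task, device_hz=1e9, uplink_bps=2e6, edge_hz=10e9))   # slow link
```

On the fast link, offloading takes about 0.6 s against 2 s locally. On the slow link, the 4 s upload alone makes local execution the better choice. This trade-off between communication and computation is what offloading research tries to optimise.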
Automatic Speech Recognition (ASR) has grown in popularity in recent years. The evolution of processor and storage technologies has enabled more advanced ASR mechanisms, fueling the development of virtual assistants such as Amazon Alexa, Apple Siri, Microsoft Cortana and Google Home. Interest in such assistants has, in turn, spurred new developments in ASR research. However, despite this popularity, there has been no detailed analysis of the training efficiency of modern ASR systems. This stems mainly from three factors: the proprietary nature of many modern applications that depend on ASR, like the ones listed above; the relatively expensive co-processor hardware that big vendors use to accelerate ASR for such applications; and the absence of well-established benchmarks. The goal of this paper is to address the latter two of these challenges. The paper first describes an ASR model, based on a deep neural network inspired by recent work in this domain, and our experiences building it. We then evaluate this model on three CPU-GPU co-processor platforms that represent different budget categories. Our results demonstrate that hardware acceleration yields good results even without high-end equipment. While the most expensive platform (10X the price of the least expensive one) converges to the initial accuracy target 10-30% and 60-70% faster than the other two, the differences among the platforms almost disappear at slightly higher accuracy targets. In addition, our results highlight both the difficulty of evaluating ASR systems, given the complex, long and resource-intensive nature of model training in this domain, and the importance of establishing benchmarks for ASR.
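The platform comparison above is framed as time to reach an accuracy target. The helpers below are a minimal sketch of how such a metric can be computed from logged training checkpoints. They are not the paper's tooling. We assume "X% faster" means X% less wall-clock time to reach the target, which is one common reading; the paper may define it differently. The training curves are invented.

```python
from typing import Optional, Sequence, Tuple


def time_to_accuracy(curve: Sequence[Tuple[float, float]], target: float) -> Optional[float]:
    """Return the first elapsed time at which accuracy reaches target.

    curve is a time-ordered list of (elapsed_hours, accuracy) checkpoints.
    Returns None if the target is never reached.
    """
    for elapsed, acc in curve:
        if acc >= target:
            return elapsed
    return None


def percent_faster(fast_time: float, slow_time: float) -> float:
    """Time saved by the faster platform, as a percentage of the slower one's time."""
    return 100.0 * (slow_time - fast_time) / slow_time


# Hypothetical checkpoints for an expensive and a budget platform.
expensive = [(1.0, 0.60), (2.0, 0.72), (3.0, 0.80)]
budget = [(1.0, 0.50), (2.0, 0.65), (3.0, 0.75), (4.0, 0.81)]

t_fast = time_to_accuracy(expensive, 0.80)  # 3.0 hours
t_slow = time_to_accuracy(budget, 0.80)     # 4.0 hours
print(percent_faster(t_fast, t_slow))       # 25.0
```

Measuring at several targets rather than one matters here. As the abstract notes, a platform's advantage at the initial target can shrink or vanish at slightly higher targets.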