Information Retrieval
The Use of NLP to Extract Unstructured Medical Data From Text - insideBIGDATA
In healthcare, much of the information needed to make accurate predictions and recommendations is available only in free-text clinical notes, trapped in unstructured form. Because healthcare decisions depend on this data, it must be extracted in a way that allows it to be analyzed and used. State-of-the-art NLP algorithms can extract clinical data from text using deep learning techniques such as healthcare-specific word embeddings, named entity recognition models, and entity resolution models.
Can China's new AI news anchors give Anderson Cooper a run for his money?
China's state-owned Xinhua News Agency introduced so-called "composite anchors" on Wednesday, combining the images and voices of human anchors with artificial intelligence (AI) technology. The new AI anchors, launched by Xinhua and Beijing-based search engine operator Sogou during the World Internet Conference in Wuzhen, can deliver the news with "the same effect" as human anchors because the machine learning programme is able to synthesise realistic-looking speech, lip movements and facial expressions, according to a Xinhua news report on Wednesday. "AI anchors have officially become members of the Xinhua News Agency reporting team. They will work with other anchors to bring you authoritative, timely and accurate news information in both Chinese and English," Xinhua said. The AI anchors are now available throughout Xinhua's internet and mobile platforms such as its official Chinese and English apps, WeChat public account, and online TV webpage.
Analysis of Google's New Schema Speakable Markup - Search Engine Journal
Google announced official support for the Schema.org speakable specification. The speakable specification will help Google Assistant and Google Home choose which content to read aloud. This new structured data markup is important because it may point to what you'll need to know to get more traffic if and when Google expands this structured data to all websites. Support for the new markup is currently limited to news content. However, support for the speakable attribute will likely expand as Google gains experience with this new structured data markup.
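As a rough illustration of what speakable markup can look like, the sketch below builds a JSON-LD object for a news article in Python. The `speakable` property, the `SpeakableSpecification` type, and its `cssSelector` field come from Schema.org; the helper name `speakable_jsonld`, the headline, the URL, and the CSS selectors are hypothetical placeholders, not values from the article.

```python
import json


def speakable_jsonld(headline, url, css_selectors):
    """Build a Schema.org NewsArticle dict that marks sections as speakable.

    Hypothetical helper for illustration; the selectors must match
    elements that actually exist on the page being described.
    """
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "url": url,
        "speakable": {
            "@type": "SpeakableSpecification",
            # CSS selectors pointing at the text a voice assistant may read.
            "cssSelector": list(css_selectors),
        },
    }


# Placeholder values, assumed for the example only.
markup = speakable_jsonld(
    "Example headline",
    "https://example.com/news/example",
    [".headline", ".summary"],
)

# The serialized object would typically be embedded in the page inside a
# <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```

Selecting content by CSS selector keeps the markup separate from the article text; Schema.org also allows an `xpath` field for the same purpose.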
Satyam: Democratizing Groundtruth for Machine Vision
Qiu, Hang, Chintalapudi, Krishna, Govindan, Ramesh
The democratization of machine learning (ML) has led to ML-based machine vision systems for autonomous driving, traffic monitoring, and video surveillance. However, true democratization cannot be achieved without greatly simplifying the process of collecting groundtruth for training and testing these systems. This groundtruth collection is necessary to ensure good performance under varying conditions. In this paper, we present the design and evaluation of Satyam, a first-of-its-kind system that enables a layperson to launch groundtruth collection tasks for machine vision with minimal effort. Satyam leverages a crowdtasking platform, Amazon Mechanical Turk, and automates several challenging aspects of groundtruth collection: creating and launching of custom web-UI tasks for obtaining the desired groundtruth, controlling result quality in the face of spammers and untrained workers, adapting prices to match task complexity, filtering spammers and workers with poor performance, and processing worker payments. We validate Satyam using several popular benchmark vision datasets, and demonstrate that groundtruth obtained by Satyam is comparable to that obtained from trained experts and provides matching ML performance when used for training.
Experimentation & Measurement for Search Engine Optimization
For many of our potential guests, planning a trip starts at the search engine. At Airbnb, we want our product to be painless to find for past guests, and easy to discover for new ones. Search engine optimization (SEO) is the process of improving our site, and more specifically our landing pages, to ensure that when a traveller looks for accommodations for their next trip, Airbnb is one of the top results on their favorite search engine. Search engines such as Google, Yahoo, Naver, and Baidu deploy their own fleet of "bots" across the internet to build a map of the web and scrape information, or "index", from the pages that they hit. When indexing pages and ranking them for specific search queries, search engines will take into account a variety of factors, including relevance, site performance, and authority.
SimplerVoice: A Key Message & Visual Description Generator System for Illiteracy
Nguyen, Minh N. B., Thomas, Samuel, Gattiker, Anne E., Kashyap, Sujatha, Varshney, Kush R.
We introduce SimplerVoice: a key message and visual description generator system to help low-literate adults navigate the information-dense world with confidence, on their own. SimplerVoice can automatically generate sensible sentences describing an unknown object, extract the semantic meaning of the object's usage in the form of a query string, and then represent the string as multiple types of visual guidance (pictures, pictographs, etc.). We demonstrate the SimplerVoice system in a case study of generating manuals for grocery products through a mobile application. To evaluate it, we conducted a user study comparing SimplerVoice's generated descriptions with the information users interpreted from other methods: the original product package and search engines' top result. SimplerVoice achieved the highest performance score: 4.82 on a 5-point mean opinion score scale. Our results show that SimplerVoice is able to provide low-literate end-users with simple yet informative components to help them understand how to use the grocery products, and that the system may potentially provide benefits in other real-world use cases.
Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs
Maheshwari, Gaurav, Trivedi, Priyansh, Lukovnikov, Denis, Chakraborty, Nilesh, Fischer, Asja, Lehmann, Jens
In this paper, we conduct an empirical investigation of neural query graph ranking approaches for the task of complex question answering over knowledge graphs. We experiment with six different ranking models and propose a novel self-attention based slot matching model which exploits the inherent structure of query graphs, our logical form of choice. Our proposed model generally outperforms the other models on two QA datasets over the DBpedia knowledge graph, evaluated in different settings. In addition, we show that transfer learning from the larger of those QA datasets to the smaller dataset yields substantial improvements, effectively offsetting the general lack of training data.
What is Text Clustering? - insideBIGDATA
Automatic document organization, topic extraction, information retrieval and filtering all have one thing in common. They require text clustering (sometimes also known as document clustering) to be done quickly and accurately. If you've never heard of text clustering, this post will explain what it is, what it does, and how it's currently being used to aid businesses. We'll also briefly discuss how a business could employ text clustering too! First, let's define text clustering.
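To make the idea concrete, here is a minimal sketch of one very simple way to cluster short texts: represent each document as a bag of words, compare documents by cosine similarity, and assign each one to the most similar existing cluster or start a new one. This single-pass "leader" scheme, the `threshold` value, and the sample documents are assumptions chosen for illustration; the post itself does not prescribe a particular algorithm, and production systems usually rely on stronger methods.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering against each cluster's first document.

    Returns a list of clusters, each a list of document indices.
    """
    vectors = [vectorize(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            # The first member acts as the cluster's representative.
            sim = cosine(vec, vectors[members[0]])
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])  # nothing similar enough: new cluster
        else:
            best.append(i)
    return clusters


# Hypothetical sample documents.
docs = [
    "cheap flights to paris",
    "paris flights cheap deals",
    "python list sorting",
    "sorting a list in python",
]
print(cluster(docs))  # → [[0, 1], [2, 3]]
```

The threshold controls granularity: a higher value yields more, tighter clusters, while a lower value merges loosely related documents.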
NPRF: A Neural Pseudo Relevance Feedback Framework for Ad-hoc Information Retrieval
Li, Canjia, Sun, Yingfei, He, Ben, Wang, Le, Hui, Kai, Yates, Andrew, Sun, Le, Xu, Jungang
Pseudo-relevance feedback (PRF) is commonly used to boost the performance of traditional information retrieval (IR) models by using top-ranked documents to identify and weight new query terms, thereby reducing the effect of query-document vocabulary mismatches. While neural retrieval models have recently demonstrated strong results for ad-hoc retrieval, combining them with PRF is not straightforward due to incompatibilities between existing PRF approaches and neural architectures. To bridge this gap, we propose an end-to-end neural PRF framework that can be used with existing neural IR models by embedding different neural models as building blocks. Extensive experiments on two standard test collections confirm the effectiveness of the proposed NPRF framework in improving the performance of two state-of-the-art neural IR models.
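The classic, non-neural form of pseudo-relevance feedback described in the first sentence can be sketched as follows. This is not the NPRF framework proposed in the paper: it is a toy term-expansion loop that assumes a simple term-count scoring function, and the function names, parameters (`top_k`, `n_terms`), and sample documents are all hypothetical.

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Toy relevance score: total occurrences of query terms in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)


def rank(query_terms, docs):
    """Document indices by descending score; ties keep their original order."""
    return sorted(range(len(docs)), key=lambda i: -score(query_terms, docs[i]))


def expand_query(query_terms, docs, top_k=2, n_terms=1):
    """Treat the top_k ranked documents as relevant and add their most
    frequent terms that are not already in the query."""
    top = rank(query_terms, docs)[:top_k]
    pool = Counter(t for i in top for t in docs[i] if t not in query_terms)
    return list(query_terms) + [t for t, _ in pool.most_common(n_terms)]


# Hypothetical pre-tokenized collection.
docs = [
    ["car", "automobile", "repair"],
    ["automobile", "garage"],
    ["car", "automobile", "dealer"],
    ["fruit", "market"],
]

query = ["car"]
expanded = expand_query(query, docs)
print(expanded)                  # → ['car', 'automobile']
print(score(query, docs[1]))     # → 0
print(score(expanded, docs[1]))  # → 1
```

The second document never mentions "car", so the original query cannot match it; after expansion with "automobile" it receives a non-zero score, which is the vocabulary-mismatch effect that PRF is meant to reduce.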
Gradual Machine Learning for Entity Resolution
Hou, Boyi, Chen, Qun, Wang, Yanyan, Zhong, Ping, Murtadha, Ahmed, Chen, Zhaoqiang, Li, Zhanhuai
Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine learning without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and it is highly competitive with the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.