Column Similarity: Metadata Intelligence for Curation and Consumption


The ability to accurately label columns, attributes, and fields is a critical requirement for both data discovery and data governance. However, organizations can have millions of datasets and hundreds of millions of columns/fields in various structured and semi-structured data sources, making it impossible to manually curate them one by one. Also, not all columns represent unique business concepts/data elements. A single data element, like a CUSTOMER ID or PRODUCT ID, can be part of multiple datasets. Machine learning can help cluster these "instances" of data elements together based on data similarity.
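As a rough illustration of grouping column "instances" by data similarity, the sketch below compares the sampled values of each column with Jaccard similarity and merges columns that overlap above a threshold. The column names, sample values, and the 0.5 threshold are hypothetical, and Jaccard overlap is only one simple stand-in for whatever similarity model a real system would use.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two value collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_similar_columns(columns, threshold=0.5):
    """Group column names whose value sets overlap at or above `threshold`.

    `columns` maps a (hypothetical) "dataset.column" name to sampled values.
    A small union-find merges columns so that chains of similar columns
    end up in the same group.
    """
    parent = {name: name for name in columns}

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in combinations(columns, 2):
        if jaccard(columns[left], columns[right]) >= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for name in columns:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample data: two customer-id columns and one unrelated column.
columns = {
    "orders.cust_id": [101, 102, 103, 104],
    "invoices.customer": [102, 103, 104, 105],
    "catalog.sku": ["A1", "B2", "C3"],
}
groups = group_similar_columns(columns)
print(groups)
```

Comparing every pair of columns is quadratic, so at the scale described above (hundreds of millions of columns) a real system would need sketching or blocking to avoid the all-pairs comparison.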

Flipboard on Flipboard


One particular advancement driven by machine learning is the ability for computers to understand natural language, with Google showcasing these improvements with Smart Reply. Its Research division has been exploring other applications and is today releasing two fun and interesting demos. Last year, Google was able to increase the percentage of Smart Reply's usable suggestions by using hierarchical vector models of language: Natural language understanding has evolved substantially in the past few years, in part due to the development of word vectors that enable algorithms to learn about the relationships between words, based on examples of actual language usage. These vector models map semantically similar phrases to nearby points based on equivalence, similarity, or relatedness of ideas and language. These improvements can drive new search experiences, as Google is demoing with "Talk to Books."

Recommender Systems through Collaborative Filtering


This is a technical deep dive of the collaborative filtering algorithm and how to use it in practice. From Amazon recommending products you may be interested in based on your recent purchases to Netflix recommending shows and movies you may want to watch, recommender systems have become popular across many applications of data science. Like many other problems in data science, there are several ways to approach recommendations. Two of the most popular are collaborative filtering and content-based recommendations. Content-based Recommendations: If companies have detailed metadata about each of your items, they can recommend items with similar metadata tags.
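To make the collaborative filtering idea concrete, here is a minimal item-based sketch: items are compared by the cosine similarity of their rating columns, and unseen items are scored by their similarity to what the user already rated. The users, items, and ratings are made up for illustration, and treating a missing rating as 0 is a simplifying assumption rather than a recommended practice.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def item_vectors(ratings):
    """Turn {user: {item: rating}} into {item: [rating per user]} (0 = unrated)."""
    users = sorted(ratings)
    items = sorted({item for user_ratings in ratings.values() for item in user_ratings})
    return {item: [ratings[user].get(item, 0) for user in users] for item in items}


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity to the items they did rate."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        # Weight each similarity by how much the user liked the seen item.
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[item]) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 1},
    "cat": {"C": 5, "D": 4},
    "dan": {"A": 5},
}
print(recommend("dan", ratings))
```

Here "dan" has only rated item A, and item B is suggested because the users who rated A highly also rated B highly. A content-based recommender would instead compare item metadata tags and would not need other users' ratings.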

Three Ways Expert Knowledge Enables Artificial Intelligence


In engineering businesses where something is manufactured or assembled, specifications tell suppliers the qualities and characteristics of the material they or customers require. Specifications could be based on physics (temperature ranges) or business objectives (material preferences that allow them to achieve cost efficiencies at scale). Specifications are also one way for businesses to make data-driven process improvements, like optimizing supply chains. This example represents an important use case I've encountered in many places where artificial intelligence can provide quantifiable value. Experts typically capture their knowledge and reasoning about complex topics like specifications in unstructured text (comment fields attached to documents, manually written reports) to draw upon later.

Word2Vec word embedding tutorial in Python and TensorFlow - Adventures in Machine Learning


In coming tutorials on this blog I will be dealing with how to create deep learning models that predict text sequences. However, before we get to that point we have to understand some key Natural Language Processing (NLP) ideas. One of the key ideas in NLP is how we can efficiently convert words into numeric vectors which can then be "fed into" various machine learning models to perform predictions. The current key technique to do this is called "Word2Vec" and this is what will be covered in this tutorial. After discussing the relevant background material, we will be implementing Word2Vec embedding using TensorFlow (which makes our lives a lot easier). To get up to speed in TensorFlow, check out my TensorFlow tutorial. Also, if you prefer Keras – check out my Word2Vec Keras tutorial.
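Before any model is trained, the skip-gram variant of Word2Vec needs (target, context) training pairs drawn from a sliding window over the text. The sketch below shows only that data-preparation step in plain Python; the sentence, the window size, and the helper names are illustrative assumptions, and it is not the tutorial's TensorFlow code.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
pairs = skipgram_pairs(tokens, window=1)
# Integer ids are what an embedding layer would actually consume.
id_pairs = [(vocab[target], vocab[context]) for target, context in pairs]
print(pairs[:3])
print(id_pairs[:3])
```

Each id pair would then serve as one training example: the model looks up the target word's vector and is trained to predict the context word, which is what pushes words used in similar contexts toward nearby vectors.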

Large Scale Local Online Similarity/Distance Learning Framework based on Passive/Aggressive Machine Learning

Similarity/distance measures play a key role in many machine learning, pattern recognition, and data mining algorithms, which has led to the emergence of the metric learning field. Many metric learning algorithms learn a global distance function from data that satisfies the constraints of the problem. However, in many real-world datasets where the discrimination power of features varies across different regions of the input space, a global metric is often unable to capture the complexity of the task. To address this challenge, local metric learning methods have been proposed that learn multiple metrics across the different regions of the input space. Some advantages of these methods are high flexibility and the ability to learn a nonlinear mapping, but these are typically achieved at the expense of higher time requirements and overfitting. To overcome these challenges, this research presents an online multiple metric learning framework. Each metric in the proposed framework is composed of a global and a local component learned simultaneously. Adding a global component to a local metric efficiently reduces the problem of overfitting. The proposed framework is also scalable with both the sample size and the dimension of the input data. To the best of our knowledge, this is the first local online similarity/distance learning framework based on PA (Passive/Aggressive). In addition, for scalability with the dimension of the input data, DRP (Dual Random Projection) is extended for local online learning in the present work. It enables our methods to run efficiently on high-dimensional datasets while maintaining their predictive performance. The proposed framework provides a straightforward local extension to any global online similarity/distance learning algorithm based on PA.
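For readers unfamiliar with Passive/Aggressive learning, the sketch below shows the basic PA update for a plain linear binary classifier: the weights stay unchanged when the example already has margin at least 1, and otherwise move just far enough to reach it. This is only the generic building block, assuming labels in {-1, +1}; it is not the local metric learning framework or the DRP extension described in the abstract.

```python
def pa_update(weights, x, y):
    """One Passive/Aggressive step for a linear classifier, with y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss on this single example
    squared_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or squared_norm == 0.0:
        return list(weights)  # passive: the example is already handled
    tau = loss / squared_norm  # aggressive: smallest step that reaches margin 1
    return [w + tau * y * xi for w, xi in zip(weights, x)]


# Hypothetical single training example.
weights = [0.0, 0.0]
x, y = [1.0, 2.0], 1
updated = pa_update(weights, x, y)
print(updated)
```

Because each step touches only the current example, the update fits the online setting the abstract targets, where data arrives one sample at a time.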

Clinical Concept Embeddings Learned from Massive Sources of Medical Data Machine Learning

Word embeddings have emerged as a popular approach to unsupervised learning of word relationships in machine learning and natural language processing. In this article, we benchmark two of the most popular algorithms, GloVe and word2vec, to assess their suitability for capturing medical relationships in large sources of biomedical data. Leaning on recent theoretical insights, we provide a unified view of these algorithms and demonstrate how different sources of data can be combined to construct the largest ever set of embeddings for 108,477 medical concepts using an insurance claims database of 60 million members, 20 million clinical notes, and 1.7 million full text biomedical journal articles. We evaluate our approach, called cui2vec, on a set of clinically relevant benchmarks and in many instances demonstrate state of the art performance relative to previous results. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.

Convolutional Neural Networks Regularized by Correlated Noise Machine Learning

Neurons in the visual cortex are correlated in their variability. The presence of correlation impacts cortical processing because noise cannot be averaged out over many neurons. In an effort to understand the functional purpose of correlated variability, we implement and evaluate correlated noise models in deep convolutional neural networks. Inspired by the cortex, correlation is defined as a function of the distance between neurons and their selectivity. We show how to sample from high-dimensional correlated distributions while keeping the procedure differentiable, so that back-propagation can proceed as usual. The impact of correlated variability is evaluated on the classification of occluded and non-occluded images with and without the presence of other regularization techniques, such as dropout. More work is needed to understand the effects of correlations in various conditions; however, in 10 of the 12 cases we studied, the best performance on occluded images was obtained from a model with correlated noise.

Listing Embeddings for Similar Listing Recommendations and Real-time Personalization in Search


Airbnb's marketplace contains millions of diverse listings, which potential guests explore through search results generated by a sophisticated Machine Learning model that uses more than a hundred signals to decide how to rank a particular listing on the search page. Once a guest views a home, they can continue their search by either returning to the results or by browsing the Similar Listing Carousel, where listing recommendations related to the current listing are shown. In this blog post we describe a Listing Embedding technique we developed and deployed at Airbnb for the purpose of improving Similar Listing Recommendations and Real-Time Personalization in Search Ranking. The embeddings are vector representations of Airbnb homes learned from search sessions that allow us to measure similarities between listings. They effectively encode many listing features, such as location, price, listing type, architecture, and listing style, all using only 32 float numbers.

Engineering a Simplified 0-Bit Consistent Weighted Sampling Machine Learning

The Min-Hashing approach to sketching has become an important tool in data analysis, search, and classification. To apply it to real-valued datasets, the ICWS algorithm has become a seminal approach that is widely used and provides state-of-the-art performance for this problem space. However, ICWS suffers a computational burden as the sketch size K increases. We develop a new Simplified approach to the ICWS algorithm that enables us to obtain over 20x speedups compared to the standard algorithm. The veracity of our approach is demonstrated empirically on multiple datasets, showing that our new Simplified CWS obtains the same quality of results while being an order of magnitude faster.
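As background for the abstract above, here is a minimal sketch of plain Min-Hashing on unweighted sets: each of K seeded hash functions contributes the minimum hash over the set, and the fraction of positions where two sketches agree estimates the Jaccard similarity. This is the unweighted baseline only, not ICWS or the Simplified CWS variant; the seeded BLAKE2b hash and the sketch size of 256 are illustrative choices.

```python
import hashlib


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of `item` under a given integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, k=256):
    """Sketch a non-empty collection as the minimum hash under each of k seeds."""
    return [min(seeded_hash(seed, item) for item in items) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Two hypothetical sets sharing 50 of 150 distinct elements (true Jaccard = 1/3).
set_a = set(range(0, 100))
set_b = set(range(50, 150))
estimate = estimate_jaccard(minhash_sketch(set_a), minhash_sketch(set_b))
print(round(estimate, 3))
```

The cost of building a sketch grows with K, since every element is hashed once per seed, which is the same kind of burden the abstract attributes to ICWS at large sketch sizes.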