Recognizing Variables from their Data via Deep Embeddings of Distributions

Mueller, Jonas, Smola, Alex

arXiv.org Machine Learning 

--A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be more robustly addressed by leveraging the data values themselves rather than just relying on their arbitrarily selected variable names. Here, we present a computationally efficient method to identify high-confidence variable matches between a given set of data values and a large repository of previously encountered datasets. Our approach enjoys numerous advantages over distributional similarity based techniques because we leverage learned vector embeddings of datasets which adaptively account for naturalforms of data variation encountered in practice. Based on the neural architecture of deep sets, our embeddings can be computed for both numeric and string data. In dataset search and schema matching tasks, our methods outperform standard statistical techniques and we find that the learned embeddings generalize well to new data sources. I NTRODUCTION Emerging ideas in automated analytics [1] and meta-learning across many datasets [2] offer great promise for improving both performance and tedium in the data science pipeline. However, a major obstacle remains: such methods generally have no knowledge about what type of real-world entity (i.e. In contrast, human analysts presented with new data often utilize this knowledge to recall previously-encountered datasets that contain the same sort of variables. Reviewing past experience with how different algorithms fared on these same variables enables a person to quickly leverage methods that work well for this type of data (e.g.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found