Computing the Similarity Between Two Machine Learning Datasets -- Visual Studio Magazine

#artificialintelligence 

At first thought, computing the similarity/distance between two datasets sounds easy, but in fact the problem is extremely difficult, explains Dr. James McCaffrey of Microsoft Research. A fairly common sub-problem in many machine learning and data science scenarios is the need to compute the similarity (or difference or distance) between two datasets. For example, if you select a sample from a huge set of training data, you likely want to know how similar the sample dataset is to the source dataset. Or if you want to prime the training for a very deep neural network, you need to find an existing model that was trained using a dataset that is most similar to your new dataset. At first thought, computing the similarity/distance between two datasets sounds easy, but in fact the problem is extremely difficult. If you try to compare individual lines between datasets, you quickly run into the combinatorial explosion problem -- there are just too many comparisons.