Note: Before starting Part 2, be sure to read Part 1! When it comes to machine learning, ultimately the most important picture to have is the big picture. Whether it's logistic regression, random forests, Bayesian methods, support vector machines, or neural nets, everyone seems to have their favorite! Unfortunately, these discussions tend to reduce the challenges of machine learning to a single problem, which is a particularly problematic misrepresentation for people who are just getting started with machine learning. Sure, picking a good model is important, but it's certainly not enough (and it's debatable whether a model can actually be 'good' devoid of the context of the domain, the hypothesis, the shape of the data, and the intended application). In this post we'll discuss model selection in the context of the big picture, which I'll present in terms of the model selection triple, and we'll explore a set of visual tools for navigating the triple.
In just a few years, artificial intelligence and machine learning have become key technologies that professionals and organizations must master to stay in the game and ahead of the competition. Organizations are starting to invest heavily in machine learning, and we already see highly positive results. In simple words, a dataset is a collection of data. It is usually organized as a table with data and column names, not very different from what you are used to working with in Excel.
Generating datasets that "look like" given real ones is an interesting task for healthcare applications of ML and many other fields of science and engineering. In this paper we propose a new, generally applicable method for binary datasets, based on a method for learning the parameters of a latent variable model that we have previously used for clustering patient datasets. We compare our method with a recent proposal (MedGan) based on generative adversarial methods and find that the synthetic datasets we generate are globally more realistic in at least two senses: real and synthetic instances are harder to tell apart both by Random Forests and by the MMD statistic. The most likely explanation is that our method does not suffer from "mode collapse", which is an acknowledged problem of GANs. Additionally, the generative models we produce are easy to interpret, unlike the rather obscure GANs. Our experiments are performed on two patient datasets containing ICD-9 diagnostic codes: the publicly available MIMIC-III dataset and a dataset containing admissions for congestive heart failure over 7 years at Hospital de Sant Pau in Barcelona.
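To make the MMD comparison concrete, here is a minimal sketch of a (biased) squared maximum mean discrepancy estimate between two small binary datasets. The Gaussian kernel, the `gamma` value, the function names, and the toy records are all assumptions chosen for illustration; the abstract does not say which kernel or estimator was used.

```python
import math
from itertools import product


def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def mmd_squared(real, synthetic, gamma=0.5):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(real, real, gamma)
        + mean_kernel(synthetic, synthetic, gamma)
        - 2.0 * mean_kernel(real, synthetic, gamma)
    )


# Hypothetical toy records: each tuple is a binary vector of code indicators.
real = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
varied = [(1, 0, 1), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
collapsed = [(1, 1, 1)] * 4  # every synthetic record identical

print("varied   :", mmd_squared(real, varied))
print("collapsed:", mmd_squared(real, collapsed))
```

A smaller value means the two samples are harder to distinguish under the chosen kernel. The `collapsed` sample imitates, in a very crude way, what a generator stuck on a single mode would produce.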
We can see the interchangeableness directly in Kuhn and Johnson's excellent text "Applied Predictive Modeling". In this example, they are careful to point out that the final model evaluation must be performed on a held-out dataset that has not been used previously, either for training the model or for tuning the model parameters. Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness. When a large amount of data is at hand, a set of samples can be set aside to evaluate the final model. The "training" data set is the general term for the samples used to create the model, while the "test" or "validation" data set is used to qualify performance.
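The idea of setting samples aside can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name, the partition names `train`, `tuning`, and `holdout`, and the 60/20/20 proportions are hypothetical choices, not taken from Kuhn and Johnson.

```python
import random


def split_with_holdout(samples, tuning_fraction=0.2, holdout_fraction=0.2, seed=0):
    """Shuffle samples and split them into train, tuning, and held-out parts.

    The held-out part is meant to be touched only once, for the final evaluation.
    """
    if not 0 <= tuning_fraction + holdout_fraction < 1:
        raise ValueError("fractions must leave some data for training")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_holdout = int(len(shuffled) * holdout_fraction)
    n_tuning = int(len(shuffled) * tuning_fraction)

    holdout = shuffled[:n_holdout]
    tuning = shuffled[n_holdout:n_holdout + n_tuning]
    train = shuffled[n_holdout + n_tuning:]
    return train, tuning, holdout


# Hypothetical data: 100 sample identifiers.
train, tuning, holdout = split_with_holdout(range(100))
print(len(train), len(tuning), len(holdout))
```

Because the three parts never overlap, a score computed on `holdout` is not influenced by samples the model saw during fitting or parameter tuning.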