If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."
However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …
Free-text radiology reports can be automatically classified by convolutional neural networks (CNNs) powered by deep-learning algorithms, with accuracy equal to or better than that achieved by traditional, and decidedly labor-intensive, natural language processing (NLP) methods. That's the conclusion of researchers led by Matthew Lungren, MD, MPH, of Stanford University. The team tested a CNN model they developed for mining pulmonary-embolism findings from thoracic CT reports generated at two institutions. The study, lead-authored by Matthew Chen, MS, also of Stanford, was published online in Radiology on Nov. 13. The researchers analyzed annotations made by two radiologists for the presence, chronicity, and location of pulmonary embolisms, then compared their CNN's performance with that of an NLP model considered quite proficient at this task, called PeFinder.
SMOTE, the Synthetic Minority Oversampling Technique (Chawla, 2002), and its several variants address class imbalance through oversampling, and they have recently become a very popular way to improve model performance; they will be our focus here. SMOTE has become a mainstream choice because it tries to enhance the separation between the majority and minority classes, making classification more accurate. One caveat: resample your training data, NOT your validation and holdout data. A common mistake is to resample all the data first and only then select your validation and holdout sets, which leaks synthetic information into the evaluation.
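The core SMOTE idea can be sketched in a few lines of NumPy (a minimal illustration of the interpolation step, not the reference implementation or any particular library's API): for each synthetic point, pick a minority sample, pick one of its k nearest minority neighbors, and interpolate a random fraction of the way between them.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples, SMOTE-style.

    For each new point: pick a random minority sample, find one of its
    k nearest minority neighbors, and interpolate a random fraction of
    the way toward that neighbor.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf  # exclude the sample itself
        neighbors = np.argsort(d)[:k]
        j = rng.choice(neighbors)
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Tiny made-up minority class of 6 points in 2-D
X_minority = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [2, 2]])
X_new = smote_sketch(X_minority, n_new=10, rng=0)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies on the segment between them, which is exactly why the oversampling must happen after the validation and holdout sets are carved off.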
So we'll apply it to build a model that depends on a cost function and check whether it performs better than models built from raw (or automatically balanced) data. A batch prediction takes a model ID and a test dataset ID and runs every instance of the test dataset through the model. A prediction can be right in two ways: correctly predicting the positive class (TP, true positives) or correctly predicting the negative class (TN, true negatives). Likewise, there are two ways for a prediction to be wrong: instances predicted to be of the positive class that are not (FP, false positives), and instances of the positive class whose prediction fails (FN, false negatives).
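The four outcomes above can be tallied directly from a list of predictions; a small self-contained sketch (the labels here are made up for illustration):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Toy example: 8 instances, one false positive and one false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

With these four counts in hand, any cost function over the predictions (e.g., penalizing FNs more heavily than FPs) is a simple weighted sum.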
The competition aimed to assess the state of the art in AI systems utilizing natural language understanding and knowledge-based reasoning; how accurately the participants' models could answer the exam questions would serve as an indicator of how far the field has come in these areas. A week before the end of the competition, we provided participants with the final test set of 21,298 questions (including the validation set), of which 2,583 were legitimate, to use to produce a final score for their models. AI2 also generated a baseline score using a Lucene search over the Wikipedia corpus, producing scores of 40.2% on the training set and 40.7% on the final test set. The winning model achieved a final score of 59.31% correct on the test set of 2,583 legitimate questions using a combination of 15 gradient-boosting models, each with a different subset of features.
Let's go through an example of telecom customer churn. Decision trees create a model that predicts the class or label based on several input features. Spark ML supports k-fold cross validation with a transformation/estimation pipeline to try out different combinations of parameters, using a process called grid search: you set up the parameters to test and a cross-validation evaluator, constructing a model-selection workflow. It's not surprising that the most important feature numbers map to the fields Customer service calls and Total day minutes. In this blog post, we showed you how to get started using Apache Spark's machine learning decision trees and ML pipelines for classification.
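The same grid-search-plus-cross-validation pattern exists outside Spark; here is a hedged scikit-learn sketch of the idea (synthetic stand-in data, not the churn dataset or the Spark code from the post):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for churn data: two informative features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical churn label

# The grid of parameter combinations to test, mirroring the
# ParamGridBuilder idea in Spark ML
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

# 3-fold cross validation evaluates every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Each parameter combination gets one cross-validated score, and the best-scoring combination is refit on the full training data, which is exactly the model-selection workflow the Spark pipeline automates.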
Generally, the term "validation set" is used interchangeably with the term "test set," both referring to a sample of the dataset held back from training the model. Strictly speaking, though, the validation dataset is different from the test dataset: the test set is also held back from training, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models. Importantly, Russell and Norvig comment that the training dataset used to fit the model can be further split into a training set and a validation set, and that it is this subset of the training dataset, called the validation set, that can be used to get an early estimate of the skill of the model. In addition to reiterating Ripley's glossary definitions, it goes on to discuss the common misuse of the terms "test set" and "validation set" in applied machine learning.
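The three-way split described above is mechanically simple; a minimal NumPy sketch (the fractions and seed are arbitrary choices, not anything prescribed by the sources cited):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve off disjoint validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]               # unbiased final estimate
    val_idx = idx[n_test:n_test + n_val]  # tuning / early skill estimate
    train_idx = idx[n_test + n_val:]      # model fitting
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])

X = np.arange(100).reshape(100, 1)
y = np.arange(100)
X_tr, y_tr, X_val, y_val, X_te, y_te = train_val_test_split(X, y)
print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```

Keeping the test set untouched until the very end is what preserves its role as an unbiased estimate; the validation set, by contrast, is consumed during tuning.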
We are also provided with a training set of full run-to-failure data for a number of engines and a test set with truncated engine data and their corresponding RUL values. One way of addressing this is to look at the distribution of sensor values in "healthy" engines and compare it to a similar set of measurements taken when the engines are close to failure. The figure above shows the distribution of the values of a particular sensor (sensor 2) for each engine in the training set, where healthy values (in blue) are taken from the first 20 cycles of the engine's lifetime and failing values are from the last 20 cycles. A second plot shows, in blue, the values of that same sensor plotted against the true RUL value at each time cycle for the engines in the training set.
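The healthy-versus-failing comparison can be sketched on simulated data (the drifting-sensor model below is entirely made up for illustration; the real dataset's sensor behavior may differ):

```python
import numpy as np

# Hypothetical run-to-failure data: one array of sensor-2 readings per
# engine, drifting upward over its lifetime as the engine degrades
rng = np.random.default_rng(1)
engines = [rng.normal(0.0, 1.0, n) + np.linspace(0, 3, n)
           for n in rng.integers(120, 200, size=10)]

# Pool the first 20 cycles ("healthy") and the last 20 ("failing")
# across all engines, then compare the two distributions
healthy = np.concatenate([e[:20] for e in engines])
failing = np.concatenate([e[-20:] for e in engines])
print(round(healthy.mean(), 2), round(failing.mean(), 2))
```

A clear separation between the two pooled distributions is what makes a sensor useful as a degradation indicator; a sensor whose healthy and failing distributions overlap heavily carries little RUL signal.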
A common technique for model selection is k-fold cross validation, where the data is randomly split into k partitions. Each partition is used once as the testing data set, while the rest are used for training. Models are then generated using the training sets and evaluated with the testing sets, resulting in k model performance measurements. For model selection we can search through the model parameters, comparing their cross validation performances.
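The fold construction described above fits in a few lines; a minimal sketch (the model-fitting step is stubbed out, since the paragraph is about the splitting itself):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k folds; yield (train, test) pairs,
    using each fold exactly once as the test partition."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

sizes = []
for train, test in kfold_indices(n=20, k=5):
    # a real workflow would fit on `train` and score on `test`;
    # here we just record the partition sizes
    sizes.append((len(train), len(test)))
print(sizes)  # [(16, 4), (16, 4), (16, 4), (16, 4), (16, 4)]
```

Running this loop once per candidate parameter setting, and averaging the k scores for each, gives exactly the comparison used for model selection.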
Import some Keras goodness (and perhaps run pip install keras first if you need it). To keep the checkpoint from just before overfitting occurs, ModelCheckpoint lets us save the best model before validation-set performance starts to decline. We will then do the same with good old xgboost (conda install xgboost) via its nice scikit-learn API. On these datasets, training the ANN takes no time at all.
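The logic behind ModelCheckpoint's save_best_only=True option boils down to keeping a snapshot whenever the monitored validation metric improves; a plain-Python sketch of that rule (the per-epoch scores below are invented for illustration, and a real callback would save model weights where we record the epoch):

```python
# Hypothetical per-epoch validation accuracies: improves, then overfits
val_scores = [0.71, 0.78, 0.84, 0.82, 0.79, 0.75]

best_score = float("-inf")
best_epoch = None
for epoch, score in enumerate(val_scores):
    if score > best_score:   # same rule as save_best_only=True
        best_score = score
        best_epoch = epoch   # a real callback would save weights here
print(best_epoch, best_score)  # 2 0.84
```

Training continues past the best epoch, but the saved checkpoint is frozen at it, which is what lets us recover the model from just before validation performance declined.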
Deep learning models are complex and tricky to train, and I had a hunch that lack of convergence and other training difficulties, rather than overfitting, explained the poor performance. We recreated Python versions of the Leekasso and the MLP used in the original post to the best of our ability, and the code is available here. The MLP used in the original analysis still looks pretty bad for small sample sizes, but our neural nets achieve essentially perfect accuracy for all sample sizes. Many parameters are problem-specific (especially those related to SGD), and poor choices will result in misleadingly bad performance.