Diabetes is one of the deadliest diseases in the world. It is not only a disease in its own right but also a cause of other conditions such as heart attack, blindness, and kidney disease. Diagnosis normally requires patients to visit a diagnostic center, consult their doctor, and wait a day or more for their reports. Moreover, they must pay each time they want a diagnosis report. Machine learning approaches offer a way to address this problem: we have developed a system using data mining that predicts whether or not a patient has diabetes.
Holst, Anders (Swedish Institute of Computer Science) | Bohlin, Markus (Swedish Institute of Computer Science) | Ekman, Jan (Swedish Institute of Computer Science) | Sellin, Ola (Bombardier Transportation) | Lindström, Björn (Addiva Consulting AB) | Larsen, Stefan (Addiva Eduro AB)
We have developed a method for statistical anomaly detection which has been deployed in a tool for condition monitoring of train fleets. The tool is currently used by several railway operators around the world to inspect and visualize the occurrence of event messages generated on the trains. The anomaly detection component helps the operators to quickly find significant deviations from normal behavior and to detect early indications of possible problems. The savings in maintenance costs come mainly from avoiding costly breakdowns, and have been estimated at several million euros per year for the tool. In the long run, it is expected that maintenance costs can be reduced by between 5 and 10% by using the tool.
Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large-scale bandit problems, and its performance depends on how well the model fits the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson sampling by replacing the posterior distribution with a bootstrap distribution. We first explain BTS and show that its performance is competitive with Thompson sampling in the well-studied Bernoulli bandit case. Subsequently, we detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise not available or are costly.
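The idea of replacing posterior draws with bootstrap draws can be illustrated for the Bernoulli bandit with a minimal sketch. It is not the authors' reference implementation. It assumes a "double-or-nothing" online bootstrap, in which each replicate absorbs a new observation with probability 1/2, and it starts every replicate with one pseudo-success and one pseudo-failure. The function name, parameters, and default values are all hypothetical.

```python
import random


def bootstrap_thompson_sampling(true_means, n_rounds=2000, n_replicates=100, seed=0):
    """Simulate a Bernoulli bandit played with a bootstrap-based policy.

    Assumption: each arm keeps `n_replicates` online bootstrap replicates,
    and each replicate is updated with probability 1/2 per observation.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Per-arm, per-replicate success and failure counts (pseudo-count of 1 each).
    successes = [[1.0] * n_replicates for _ in range(n_arms)]
    failures = [[1.0] * n_replicates for _ in range(n_arms)]
    pulls = [0] * n_arms
    total_reward = 0

    for _ in range(n_rounds):
        # Instead of sampling from a posterior, pick one bootstrap replicate
        # per arm at random and use its estimated success rate.
        estimates = []
        for arm in range(n_arms):
            j = rng.randrange(n_replicates)
            estimates.append(successes[arm][j] / (successes[arm][j] + failures[arm][j]))
        chosen = max(range(n_arms), key=lambda a: estimates[a])

        # Observe a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < true_means[chosen] else 0
        pulls[chosen] += 1
        total_reward += reward

        # Online bootstrap update: each replicate sees the observation
        # with probability 1/2 (weight 2 or 0, up to scaling).
        for j in range(n_replicates):
            if rng.random() < 0.5:
                successes[chosen][j] += reward
                failures[chosen][j] += 1 - reward

    return pulls, total_reward


if __name__ == "__main__":
    pulls, total_reward = bootstrap_thompson_sampling([0.2, 0.8])
    print("pulls per arm:", pulls, "total reward:", total_reward)
```

Each round costs one random replicate lookup per arm plus a constant-time update per replicate of the chosen arm, and no posterior needs to be sampled. This is the property the abstract points to when it calls the online bootstrap variant scalable.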
This was the subject of a question asked on Quora: What are the top 10 data mining or machine learning algorithms? Some modern algorithms such as collaborative filtering, recommendation engines, segmentation, or attribution modeling are missing from the lists below. Algorithms from graph theory (to find the shortest path in a graph, or to detect connected components), from operations research (the simplex, to optimize the supply chain), or from time series are not listed either. And I could not find MCMC (Markov Chain Monte Carlo) and related algorithms used to process hierarchical, spatio-temporal and other Bayesian models. For the last one I'd let you pick one of the following: My point of view is of course biased, but I would also like to add some algorithms developed or re-developed at the Data Science Central's research lab: These algorithms are described in the article What you won't learn in statistics classes.
Data-driven anomaly detection methods typically build a model for the normal behavior of the target system, and score each data instance with respect to this model. A threshold is invariably needed to identify data instances with high (or low) scores as anomalies. This presents a practical limitation on the applicability of such methods, since most methods are sensitive to the choice of the threshold, and it is challenging to set optimal thresholds. We present a probabilistic framework to explicitly model the normal and anomalous behaviors and probabilistically reason about the data. An extreme value theory based formulation is proposed to model the anomalous behavior as the extremes of the normal behavior. As a specific instantiation, a joint non-parametric clustering and anomaly detection algorithm (INCAD) is proposed that models the normal behavior as a Dirichlet Process Mixture Model. A pseudo-Gibbs sampling based strategy is used for inference. Results on a variety of data sets show that the proposed method provides effective clustering and anomaly detection without requiring strong initialization and thresholding parameters.
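The conventional score-and-threshold pipeline that this abstract takes as its starting point can be sketched as follows. This is only an illustration of the limitation being discussed, not the proposed INCAD method. It assumes a single Gaussian as the model of normal behavior and an absolute z-score as the anomaly score. The function names, the synthetic data, and the threshold values are hypothetical.

```python
import random
import statistics


def anomaly_scores(train, data):
    """Score each instance by its absolute z-score under a Gaussian
    fitted to `train` (the assumed model of normal behavior)."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return [abs(x - mu) / sigma for x in data]


def flag(scores, threshold):
    """Return the indices of instances whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "normal" behavior plus two injected extreme values.
    normal = [rng.gauss(0.0, 1.0) for _ in range(500)]
    observed = [rng.gauss(0.0, 1.0) for _ in range(200)] + [6.0, -7.5]

    scores = anomaly_scores(normal, observed)
    # The number of flagged instances depends directly on the chosen threshold.
    for threshold in (2.0, 3.0, 4.0):
        print(threshold, len(flag(scores, threshold)))
```

Raising the threshold can only shrink the flagged set, so which instances are reported depends on a value the user must choose by hand. That dependence is the sensitivity the abstract describes.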