Multivariate Anomaly Detection in Medicare using Model Residuals and Probabilistic Programming

AAAI Conferences

Anomalies in healthcare claims data can indicate possible fraudulent activity, which contributes to a significant portion of overall healthcare costs. Medicare is a large government-run healthcare program that serves the needs of the elderly in the United States. The increasing elderly population and their reliance on the Medicare program create an environment with rising costs and increased risk of fraud. The detection of these potentially fraudulent activities can recover costs and lessen the overall impact of fraud on the Medicare program. In this paper, we propose a new method to detect fraud by discovering outliers, or anomalies, in payments made to Medicare providers. We employ a multivariate outlier detection method split into two parts. In the first part, we create a multivariate regression model and generate corresponding residuals. In the second part, these residuals are used as inputs into a generalizable univariate probability model. We create this Bayesian probability model using probabilistic programming. Our results indicate that our model is robust and less dependent on underlying data distributions than Mahalanobis distance. Moreover, we demonstrate successful anomaly detection within Medicare specialties, providing meaningful results for further investigation.
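The two-part structure described in this abstract (regression residuals first, a univariate outlier model second) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy predictors and payments are invented, and a median/MAD robust z-score stands in for the paper's Bayesian probability model, which the abstract does not specify in detail.

```python
from statistics import median

def fit_least_squares(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def residuals(X, y, beta):
    """Part 1 output: observed value minus the regression prediction."""
    return [yi - (beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
            for x, yi in zip(X, y)]

def robust_scores(res):
    """Part 2 stand-in: robust z-scores from the median and MAD.
    This is NOT the paper's Bayesian model, only a simple univariate scorer."""
    med = median(res)
    mad = median([abs(r - med) for r in res])
    if mad == 0:
        return [0.0] * len(res)
    return [0.6745 * (r - med) / mad for r in res]

# Hypothetical toy data: two predictors per provider and a payment amount.
X = [(1, 3), (2, 1), (3, 4), (4, 1), (5, 5), (6, 9), (7, 2), (8, 6)]
y = [100 + 3 * a + 2 * b for a, b in X]
y[4] += 50  # inject one unusually high payment

beta = fit_least_squares(X, y)
scores = robust_scores(residuals(X, y, beta))
flagged = [i for i, s in enumerate(scores) if abs(s) > 3.5]  # 3.5 is a common rule of thumb
print("flagged providers:", flagged)
```

Scoring residuals rather than raw payments means a provider is compared against what the regression expects for its covariates, so a large but explainable payment is not automatically treated as anomalous.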

Machine Learning Prediction of Mortality and Hospitalization in Heart Failure with Preserved Ejection Fraction


Objectives: This study sought to develop models for predicting mortality and heart failure (HF) hospitalization for outpatients with HF with preserved ejection fraction (HFpEF) in the TOPCAT (Treatment of Preserved Cardiac Function Heart Failure with an Aldosterone Antagonist) trial. Background: Although risk assessment models are available for patients with HF with reduced ejection fraction, few have assessed the risks of death and hospitalization in patients with HFpEF. Methods: Five methods were used to train models: logistic regression with forward selection of variables; logistic regression with lasso regularization for variable selection; random forest (RF); gradient descent boosting; and support vector machine. The models assessed risks of mortality and HF hospitalization through 3 years of follow-up and were validated using 5-fold cross-validation. Model discrimination and calibration were estimated using receiver-operating characteristic curves and Brier scores, respectively. The top prediction variables were identified with the best performing models, based on the incremental improvement contributed by each variable in 5-fold cross-validation. Results: The RF was the best performing model, with a mean C-statistic of 0.72 (95% confidence interval [CI]: 0.69 to 0.75) for predicting mortality (Brier score: 0.17), and 0.76 (95% CI: 0.71 to 0.81) for HF hospitalization (Brier score: 0.19). Blood urea nitrogen levels, body mass index, and Kansas City Cardiomyopathy Questionnaire (KCCQ) subscale scores were strongly associated with mortality, whereas hemoglobin level, blood urea nitrogen, time since previous HF hospitalization, and KCCQ scores were the most significant predictors of HF hospitalization. Conclusions: These models predict the risks of mortality and HF hospitalization in patients with HFpEF and emphasize the importance of health status data in determining prognosis.
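The two evaluation metrics named in this abstract can be sketched in a few lines. The C-statistic (area under the receiver-operating characteristic curve) measures discrimination, and the Brier score measures calibration. The labels and predicted risks below are invented for illustration and are not from the TOPCAT data.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and the 0/1 outcome.
    Lower is better; 0 is perfect."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def c_statistic(y_true, y_prob):
    """Probability that a randomly chosen event case receives a higher
    predicted risk than a randomly chosen non-event case (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos for pn in neg
    )
    return concordant / (len(pos) * len(neg))

# Hypothetical outcomes (1 = event) and predicted risks for four patients.
outcomes = [1, 0, 1, 0]
risks = [0.9, 0.2, 0.6, 0.7]
print(brier_score(outcomes, risks), c_statistic(outcomes, risks))
```

In a 5-fold cross-validation, both functions would be applied to the held-out fold of each split and the results averaged, which is one way a "mean C-statistic" can be obtained.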

Automated extraction of mutual independence patterns using Bayesian comparison of partition models

Machine Learning

Mutual independence is a key concept in statistics that characterizes the structural relationships between variables. Existing methods to investigate mutual independence rely on the definition of two competing models, one being nested into the other and used to generate a null distribution for a statistic of interest, usually under the asymptotic assumption of large sample size. As such, these methods have a very restricted scope of application. In the present manuscript, we propose to change the investigation of mutual independence from a hypothesis-driven task that can only be applied in very specific cases to a blind and automated search within patterns of mutual independence. To this end, we treat the issue as one of model comparison that we solve in a Bayesian framework. We show the relationship between such an approach and existing methods in the case of multivariate normal distributions as well as cross-classified multinomial distributions. We propose a general Markov chain Monte Carlo (MCMC) algorithm to numerically approximate the posterior distribution on the space of all patterns of mutual independence. The relevance of the method is demonstrated on synthetic data as well as two real datasets, showing the unique insight provided by this approach.
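To give a sense of the search space involved, the sketch below assumes that each pattern of mutual independence corresponds to a partition of the variables into blocks, with dependence allowed inside a block and independence between blocks. This reading is inferred from the title's "partition models" and is an assumption, as are the variable names. The code only enumerates the patterns; it does not score them.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Option 1: `first` forms a block of its own.
        yield [[first]] + partition
        # Option 2: `first` joins one of the existing blocks.
        for i, block in enumerate(partition):
            yield partition[:i] + [[first] + block] + partition[i + 1:]

# Hypothetical variable names.
variables = ["x1", "x2", "x3", "x4"]
patterns = list(set_partitions(variables))
print(len(patterns), "candidate patterns for", len(variables), "variables")
```

The number of partitions grows as the Bell numbers (5, 15, 52, 203, ... for 3, 4, 5, 6 variables), so exhaustive comparison quickly becomes impractical, which is consistent with the abstract's use of MCMC to approximate the posterior over patterns.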

Bayesian Consensus Clustering

Machine Learning

The task of clustering a set of objects based on multiple sources of data arises in several modern applications. We propose an integrative statistical model that permits a separate clustering of the objects for each data source. These separate clusterings adhere loosely to an overall consensus clustering, and hence they are not independent. We describe a computationally scalable Bayesian framework for simultaneous estimation of both the consensus clustering and the source-specific clusterings. We demonstrate that this flexible approach is more robust than joint clustering of all data sources, and is more powerful than clustering each data source separately. This work is motivated by the integrated analysis of heterogeneous biomedical data, and we present an application to subtype identification of breast cancer tumor samples using publicly available data from The Cancer Genome Atlas. Software is available at

A Machine Learning Approach to Predicting Blood Glucose Levels for Diabetes Management

AAAI Conferences

Patients with diabetes must continually monitor their blood glucose levels and adjust insulin doses, striving to keep blood glucose levels as close to normal as possible. Blood glucose levels that deviate from the normal range can lead to serious short-term and long-term complications. An automatic prediction model that warned people of imminent changes in their blood glucose levels would enable them to take preventive action. In this paper, we describe a solution that uses a generic physiological model of blood glucose dynamics to generate informative features for a Support Vector Regression model that is trained on patient-specific data. The new model outperforms diabetes experts at predicting blood glucose levels and could be used to anticipate almost a quarter of hypoglycemic events 30 minutes in advance. Although the corresponding precision is currently just 42%, most false alarms are in near-hypoglycemic regions and therefore patients responding to these hypoglycemia alerts would not be harmed by intervention.
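The relationship between the two figures quoted above can be made concrete with a small calculation. Reading "almost a quarter of hypoglycemic events" as a recall of roughly 0.25 is an interpretation, and the counts below are hypothetical, chosen only so that the resulting rates match the abstract; they are not the study's data.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of alerts that preceded a real event.
    Recall: share of real events that were preceded by an alert."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Hypothetical counts: 50 alerts raised, 21 of them correct,
# and 84 hypoglycemic events in total (63 missed).
precision, recall = precision_recall(true_pos=21, false_pos=29, false_neg=63)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under these assumed counts, 29 of 50 alerts would be false alarms, which is why the abstract's observation that most false alarms fall in near-hypoglycemic regions matters for judging whether acting on an alert is safe.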