With the overall increase in the elderly population come additional medical needs and costs. Medicare is a U.S. healthcare program that provides insurance, primarily to individuals 65 years or older, to offload some of the financial burden associated with medical care. Even so, healthcare costs are high and continue to increase. Fraud is a major contributor to these inflating healthcare expenses. Our paper provides a comprehensive study leveraging machine learning methods to detect fraudulent Medicare providers. We use publicly available Medicare data, with provider exclusions serving as fraud labels, to build and assess three different learners. To lessen the impact of class imbalance, given so few actual fraud labels, we employ random undersampling to create four class distributions. Our results show that the C4.5 decision tree and logistic regression learners have the best fraud detection performance, particularly for the 80:20 class distribution, with average AUC scores of 0.883 and 0.882, respectively, and low false negative rates. We successfully demonstrate the efficacy of employing machine learning with random undersampling to detect Medicare fraud.
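As an illustration of the random undersampling step, the following minimal sketch keeps every minority (fraud) row and draws a random subset of majority rows until a target class distribution is reached. It assumes that 80:20 is read as non-fraud:fraud; the function name `random_undersample`, the `majority_share` parameter, and the toy labels are hypothetical and not taken from the paper.

```python
import random


def random_undersample(labels, majority_share, seed=0):
    """Return sorted row indices that keep all minority rows (label 1) and a
    random subset of majority rows (label 0), so that majority rows make up
    roughly `majority_share` of the result."""
    if not 0.0 < majority_share < 1.0:
        raise ValueError("majority_share must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority rows needed to reach the requested share.
    n_major = round(len(minority) * majority_share / (1.0 - majority_share))
    n_major = min(n_major, len(majority))  # cannot keep more than exist
    kept = rng.sample(majority, n_major)   # sampling without replacement
    return sorted(minority + kept)


# Toy data (assumption): 20 fraud labels among 1,000 providers.
labels = [1] * 20 + [0] * 980
idx = random_undersample(labels, majority_share=0.80)
n_fraud = sum(labels[i] for i in idx)
print(len(idx), n_fraud)
```

Only the majority class is subsampled, so none of the scarce fraud labels is discarded; the trade-off is that most non-fraud rows are left out of training.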
Databases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naive modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. We consider a limiting case of generalized linear modeling in which the target variables are only known up to permutation, and explore how this relates to permutation testing, a standard technique for assessing statistical dependency. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual-level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest that the proposed approach is effective when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins, with order statistics such as the median as a limiting case. Our experimental results on simulated data and aggregated healthcare data suggest a diminishing returns property with respect to the granularity of the histogram: when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.
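One way to read "alternating imputation and standard generalized linear model fitting" for targets known only up to permutation is sketched below for ordinary least squares with a single feature. This is an assumed interpretation, not the authors' algorithm: the helper names (`fit_line`, `impute_by_rank`, `alternate`), the random initialisation, and the rank-matching imputation step are all illustrative choices.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute_by_rank(pred, pool):
    """Give the k-th smallest value in `pool` (already sorted) to the row
    with the k-th smallest prediction."""
    order = sorted(range(len(pred)), key=pred.__getitem__)
    imputed = [0.0] * len(pred)
    for rank, row in enumerate(order):
        imputed[row] = pool[rank]
    return imputed


def alternate(x, shuffled_y, n_iter=20, seed=0):
    """Alternate between fitting the model and re-imputing the targets."""
    rng = random.Random(seed)
    pool = sorted(shuffled_y)
    y_hat = pool[:]
    rng.shuffle(y_hat)  # arbitrary initial assignment (assumption)
    for _ in range(n_iter):
        slope, intercept = fit_line(x, y_hat)
        pred = [slope * a + intercept for a in x]
        y_hat = impute_by_rank(pred, pool)
    return fit_line(x, y_hat), y_hat


# Toy data (assumption): an exact linear relationship whose targets are
# released only as a shuffled list.
rng = random.Random(1)
x = [float(i) for i in range(50)]
y_true = [2.0 * a + 1.0 for a in x]
shuffled = y_true[:]
rng.shuffle(shuffled)
(slope, intercept), y_hat = alternate(x, shuffled)
print(slope, intercept)
```

In this single-feature sketch the sign of the recovered slope depends on the initial assignment, because matching the sorted targets to increasing or to decreasing predictions are both fixed points of the iteration; only the magnitude of the slope is pinned down by the data here.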
We develop a model using deep learning techniques and natural language processing on unstructured text from medical records to predict hospital-wide $30$-day unplanned readmission, achieving a c-statistic of $0.70$. Our model is constructed to allow physicians to interpret the significant features for prediction.
Hospital readmissions within 30 days after discharge following Coronary Artery Bypass Graft (CABG) surgery are substantial contributors to healthcare costs. Many predictive models have been developed to identify risk factors for readmissions. However, the majority of the existing models use statistical analysis techniques with data available at discharge. We propose an ensembled model to predict CABG readmissions using pre-discharge perioperative data and machine learning survival analysis techniques. First, we screened fifty-one potential readmission risk variables with univariate Cox Proportional Hazard (CPH) survival regression. Fourteen of them were significant (p-value < 0.05) contributors to readmissions. Subsequently, we applied these 14 predictors to a multivariate CPH model and to DeepSurv, a deep learning neural network (NN) representation of the CPH model. We validated this new ensembled model with 453 isolated adult CABG cases. Nine of the fourteen perioperative risk variables were identified as the most significant, with Hazard Ratios (HR) greater than 1.0. The concordance index metrics for the CPH, DeepSurv, and ensembled models were then evaluated with training and validation datasets. Our ensembled model yielded promising results in terms of c-statistics as we raised the number of iterations and data set sizes. All-cause 30-day readmissions among isolated CABG patients can be predicted more effectively with perioperative pre-discharge data, using machine learning survival analysis techniques. Prediction accuracy levels could be improved further with deep learning algorithms.
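The concordance index (c-statistic) used to compare survival models can be illustrated with a short sketch of Harrell's C. It assumes right-censored data, treats a pair as comparable only when the earlier time is an observed event, and ignores tied times; the function name and the toy values are hypothetical and are not the study's data.

```python
def concordance_index(times, events, risk):
    """Harrell's C: the share of comparable pairs in which the subject with
    the earlier observed event was given the higher risk score.

    A pair (i, j) is comparable when subject i had an event and
    times[i] < times[j]. Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (assumption): days to readmission, censored at 30 days.
times = [5, 10, 15, 30, 30]
events = [1, 1, 0, 0, 1]            # 1 = readmitted, 0 = censored
risk = [0.9, 0.4, 0.5, 0.1, 0.2]    # higher = predicted earlier readmission
c_index = concordance_index(times, events, risk)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of readmission times by predicted risk.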
Anomalies in healthcare claims data can be indicative of possible fraudulent activities, which account for a significant portion of overall healthcare costs. Medicare is a large government-run healthcare program that serves the needs of the elderly in the United States. The increasing elderly population and their reliance on the Medicare program create an environment with rising costs and increased risk of fraud. The detection of these potentially fraudulent activities can recover costs and lessen the overall impact of fraud on the Medicare program. In this paper, we propose a new method to detect fraud by discovering outliers, or anomalies, in payments made to Medicare providers. We employ a multivariate outlier detection method split into two parts. In the first part, we create a multivariate regression model and generate corresponding residuals. In the second part, these residuals are used as inputs into a generalizable univariate probability model. We create this Bayesian probability model using probabilistic programming. Our results indicate that our model is robust and less dependent on underlying data distributions than Mahalanobis distance. Moreover, we are able to demonstrate successful anomaly detection within Medicare specialties, providing meaningful results for further investigation.
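The two-part structure described above (fit a regression, then score its residuals with a univariate model) can be sketched as follows. To stay self-contained, this sketch makes two simplifying substitutions that are not the paper's method: it uses least squares on a single predictor instead of a multivariate regression, and a median/MAD robust z-score in place of the Bayesian probability model. All names, the threshold of 3.5, and the toy payments are hypothetical.

```python
import statistics


def ols_residuals(x, y):
    """Part 1: fit y = slope * x + intercept and return the residuals."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


def robust_z(values):
    """Part 2 (stand-in): score each residual by its distance from the
    median, scaled by the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined")
    # 0.6745 rescales the MAD to match a standard deviation under normality.
    return [0.6745 * (v - med) / mad for v in values]


# Toy data (assumption): payment grows with service count, and one provider
# (row 12) receives an inflated payment.
services = [float(i + 1) for i in range(20)]
payments = [100.0 * s + ((i * 7) % 5 - 2) for i, s in enumerate(services)]
payments[12] += 500.0

scores = robust_z(ols_residuals(services, payments))
flagged = [i for i, z in enumerate(scores) if abs(z) > 3.5]
print(flagged)
```

Scoring residuals rather than raw payments means a provider is flagged for being unusual relative to what the regression expects, not simply for billing a large amount.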