With the overall increase in the elderly population comes additional, necessary medical needs and costs. Medicare is a U.S. healthcare program that provides insurance, primarily to individuals 65 years or older, to offload some of the financial burden associated with medical care. Even so, healthcare costs are high and continue to increase. Fraud is a major contributor to these inflating healthcare expenses. Our paper provides a comprehensive study leveraging machine learning methods to detect fraudulent Medicare providers. We use publicly available Medicare data and provider exclusions as fraud labels to build and assess three different learners. To lessen the impact of class imbalance, given so few actual fraud labels, we employ random undersampling, creating four class distributions. Our results show that the C4.5 decision tree and logistic regression learners have the best fraud detection performance, particularly for the 80:20 class distribution, with average AUC scores of 0.883 and 0.882, respectively, and low false negative rates. We successfully demonstrate the efficacy of employing machine learning with random undersampling to detect Medicare fraud.
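As a minimal sketch of the undersampling step described above, the snippet below randomly drops majority-class (non-fraud) rows until an 80:20 negative:positive distribution is reached, then fits a logistic regression and scores AUC on a held-out set. The synthetic data, the `undersample` helper, and all parameter values are illustrative assumptions, not the paper's actual pipeline or Medicare features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for claims features: a rare positive (fraud) class
# at roughly 3% prevalence.
n = 20000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 2.8).astype(int)

def undersample(X, y, majority_ratio, rng):
    """Randomly drop majority-class rows until the class distribution is
    majority_ratio : (1 - majority_ratio), e.g. 0.8 for an 80:20 split."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(len(pos) * majority_ratio / (1 - majority_ratio))
    keep = np.concatenate(
        [pos, rng.choice(neg, size=min(n_neg, len(neg)), replace=False)]
    )
    rng.shuffle(keep)
    return X[keep], y[keep]

# Undersample only the training split; evaluate on the untouched test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_bal, y_bal = undersample(X_tr, y_tr, majority_ratio=0.8, rng=rng)
clf = LogisticRegression().fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Note that the test split keeps its natural imbalance; only the training data is rebalanced, mirroring the evaluation setup implied by the paper's class-distribution experiments.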
Databases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naive modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. We consider a limiting case of generalized linear modeling in which the target variables are known only up to permutation, and explore how this relates to permutation testing, a standard technique for assessing statistical dependence. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual-level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins, with order statistics such as the median as a limiting case. Our experimental results on simulated data and aggregated healthcare data suggest a diminishing returns property with respect to the granularity of the histogram: when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.
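The alternating scheme for the limiting permutation case can be sketched as follows, on synthetic data. For a fixed linear model, the L2-optimal imputation pairs the i-th smallest bagged target with the i-th smallest prediction (rearrangement inequality), so each sweep cannot increase the residual objective; like any alternating minimization it can still settle at a local optimum depending on initialization. The data and `alternate_fit` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: y = X w + noise, but we observe only the *multiset* of
# target values (their assignment to rows is unknown).
n, d = 400, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y_bag = np.sort(X @ w_true + 0.1 * rng.normal(size=n))

def alternate_fit(X, y_bag, n_iter=30, seed=0):
    """Alternate (1) imputing an assignment of the bagged targets by
    sorted matching against current predictions and (2) refitting
    ordinary least squares on the imputed pairs."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    objectives = []
    for _ in range(n_iter):
        pred = X @ w
        y_imp = np.empty_like(y_bag)
        y_imp[np.argsort(pred)] = y_bag      # i-th smallest target -> i-th smallest prediction
        w, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
        objectives.append(float(np.sum((X @ w - y_imp) ** 2)))
    return w, y_imp, objectives

w_hat, y_imp, objectives = alternate_fit(X, y_bag)
```

With finer histogram bins, each row's imputed value would additionally be constrained to its bin, which is what makes coarser aggregates progressively less informative in the paper's diminishing-returns observation.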
We develop a model using deep learning techniques and natural language processing on unstructured text from medical records to predict hospital-wide 30-day unplanned readmission, with a c-statistic of 0.70. Our model is constructed to allow physicians to interpret the features most significant for prediction.
Hospital readmissions within 30 days of discharge following Coronary Artery Bypass Graft (CABG) surgery are substantial contributors to healthcare costs. Many predictive models have been developed to identify risk factors for readmission; however, the majority of existing models apply statistical analysis techniques to data available at discharge. We propose an ensemble model to predict CABG readmissions using pre-discharge perioperative data and machine learning survival analysis techniques. First, we screened fifty-one potential readmission risk variables with univariate Cox Proportional Hazards (CPH) survival regression. Fourteen of them were significant contributors to readmission (p < 0.05). We then applied these 14 predictors to a multivariate CPH model and to DeepSurv, a deep learning neural network (NN) representation of the CPH model. We validated this new ensemble model with 453 isolated adult CABG cases. Nine of the fourteen perioperative risk variables were identified as the most significant, with hazard ratios (HR) greater than 1.0. The concordance index metrics for the CPH, DeepSurv, and ensemble models were then evaluated on the training and validation datasets. Our ensemble model yielded promising results in terms of c-statistics as we increased the number of iterations and the dataset sizes. Thirty-day all-cause readmissions among isolated CABG patients can be predicted more effectively with perioperative pre-discharge data using machine learning survival analysis techniques, and prediction accuracy could be improved further with deep learning algorithms.
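The univariate screening step above (fit a single-covariate Cox model, keep predictors with Wald p < 0.05) can be sketched as below. This is a from-scratch Newton-Raphson fit of the Cox partial likelihood on synthetic data, shown only to make the workflow concrete; the `cox_univariate` helper, the simulated "perioperative" covariates, and the exponential baseline hazard are all assumptions, not the paper's data or software.

```python
import numpy as np
from math import erfc, sqrt

def cox_univariate(x, time, event, n_iter=20):
    """Newton-Raphson on the Cox partial log-likelihood (Breslow baseline,
    continuous times so no tie handling) for one covariate.
    Returns (log hazard ratio, standard error, two-sided Wald p-value)."""
    order = np.argsort(-time)            # sort descending: risk sets become prefixes
    x, event = x[order], event[order].astype(float)
    b = 0.0
    for _ in range(n_iter):
        r = np.exp(b * x)
        s0 = np.cumsum(r)                # sum of exp(b x) over each risk set
        s1 = np.cumsum(r * x)
        s2 = np.cumsum(r * x * x)
        m = s1 / s0                      # risk-set weighted mean of x
        grad = np.sum(event * (x - m))
        hess = -np.sum(event * (s2 / s0 - m * m))
        b -= grad / hess
    se = 1.0 / np.sqrt(-hess)
    return b, se, erfc(abs(b) / se / sqrt(2.0))

# Synthetic data: covariate 0 carries a real effect (true log HR = 1),
# covariates 1-2 are noise; censoring times are independent.
rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 3))
t_event = rng.exponential(1.0 / np.exp(X[:, 0]))   # Cox model, exponential baseline
t_cens = rng.exponential(5.0, size=n)
time = np.minimum(t_event, t_cens)
event = t_event <= t_cens

# Univariate screening: keep covariates with Wald p < 0.05, as in the
# paper's first stage (before the multivariate CPH / DeepSurv fits).
selected = [j for j in range(3) if cox_univariate(X[:, j], time, event)[2] < 0.05]
```

In practice one would use an established survival library (e.g. lifelines or R's `survival`) rather than a hand-rolled Newton solver; the sketch exists to show what "univariate CPH significance at p < 0.05" computes.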
Using a highly sophisticated form of pattern matching, researchers from Florida Atlantic University's College of Engineering and Computer Science are teaching "machines" to detect Medicare fraud. An estimated $19 billion to $65 billion is lost every year to Medicare fraud, waste, or abuse. Like the proverbial "needle in a haystack," human auditors and investigators face the painstaking task of manually checking thousands of Medicare claims for specific patterns that could indicate foul play or fraudulent behavior. Furthermore, according to the U.S. Department of Justice, fraud enforcement efforts currently rely heavily on healthcare professionals coming forward with information about Medicare fraud. "The Effects of Varying Class Distribution on Learner Behavior for Medicare Fraud Detection With Imbalanced Big Data," published in the journal Health Information Science and Systems, uses big data from Medicare Part B and employs advanced data analytics and machine learning to automate the fraud detection process.