This dissertation presents several new methods of supervised and unsupervised learning of word sense disambiguation models. The supervised methods focus on performing model searches through a space of probabilistic models, and the unsupervised methods rely on the use of Gibbs Sampling and the Expectation Maximization (EM) algorithm. In both the supervised and unsupervised cases, the Naive Bayesian model is found to perform well. An explanation for this success is presented in terms of learning rates and bias-variance decompositions.
We investigate a long-debated question: how to create predictive models of recidivism that are sufficiently accurate, transparent, and interpretable to use for decision-making. This question is complicated because these models are used to support different decisions, from sentencing, to determining release on probation, to allocating preventative social services. Each use case might have an objective other than classification accuracy, such as a desired true positive rate (TPR) or false positive rate (FPR). Each (TPR, FPR) pair is a point on the receiver operating characteristic (ROC) curve. We use popular machine learning methods to create models along the full ROC curve on a wide range of recidivism prediction problems. We show that many methods (SVM, Ridge Regression) produce equally accurate models along the full ROC curve. However, methods that are designed for interpretability (CART, C5.0) cannot be tuned to produce models that are accurate and/or interpretable. To handle this shortcoming, we use a new method known as SLIM (Supersparse Linear Integer Models) to produce accurate, transparent, and interpretable models along the full ROC curve. These models can be used for decision-making in many different use cases, since they are just as accurate as the most powerful black-box machine learning models, yet completely transparent and highly interpretable.
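The statement that each (TPR, FPR) pair is a point on the ROC curve can be made concrete with a small sketch. The function name `roc_point` and the toy scores and labels below are illustrative assumptions, not data or code from the paper; the sketch only shows how one decision threshold on a risk score yields one operating point.

```python
from typing import Sequence, Tuple


def roc_point(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when predicting positive for score >= threshold.

    Hypothetical helper for illustration; labels are 1 (positive) or 0.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    # Guard against a class being absent from the sample.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy risk scores and outcomes (made up for this example).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Sweeping the threshold traces out points along the ROC curve.
curve = [roc_point(scores, labels, t) for t in (1.0, 0.5, 0.0)]
```

A use case that requires a particular FPR or TPR would pick the threshold (or, as in the paper, the model) whose operating point meets that requirement.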
Anomalies in healthcare claims data can indicate fraudulent activities, which contribute to a significant portion of overall healthcare costs. Medicare is a large government-run healthcare program that serves the needs of the elderly in the United States. The increasing elderly population and their reliance on the Medicare program create an environment with rising costs and increased risk of fraud. The detection of these potentially fraudulent activities can recover costs and lessen the overall impact of fraud on the Medicare program. In this paper, we propose a new method to detect fraud by discovering outliers, or anomalies, in payments made to Medicare providers. We employ a multivariate outlier detection method split into two parts. In the first part, we create a multivariate regression model and generate corresponding residuals. In the second part, these residuals are used as inputs into a generalizable univariate probability model. We create this Bayesian probability model using probabilistic programming. Our results indicate that our model is robust and less dependent on underlying data distributions than Mahalanobis distance. Moreover, we demonstrate successful anomaly detection within Medicare specialties, providing meaningful results for further investigation.
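The two-part structure described above (fit a regression, then score the residuals with a univariate model) can be sketched as follows. This is a simplified stand-in, not the paper's method: it assumes a single predictor with ordinary least squares in place of the multivariate regression, and it swaps the Bayesian probability model for a robust modified z-score based on the median absolute deviation. The function names, the toy data, and the 3.5 cutoff are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(x: Sequence[float], y: Sequence[float],
                  cutoff: float = 3.5) -> List[int]:
    """Return indices whose regression residual is anomalous.

    Part 1: regress y on x and compute residuals.
    Part 2: score each residual with a modified z-score (median and MAD),
            a simple substitute for the paper's Bayesian model.
    """
    slope, intercept = fit_line(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    median = statistics.median(residuals)
    mad = statistics.median([abs(r - median) for r in residuals])
    if mad == 0:
        return []  # residuals too uniform to score
    return [i for i, r in enumerate(residuals)
            if abs(0.6745 * (r - median) / mad) > cutoff]


# Toy data (made up): payments follow y = 2x except one inflated entry.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 40, 12, 14, 16, 18, 20]
flagged = flag_outliers(x, y)
```

Scoring residuals rather than raw payments means a provider is compared against what the regression expects for that provider, which is the point of splitting the method into two parts.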
Buller, Mark Jonathan (Brown University) | Sodomka, Eric (Brown University) | Tharion, William (United States Army Research Institute of Environmental Medicine) | Clements, Cynthia (United States Army Research Institute of Environmental Medicine) | Hoyt, Reed (United States Army Research Institute of Environmental Medicine) | Jenkins, Odest Chadwicke (Brown University)
Emergency workers engaged in strenuous work in hot environments risk overheating and mission failure. We describe a real-time application that would reduce these risks using a real-time thermal-work strain index (SI) estimator and a Markov Decision Process (MDP) to compute optimal work rate policies. We examined the thermo-physiological responses of 14 experienced U.S. Army Ranger students (26±4 years; 1.77±0.04 m; 78.3±7.3 kg) who participated in a strenuous 8-mile time-restricted pass/fail road march conducted under thermally stressful conditions. A thermoregulatory model was used to derive SI state transition probabilities and model the students’ observed and policy-driven movement rates. We found that policy end-state SI was significantly lower than SI when modeled using the students’ own movement rates (3.94±0.88 vs. 5.62±1.20, P<0.001). We also found an inverse relationship between our policy impact and maximum SI (r=0.64, P<0.05). These results suggest that modeling real-world missions as an MDP can provide optimal work rate policies that improve thermal safety and allow students to finish in a “fresher” state. Ultimately, SI state estimation and MDP models incorporated into wearable physiological monitoring systems could provide real-time work rate guidance, thus minimizing thermal work-strain while maximizing the likelihood of accomplishing mission tasks.
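The idea of computing a work rate policy from an MDP can be illustrated with a toy finite-horizon model solved by backward induction. Everything numeric here is a hypothetical assumption: the two paces, the SI transition probabilities, the failure cost, and the state layout are invented for the sketch, whereas the paper derives its transition probabilities from a thermoregulatory model.

```python
from functools import lru_cache
from typing import Optional, Tuple

SI_MAX = 10          # assumed upper bound on the strain index
FAIL_COST = 100.0    # assumed penalty for not finishing in time

# pace -> (distance covered per step, [(probability, change in SI), ...])
# All values are hypothetical, chosen only to make the example small.
ACTIONS = {
    "slow": (1, [(0.5, -1), (0.5, 0)]),
    "fast": (2, [(0.7, +1), (0.3, 0)]),
}


@lru_cache(maxsize=None)
def solve(steps_left: int, dist_left: int,
          si: int) -> Tuple[float, Optional[str]]:
    """Return (expected cost, best pace) for a state by backward induction.

    Cost is the SI at the moment the march is completed, or FAIL_COST
    if time runs out with distance remaining.
    """
    if dist_left <= 0:
        return float(si), None       # finished: cost is end-state SI
    if steps_left == 0:
        return FAIL_COST, None       # out of time: mission failure
    best: Optional[Tuple[float, str]] = None
    for pace, (speed, outcomes) in ACTIONS.items():
        expected = 0.0
        for prob, delta in outcomes:
            next_si = min(max(si + delta, 0), SI_MAX)
            next_dist = max(dist_left - speed, 0)
            expected += prob * solve(steps_left - 1, next_dist, next_si)[0]
        if best is None or expected < best[0]:
            best = (expected, pace)
    return best


# With slack time the policy prefers the slow pace; without it, the fast one.
relaxed = solve(2, 2, 3)
rushed = solve(1, 2, 3)
```

In this toy model the policy trades off finishing on time against end-state strain, which mirrors the trade-off described in the abstract at a much smaller scale.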
Twitter has been increasingly used for spreading messages about campaigns. Such campaigns try to gain followers through their Twitter accounts, influence the followers, and spread messages through them. In this paper, we explore the relationship between followers’ sentiment towards the campaign topic and their rate of retweeting messages generated by the campaign. Our analysis of followers of multiple social-media campaigns found statistically significant correlations between such sentiment and retweeting rate. Based on our analysis, we conducted an online intervention study among the followers of different social-media campaigns. Our study shows that targeting followers based on their sentiment towards the campaign can yield a higher retweet rate than a number of baseline approaches.