This dissertation presents several new methods of supervised and unsupervised learning of word sense disambiguation models. The supervised methods focus on performing model searches through a space of probabilistic models, and the unsupervised methods rely on the use of Gibbs Sampling and the Expectation Maximization (EM) algorithm. In both the supervised and unsupervised case, the Naive Bayesian model is found to perform well. An explanation for this success is presented in terms of learning rates and bias-variance decompositions.
Machine learning models often excel in the accuracy of their predictions but are opaque due to their non-linear and non-parametric structure. This makes statistical inference challenging and disqualifies them from many applications where model interpretability is crucial. This paper proposes the Shapley regression framework as an approach for statistical inference on non-linear or non-parametric models. Inference is performed based on the Shapley value decomposition of a model, a pay-off concept from cooperative game theory. I show that universal approximators from machine learning are estimation consistent and introduce hypothesis tests for individual variable contributions, model bias and parametric functional forms. The inference properties of state-of-the-art machine learning models - like artificial neural networks, support vector machines and random forests - are investigated using numerical simulations and real-world data. The proposed framework is unique in the sense that it is identical to the conventional case of statistical inference on a linear model if the model is linear in parameters. This makes it a well-motivated extension to more general models and strengthens the case for the use of machine learning to inform decisions.
Artificial Intelligence and its inherent bias may not be as judgmental as previously thought, at least in the case of home loans. It appears the use of algorithms for online mortgage lending can reduce discrimination against certain groups, including minorities, according to a recent study from the National Bureau of Economic Research. This could end up becoming the main tool in closing the racial wealth gap, especially as banks start using AI for lending decisions. The Breakdown You Need to Know: The study found that in person mortgage lenders typically reject minority applicants at a rate 6% higher than those with comparable economic backgrounds. However, when the application was online and involved an algorithm to make the decision, the acceptance and rejection rates were the same.
The distribution of health care payments to insurance plans has substantial consequences for social policy. Risk adjustment formulas predict spending in health insurance markets in order to provide fair benefits and health care coverage for all enrollees, regardless of their health status. Unfortunately, current risk adjustment formulas are known to undercompensate payments to health insurers for specific groups of enrollees (by underpredicting their spending). Much of the existing algorithmic fairness literature for group fairness to date has focused on classifiers and binary outcomes. To improve risk adjustment formulas for undercompensated groups, we expand on concepts from the statistics, computer science, and health economics literature to develop new fair regression methods for continuous outcomes by building fairness considerations directly into the objective function. We additionally propose a novel measure of fairness while asserting that a suite of metrics is necessary in order to evaluate risk adjustment formulas more fully. Our data application using the IBM MarketScan Research Databases and simulation studies demonstrate that these new fair regression methods may lead to massive improvements in group fairness with only small reductions in overall fit.
Generative Adversarial Net (GAN) has been proven to be a powerful machine learning tool in image data analysis and generation . In this paper, we propose to use Conditional Generative Adversarial Net (CGAN)  to learn and simulate time series data. The conditions can be both categorical and continuous variables containing different kinds of auxiliary information. Our simulation studies show that CGAN is able to learn different kinds of normal and heavy tail distributions, as well as dependent structures of different time series and it can further generate conditional predictive distributions consistent with the training data distributions. We also provide an in-depth discussion on the rationale of GAN and the neural network as hierarchical splines to draw a clear connection with the existing statistical method for distribution generation. In practice, CGAN has a wide range of applications in the market risk and counterparty risk analysis: it can be applied to learn the historical data and generate scenarios for the calculation of Value-at-Risk (VaR) and Expected Shortfall (ES) and predict the movement of the market risk factors. We present a real data analysis including a backtesting to demonstrate CGAN is able to outperform the Historic Simulation, a popular method in market risk analysis for the calculation of VaR. CGAN can also be applied in the economic time series modeling and forecasting, and an example of hypothetical shock analysis for economic models and the generation of potential CCAR scenarios by CGAN is given at the end of the paper.