Known finite-sample concentration bounds for the Wasserstein distance between the empirical and true distribution of a random variable are used to derive a two-sided concentration bound for the error between the true conditional value-at-risk (CVaR) of a (possibly unbounded) random variable and a standard estimate of its CVaR computed from an i.i.d. The bound applies under fairly general assumptions on the random variable, and improves upon previous bounds which were either one sided, or applied only to bounded random variables. Specializations of the bound to sub-Gaussian and sub-exponential random variables are also derived. A similar procedure is followed to derive concentration bounds for the error between the true and estimated Cumulative Prospect Theory (CPT) value of a random variable, in cases where the random variable is bounded or sub-Gaussian. These bounds are shown to match a known bound in the bounded case, and improve upon the known bound in the sub-Gaussian case.
Computer simulations often involve both qualitative and numerical inputs. Existing Gaussian process (GP) methods for handling this mainly assume a different response surface for each combination of levels of the qualitative factors and relate them via a multiresponse cross-covariance matrix. We introduce a substantially different approach that maps each qualitative factor to an underlying numerical latent variable (LV), with the mapped value for each level estimated similarly to the covariance lengthscale parameters. This provides a parsimonious GP parameterization that treats qualitative factors the same as numerical variables and views them as effecting the response via similar physical mechanisms. This has strong physical justification, as the effects of a qualitative factor in any physics-based simulation model must always be due to some underlying numerical variables. Even when the underlying variables are many, sufficient dimension reduction arguments imply that their effects can be represented by a low-dimensional LV. This conjecture is supported by the superior predictive performance observed across a variety of examples. Moreover, the mapped LVs provide substantial insight into the nature and effects of the qualitative factors.
We argue that second-order Markov logic is ideally suited for this purpose and propose an approach based on it. Our algorithm discovers structural regularities in the source domain in the form of Markov logic formulas with predicate variables and instantiates these formulas with predicates from the target domain. Our approach has successfully transferred learned knowledge among molecular biology, web, and social network domains. For example, Wall Street firms often hire physicists to solve finance problems. Even though these two domains have superficially nothing in common, training as a physicist provides knowledge and skills that are highly applicable in finance (for example, solving differential equations and performing Monte Carlo simulations).
Machine learning algorithms are increasingly used to make decisions around assessing employee performance and turnover, identifying and preventing recidivism, and assessing job suitability. These algorithms are cheap to scale and sometimes can be cheap to develop in an environment filled with good quality research and tooling. Unfortunately, one thing these algorithms don't prevent and automate is how to structure your data and training pipeline in such a way that it does not lead to bias and negative self-reinforcing loops. This article will cover solutions for faithfully removing bias in your algorithms using bias-aware rather than blindless approaches. Predictions from an algorithm are generally not used in a vacuum.
In any learning task, it is natural to incorporate latent or hidden variables which are not directly observed. For instance, in a social network, we can observe interactions among the actors, but not their hidden interests/intents, in gene networks, we can measure gene expression levels but not the detailed regulatory mechanisms, and so on. I will present a broad framework for unsupervised learning of latent variable models, addressing both statistical and computational concerns. We show that higher order relationships among observed variables have a low rank representation under natural statistical constraints such as conditional-independence relationships. These findings have implications in a number of settings such as finding hidden communities in networks, discovering topics in text documents and learning about gene regulation in computational biology.