Goto

Collaborating Authors

 Country


To Split or Not to Split: The Impact of Disparate Treatment in Classification

arXiv.org Machine Learning

Disparate treatment occurs when a machine learning model produces different decisions for groups defined by a legally protected or sensitive attribute (e.g., race, gender). In domains where prediction accuracy is paramount, it is acceptable to fit a model which exhibits disparate treatment. We explore the effect of splitting classifiers (i.e., training and deploying a separate classifier on each group) and derive an information-theoretic impossibility result: there exists precise conditions where a group-blind classifier will always have a non-trivial performance gap from the split classifiers. We further demonstrate that, in the finite sample regime, splitting is no longer always beneficial and relies on the number of samples from each group and the complexity of the hypothesis class. We provide data-dependent bounds for understanding the effect of splitting and illustrate these bounds on real-world datasets.


Revisiting Fixed Support Wasserstein Barycenter: Computational Hardness and Efficient Algorithms

arXiv.org Machine Learning

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the linear programming (LP) representation of the FS-WBP is totally unimodular when $m \geq 3$ and $n = 2$, but not totally unimodular when $m \geq 3$ and $n \geq 3$. This result answers an open problem, since it shows that the FS-WBP is not a minimum-cost flow problem and therefore cannot be solved efficiently using linear programming. Building on this negative result, we propose and analyze a simple and efficient variant of the iterative Bregman projection (IBP) algorithm, currently the most widely adopted algorithm to solve the FS-WBP. The algorithm is an accelerated IBP algorithm which achieves the complexity bound of $\widetilde{\mathcal{O}}(mn^{7/3}/\varepsilon)$. This bound is better than that obtained for the standard IBP algorithm---$\widetilde{\mathcal{O}}(mn^{2}/\varepsilon^2)$---in terms of $\varepsilon$, and that of accelerated primal-dual gradient algorithm---$\widetilde{\mathcal{O}}(mn^{5/2}/\varepsilon)$---in terms of $n$. Empirical studies on simulated datasets demonstrate that the acceleration promised by the theory is real in practice.


Deep Transfer Learning for Physiological Signals

arXiv.org Machine Learning

Deep learning is increasingly common in healthcare, yet transfer learning for physiological signals (e.g., temperature, heart rate, etc.) is under-explored. Here, we present a straightforward, yet performant framework for transferring knowledge about physiological signals. Our framework is called PHASE (PHysiologicAl Signal Embeddings). It i) learns deep embeddings of physiological signals and ii) predicts adverse outcomes based on the embeddings. PHASE is the first instance of deep transfer learning in a cross-hospital, cross-department setting for physiological signals. We show that PHASE's per-signal (one for each signal) LSTM embedding functions confer a number of benefits including improved performance, successful transference between hospitals, and lower computational cost.


Distribution-Agnostic Model-Agnostic Meta-Learning

arXiv.org Machine Learning

The Model-Agnostic Meta-Learning (MAML) algorithm \citep{finn2017model} has been celebrated for its efficiency and generality, as it has demonstrated success in quickly learning the parameters of an arbitrary learning model. However, MAML implicitly assumes that the tasks come from a particular distribution, and optimizes the expected (or sample average) loss over tasks drawn from this distribution. Here, we amend this limitation of MAML by reformulating the objective function as a min-max problem, where the maximization is over the set of possible distributions over tasks. Our proposed algorithm is the first distribution-agnostic and model-agnostic meta-learning method, and we show that it converges to an $\epsilon$-accurate point at the rate of $\mathcal{O}(1/\epsilon^2)$ in the convex setting and to an $(\epsilon, \delta)$-stationary point at the rate of $\mathcal{O}(\max\{1/\epsilon^5, 1/\delta^5\})$ in nonconvex settings. We also provide numerical experiments that demonstrate the worst-case superiority of our algorithm in comparison to MAML.


Understanding Global Loss Landscape of One-hidden-layer ReLU Neural Networks

arXiv.org Machine Learning

For one-hidden-layer ReLU networks, we show that all local minima are global in each differentiable region, and these local minima can be unique or continuous, depending on data, activation pattern of hidden neurons and network size. We give criteria to identify whether local minima lie inside their defining regions, and if so (we call them genuine differentiable local minima), their locations and loss values. Furthermore, we give necessary and sufficient conditions for the existence of saddle points as well as non-differentiable local minima. Finally, we compute the probability of getting stuck in genuine local minima for Gaussian input data and parallel weight vectors, and show that it is exponentially vanishing when the weights are located in regions where data are not too scarce. This may give a hint to the question why gradient-based local search methods usually do not get trapped in local minima when training deep ReLU neural networks.


Collaborative Inference for Efficient Remote Monitoring

arXiv.org Machine Learning

While current machine learning models have impressive performance over a wide range of applications, their large size and complexity render them unsuitable for tasks such as remote monitoring on edge devices with limited storage and computational power. A naive approach to resolve this on the model level is to use simpler architectures, but this sacrifices prediction accuracy and is unsuitable for monitoring applications requiring accurate detection of the onset of adverse events. In this paper, we propose an alternative solution to this problem by decomposing the predictive model as the sum of a simple function which serves as a local monitoring tool, and a complex correction term to be evaluated on the server. A sign requirement is imposed on the latter to ensure that the local monitoring function is safe, in the sense that it can effectively serve as an early warning system. Our analysis quantifies the trade-offs between model complexity and performance, and serves as a guidance for architecture design. We validate our proposed framework on a series of monitoring experiments, where we succeed at learning monitoring models with significantly reduced complexity that minimally violate the safety requirement. More broadly, our framework is useful for learning classifiers in applications where false negatives are significantly more costly compared to false positives.


On the Value of Target Data in Transfer Learning

arXiv.org Machine Learning

We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby, target data is usually harder or costlier to acquire than source data, but can yield better accuracy. To this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents. Interestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters were known a priori. Moreover, we show that practical decisions - w.r.t.


On Layer Normalization in the Transformer Architecture

arXiv.org Machine Learning

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.


Fast Geometric Projections for Local Robustness Certification

arXiv.org Machine Learning

Local robustness ensures that a model classifies all inputs within an $\epsilon$-ball consistently, which precludes various forms of adversarial inputs. In this paper, we present a fast procedure for checking local robustness in feed-forward neural networks with piecewise linear activation functions. The key insight is that such networks partition the input space into a polyhedral complex such that the network is linear inside each polyhedral region; hence, a systematic search for decision boundaries within the regions around a given input is sufficient for assessing robustness. Crucially, we show how these regions can be analyzed using geometric projections instead of expensive constraint solving, thus admitting an efficient, highly-parallel GPU implementation at the price of incompleteness, which can be addressed by falling back on prior approaches. Empirically, we find that incompleteness is not often an issue, and that our method performs one to two orders of magnitude faster than existing robustness-certification techniques based on constraint solving.


Generalized Bayesian Cram\'{e}r-Rao Inequality via Information Geometry of Relative $\alpha$-Entropy

arXiv.org Machine Learning

The relative $\alpha$-entropy is the R\'enyi analog of relative entropy and arises prominently in information-theoretic problems. Recent information geometric investigations on this quantity have enabled the generalization of the Cram\'{e}r-Rao inequality, which provides a lower bound for the variance of an estimator of an escort of the underlying parametric probability distribution. However, this framework remains unexamined in the Bayesian framework. In this paper, we propose a general Riemannian metric based on relative $\alpha$-entropy to obtain a generalized Bayesian Cram\'{e}r-Rao inequality. This establishes a lower bound for the variance of an unbiased estimator for the $\alpha$-escort distribution starting from an unbiased estimator for the underlying distribution. We show that in the limiting case when the entropy order approaches unity, this framework reduces to the conventional Bayesian Cram\'{e}r-Rao inequality. Further, in the absence of priors, the same framework yields the deterministic Cram\'{e}r-Rao inequality.