Markov chains in random environment with applications in queueing theory and machine learning

arXiv.org Machine Learning

We prove the existence of limiting distributions for a large class of Markov chains on a general state space in a random environment. We assume suitable versions of the standard drift and minorization conditions. In particular, the system dynamics should be contractive on the average with respect to the Lyapunov function and large enough small sets should exist with large enough minorization constants. We also establish that a law of large numbers holds for bounded functionals of the process. Applications to queuing systems and to machine learning algorithms are presented.
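For orientation, the classical fixed-environment forms of the two conditions named in the abstract can be written as below. This is only a sketch of the textbook versions: the symbols $P$, $V$, $\gamma$, $K$, $R$, $C$, $\alpha$ and $\nu$ are notation assumed here, and the paper's own conditions, in which the constants may depend on the environment, may be stated differently.

```latex
% Drift (Lyapunov) condition: the kernel P contracts V up to an additive constant.
PV(x) := \int V(z)\, P(x, \mathrm{d}z) \;\le\; \gamma\, V(x) + K,
\qquad 0 \le \gamma < 1,\quad K < \infty .

% Minorization condition on the small set C = \{ x : V(x) \le R \}:
P(x, A) \;\ge\; \alpha\, \nu(A)
\qquad \text{for all } x \in C \text{ and all measurable } A,
\quad \alpha \in (0, 1],\ \nu \text{ a probability measure}.
```

In a random environment one would expect $\gamma$, $K$ and $\alpha$ to become functions of the current environment state, with contraction required only on average, which is how the abstract describes it.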


Stronger Convergence Results for Deep Residual Networks: Network Width Scales Linearly with Training Data Size

arXiv.org Machine Learning

Deep neural networks have achieved remarkable success across a large variety of applications, including computer vision [1], natural language processing [2], speech recognition [3] and the game of Go [4]. However, the reason why deep networks perform well across such varied tasks is still not well understood. The optimization performance of deep networks is one of the subjects that requires an involved theoretical study, given that gradient descent can achieve zero training loss even for random labels [5] and that the loss of deep networks is highly non-convex. Different lines of work investigate the optimization of deep networks from different perspectives. For example, a large number of works consider the optimization landscape corresponding to different activation functions [6-11], whereas others [12-15] ensure global convergence by imposing restrictions on the input distribution. In recent years, a considerable number of papers have provided convergence guarantees for over-parameterized two-layer and deep networks. It is shown in [16] that gradient descent can find near-global minima of a single-hidden-layer network in polynomial time with respect to the accuracy and sample size.


Systematic Comparison of the Influence of Different Data Preprocessing Methods on the Classification of Gait Using Machine Learning

arXiv.org Machine Learning

Human movements are characterized by highly non-linear and multi-dimensional interactions within the motor system. Recently, an increasing emphasis on machine-learning applications has led to a significant contribution to the field of gait analysis, e.g. by increasing the classification accuracy. To ensure the generalizability of the machine-learning models, different data preprocessing steps are usually carried out on the measured raw data before classification. In the past, various methods have been used for each of these preprocessing steps. However, there are hardly any standard procedures or systematic comparisons of these different methods and their impact on the classification accuracy. Therefore, the aim of this analysis is to compare different combinations of commonly applied data preprocessing steps and test their effects on the classification accuracy of gait patterns. A publicly available dataset on intra-individual changes of gait patterns was used for this analysis. Forty-two healthy subjects performed 6 sessions of 15 gait trials in one day. For each trial, two force plates recorded the 3D ground reaction forces (GRF). The data was preprocessed with the following steps: GRF filtering, time derivative, time normalization, data reduction, weight normalization and data scaling. Subsequently, combinations of all methods from each individual preprocessing step were analyzed and compared with respect to their prediction accuracy in a six-session classification using Support Vector Machines, Random Forest Classifiers and Multi-Layer Perceptrons. In conclusion, the present results provide first domain-specific recommendations for commonly applied data preprocessing methods and might help to build more comparable and more robust classification models based on machine learning that are suitable for practical application.
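Three of the preprocessing steps listed above (time normalization, weight normalization and data scaling) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the choice of linear interpolation, 101 output samples and min-max scaling, as well as all function and parameter names, are assumptions made here for the example.

```python
def time_normalize(signal, n_points=101):
    """Resample a 1-D signal to n_points samples by linear interpolation."""
    if len(signal) < 2 or n_points < 2:
        raise ValueError("need at least two input samples and two output points")
    last = len(signal) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)   # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def weight_normalize(signal, body_weight_newton):
    """Express a force signal in multiples of body weight."""
    return [v / body_weight_newton for v in signal]


def min_max_scale(signal):
    """Scale a signal linearly to the range [0, 1]."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]      # constant signal: nothing to scale
    return [(v - lo) / (hi - lo) for v in signal]


def preprocess(signal, body_weight_newton, n_points=101):
    """Apply the three steps in one (assumed) order."""
    resampled = time_normalize(signal, n_points)
    normalized = weight_normalize(resampled, body_weight_newton)
    return min_max_scale(normalized)


# Hypothetical vertical GRF samples (in newtons) for one stance phase.
features = preprocess([0.0, 400.0, 800.0, 600.0, 0.0], body_weight_newton=800.0)
```

Swapping any single function for an alternative (for example z-scoring instead of min-max scaling) is the kind of variation the study compares across classifiers.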


Kernel Dependence Regularizers and Gaussian Processes with Applications to Algorithmic Fairness

arXiv.org Machine Learning

Current adoption of machine learning in industrial, societal and economical activities has raised concerns about the fairness, equity and ethics of automated decisions. Predictive models are often developed using biased datasets and thus retain or even exacerbate biases in their decisions and recommendations. Removing the sensitive covariates, such as gender or race, is insufficient to remedy this issue since the biases may be retained due to other related covariates. We present a regularization approach to this problem that trades off predictive accuracy of the learned models (with respect to biased labels) for the fairness in terms of statistical parity, i.e. independence of the decisions from the sensitive covariates. In particular, we consider a general framework of regularized empirical risk minimization over reproducing kernel Hilbert spaces and impose an additional regularizer of dependence between predictors and sensitive covariates using kernel-based measures of dependence, namely the Hilbert-Schmidt Independence Criterion (HSIC) and its normalized version. This approach leads to a closed-form solution in the case of squared loss, i.e. ridge regression. Moreover, we show that the dependence regularizer has an interpretation as modifying the corresponding Gaussian process (GP) prior. As a consequence, a GP model with a prior that encourages fairness to sensitive variables can be derived, allowing principled hyperparameter selection and studying of the relative relevance of covariates under fairness constraints. Experimental results in synthetic examples and in real problems of income and crime prediction illustrate the potential of the approach to improve fairness of automated decisions.
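The dependence measure used as the regularizer can be illustrated with the standard biased empirical HSIC estimator, $\operatorname{tr}(KHLH)/(n-1)^2$, where $K$ and $L$ are Gram matrices and $H$ is the centering matrix. The sketch below assumes scalar inputs and a Gaussian kernel with a fixed bandwidth; the function names and the toy data are hypothetical, and the paper's exact estimator and normalization may differ.

```python
import math


def rbf_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on scalar inputs."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-center a Gram matrix: returns H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def hsic_biased(xs, ys, sigma=1.0):
    """Biased empirical HSIC, trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(rbf_gram(xs, sigma))
    L = rbf_gram(ys, sigma)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace.
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Hypothetical sensitive attribute and two sets of model predictions.
sensitive = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
dependent_predictions = list(sensitive)   # predictions that copy the attribute
constant_predictions = [0.5] * 6          # predictions that ignore it

penalty_dependent = hsic_biased(dependent_predictions, sensitive)
penalty_constant = hsic_biased(constant_predictions, sensitive)
```

Adding a multiple of such a penalty to the empirical risk discourages predictors whose outputs vary with the sensitive covariate, which is the trade-off the abstract describes.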


Learning The Best Expert Efficiently

arXiv.org Machine Learning

We consider online learning problems where the aim is to achieve regret which is efficient in the sense that it is the same order as the lowest regret amongst K experts. This is a substantially stronger requirement than achieving $O(\sqrt{n})$ or $O(\log n)$ regret with respect to the best expert, and standard algorithms are insufficient, even in easy cases where the regrets of the available actions are very different from one another. We show that a particular lazy form of the online subgradient algorithm can be used to achieve minimal regret in a number of "easy" regimes while retaining an $O(\sqrt{n})$ worst-case regret guarantee. We also show that for certain classes of problem minimal regret strategies exist for some of the remaining "hard" regimes.


Rethinking Generalisation

arXiv.org Machine Learning

Vision, Learning and Control, University of Southampton, Southampton, UK

Abstract

In this paper, we present a new approach to computing the generalisation performance assuming that the distribution of risks, ρ(r), for a learning scenario is known. This allows us to compute the expected error of a learning machine using empirical risk minimisation. We show that it is possible to obtain results for both classification and regression. We show a critical quantity in determining the generalisation performance is the power-law behaviour of ρ(r) around its minimum value. We compute ρ(r) for the case of all Boolean functions and for the perceptron. We start with a simplistic analysis but then do a more formal one later on. We show that the simplistic results are qualitatively correct and provide a good approximation to the actual results if we replace the true training set size with an approximate training set size.

Keywords: Generalisation, Learning Theory

1. Introduction

Traditional computational learning theory aims to eliminate all rules that do not correctly explain the data. A rule can be thought of as a fixed set of parameters of a learning machine; more formally, a hypothesis. This process relies on the idea that rules with poor generalisation performance (high risk) will, with high probability, make errors on a sufficiently large randomly chosen training data set (Vapnik and Chervonenkis, 1971; Valiant, 1984; Baum and Haussler, 1989; Blumer et al., 1989; Haussler, 1992; Vapnik, 1992). Suppose there exists a mechanism for selecting a rule from the subset of rules that have the lowest errors on the training set. Then, there is a very small probability that any of the selected rules has a high risk. However, this crucially depends on there being effectively a finite number of hypotheses, otherwise, there could still be a high-risk set of parameters which by chance did well on the particular training set.

In the case where the learning machine has a continuous parameter space (so that the dimensionality of the space is uncountably infinite), we consider the effective size of the hypothesis space to be the Vapnik-Chervonenkis (VC) dimension. The VC dimension measures the number of possible ways in which the machine can give different outputs to a finite number of training examples (Vapnik and Chervonenkis, 1971). This effective size or capacity lies at the heart of conventional computational learning theory. By limiting the capacity we can obtain stronger bounds on the generalisation performance. In this paper, we challenge this traditional approach.


Error bound of local minima and KL property of exponent 1/2 for squared F-norm regularized factorization

arXiv.org Machine Learning

This paper is concerned with the squared F(robenius)-norm regularized factorization form for noisy low-rank matrix recovery problems. Under a suitable assumption on the restricted condition number of the Hessian matrix of the loss function, we establish an error bound to the true matrix for those local minima whose ranks are not more than the rank of the true matrix. Then, for the least squares loss function, we achieve the KL property of exponent 1/2 for the F-norm regularized factorization function over its global minimum set under a restricted strong convexity assumption. These theoretical findings are also confirmed by applying an accelerated alternating minimization method to the F-norm regularized factorization problem.
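As a reading aid, the two objects named in the abstract are commonly written as below. The notation ($f$, $U$, $V$, $r$, $\lambda$, $\Phi_\lambda$, $z$, $c$, $\varepsilon$, $\eta$) and the factor $\lambda/2$ are assumptions made here; the paper's own scaling and symbols may differ.

```latex
% Squared F-norm regularized factorization of a low-rank recovery problem
% with loss f and factors U, V of rank at most r:
\min_{U \in \mathbb{R}^{n \times r},\; V \in \mathbb{R}^{m \times r}}
  \Phi_\lambda(U, V) := f\!\left(U V^{\top}\right)
  + \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right).

% KL property of exponent 1/2 at a point \bar{z} = (\bar{U}, \bar{V}):
% there exist c, \varepsilon, \eta > 0 such that
\operatorname{dist}\bigl(0, \partial \Phi_\lambda(z)\bigr)
  \;\ge\; c \,\bigl( \Phi_\lambda(z) - \Phi_\lambda(\bar{z}) \bigr)^{1/2}
\quad \text{whenever } \|z - \bar{z}\| \le \varepsilon
\text{ and } \Phi_\lambda(\bar{z}) < \Phi_\lambda(z) < \Phi_\lambda(\bar{z}) + \eta .
```

An exponent of 1/2 is the case typically associated with local linear convergence of first-order methods, which is why it is a property worth establishing.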


Machine Learning-Based Adaptive Receive Filtering: Proof-of-Concept on an SDR Platform

arXiv.org Machine Learning

Conventional multiuser detection techniques either require a large number of antennas at the receiver for a desired performance, or they are too complex for practical implementation. Moreover, many of these techniques, such as successive interference cancellation (SIC), suffer from errors in parameter estimation (user channels, covariance matrix, noise variance, etc.) that is performed before detection of user data symbols. As an alternative to conventional methods, this paper proposes and demonstrates a low-complexity practical Machine Learning (ML) based receiver that achieves similar (and at times better) performance to the SIC receiver. The proposed receiver does not require parameter estimation; instead it uses supervised learning to detect the user modulation symbols directly. We perform comparisons with minimum mean square error (MMSE) and SIC receivers in terms of symbol error rate (SER) and complexity.


Self-training with Noisy Student improves ImageNet classification

arXiv.org Machine Learning

We present a simple self-training method that achieves 87.4% top-1 accuracy on ImageNet, which is 1.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 16.6% to 74.2%, reduces ImageNet-C mean corruption error from 45.7 to 31.2, and reduces ImageNet-P mean flip rate from 27.8 to 16.1. To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as good as possible. But during the learning of the student, we inject noise such as data augmentation, dropout, stochastic depth to the student so that the noised student is forced to learn harder from the pseudo labels.
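The teacher-student loop described above can be sketched on a toy problem. This is an illustration of the control flow only, under strong simplifying assumptions: a one-dimensional nearest-centroid classifier stands in for EfficientNet, bounded uniform input jitter stands in for data augmentation, dropout and stochastic depth, and the student is not larger than the teacher. All names are hypothetical.

```python
import random


def fit_centroids(xs, ys):
    """Train a nearest-centroid classifier: one mean per class label."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def noisy_student(labeled_x, labeled_y, unlabeled_x, rounds=3, jitter=0.1, seed=0):
    """Iterate: teacher pseudo-labels without noise, student trains with noise."""
    rng = random.Random(seed)
    teacher = fit_centroids(labeled_x, labeled_y)      # initial teacher: labeled data only
    for _ in range(rounds):
        # The teacher is not noised when generating pseudo labels.
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        train_x = labeled_x + unlabeled_x
        train_y = labeled_y + pseudo_y
        # The student sees noised inputs (assumed stand-in for the paper's noise).
        noisy_x = [x + rng.uniform(-jitter, jitter) for x in train_x]
        student = fit_centroids(noisy_x, train_y)
        teacher = student                              # put the student back as the teacher
    return teacher


model = noisy_student(
    labeled_x=[0.0, 1.0],
    labeled_y=[0, 1],
    unlabeled_x=[-0.2, 0.1, 0.2, 0.8, 0.9, 1.2],
)
```

The sketch keeps the asymmetry that the abstract emphasizes: noise is applied only on the student's side of each round.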


Practical Federated Gradient Boosting Decision Trees

arXiv.org Machine Learning

Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from the inefficiency due to the usage of costly data transformations such as secure sharing and homomorphic encryption, or from the low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original record to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each owner, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.