Gradient Descent
Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning
Islam, Riashat, Seraj, Raihan, Arnob, Samin Yeasar, Precup, Doina
We study the problem of off-policy critic evaluation in several variants of value-based off-policy actor-critic algorithms. Off-policy actor-critic algorithms require an off-policy critic evaluation step, to estimate the value of the new policy after every policy gradient update. Despite enormous success of off-policy policy gradients on control tasks, existing general methods suffer from high variance and instability, partly because the policy improvement depends on gradient of the estimated value function. In this work, we present a new way of off-policy policy evaluation in actor-critic, based on the doubly robust estimators. We extend the doubly robust estimator from off-policy policy evaluation (OPE) to actor-critic algorithms that consist of a reward estimator performance model. We find that doubly robust estimation of the critic can significantly improve performance in continuous control tasks. Furthermore, in cases where the reward function is stochastic that can lead to high variance, doubly robust critic estimation can improve performance under corrupted, stochastic reward signals, indicating its usefulness for robust and safe reinforcement learning.
Statistically Robust Neural Network Classification
Wang, Benjie, Webb, Stefan, Rainforth, Tom
Recently there has been much interest in quantifying the robustness of neural network classifiers through adversarial risk metrics. However, for problems where test-time corruptions occur in a probabilistic manner, rather than being generated by an explicit adversary, adversarial metrics typically do not provide an accurate or reliable indicator of robustness. To address this, we introduce a statistically robust risk (SRR) framework which measures robustness in expectation over both network inputs and a corruption distribution. Unlike many adversarial risk metrics, which typically require separate applications on a point-by-point basis, the SRR can easily be directly estimated for an entire network and used as a training objective in a stochastic gradient scheme. Furthermore, we show both theoretically and empirically that it can scale to higher-dimensional networks by providing superior generalization performance compared with comparable adversarial risks.
Analysis of the Optimization Landscapes for Overcomplete Representation Learning
Qu, Qing, Zhai, Yuexiang, Li, Xiao, Zhang, Yuqian, Zhu, Zhihui
We study nonconvex optimization landscapes for learning overcomplete representations, including learning (i) sparsely used overcomplete dictionaries and (ii) convolutional dictionaries, where these unsupervised learning problems find many applications in high-dimensional data analysis. Despite the empirical success of simple nonconvex algorithms, theoretical justifications of why these methods work so well are far from satisfactory. In this work, we show these problems can be formulated as $\ell^4$-norm optimization problems with spherical constraint, and study the geometric properties of their nonconvex optimization landscapes. For both problems, we show the nonconvex objectives have benign (global) geometric structures, in the sense that every local minimizer is close to one of the target solutions and every saddle point exhibits negative curvature. This discovery enables the development of guaranteed global optimization methods using simple initializations. For both problems, we show the nonconvex objectives have benign geometric structures -- every local minimizer is close to one of the target solutions and every saddle point exhibits negative curvature -- either in the entire space or within a sufficiently large region. This discovery ensures local search algorithms (such as Riemannian gradient descent) with simple initializations approximately find the target solutions. Finally, numerical experiments justify our theoretical discoveries.
How to implement Gradient Descent in Python
We will try to build a single neuron network, which can predict the admissions of a graduate school. The data we will use is shared above in google drive. The first 5 rows of data are shown below. The first column admit indicates whether the student is getting admitted to the school or not, this will be the target for our model; the second column gre and the third column gpa are numerical features for the student; the fourth column rank is a categorical feature. We will apply one-hot encoding to the categorical feature to add dummy columns.
Learn Electronic Health Records by Fully Decentralized Federated Learning
Lu, Songtao, Zhang, Yawen, Wang, Yunlong, Mack, Christina
Federated learning opens a number of research opportunities due to its high communication efficiency in distributed training problems within a star network. In this paper, we focus on improving the communication efficiency for fully decentralized federated learning over a graph, where the algorithm performs local updates for several iterations and then enables communications among the nodes. In such a way, the communication rounds of exchanging the common interest of parameters can be saved significantly without loss of optimality of the solutions. Multiple numerical simulations based on large, real-world electronic health record databases showcase the superiority of the decentralized federated learning compared with classic methods.
Multi-Gradient Descent for Multi-Objective Recommender Systems
Milojkovic, Nikola, Antognini, Diego, Bergamin, Giancarlo, Faltings, Boi, Musat, Claudiu
Recommender systems need to mirror the complexity of the environment they are applied in. The more we know about what might benefit the user, the more objectives the recommender system has. In addition there may be multiple stakeholders - sellers, buyers, shareholders - in addition to legal and ethical constraints. Simultaneously optimizing for a multitude of objectives, correlated and not correlated, having the same scale or not, has proven difficult so far. We introduce a stochastic multi-gradient descent approach to recommender systems (MGDRec) to solve this problem. We show that this exceeds state-of-the-art methods in traditional objective mixtures, like revenue and recall. Not only that, but through gradient normalization we can combine fundamentally different objectives, having diverse scales, into a single coherent framework. We show that uncorrelated objectives, like the proportion of quality products, can be improved alongside accuracy. Through the use of stochasticity, we avoid the pitfalls of calculating full gradients and provide a clear setting for its applicability.
Winning the Lottery with Continuous Sparsification
Savarese, Pedro, Silva, Hugo, Maire, Michael
The Lottery Ticket Hypothesis from Frankle & Carbin (2019) conjectures that, for typically-sized neural networks, it is possible to find small sub-networks which train faster and yield superior performance than their original counterparts. The proposed algorithm to search for "winning tickets", Iterative Magnitude Pruning, consistently finds sub-networks with $90-95\%$ less parameters which train faster and better than the overparameterized models they were extracted from, creating potential applications to problems such as transfer learning. In this paper, we propose Continuous Sparsification, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies. We show empirically that our method is capable of finding tickets that outperforms the ones learned by Iterative Magnitude Pruning, and at the same time providing faster search, when measured in number of training epochs or wall-clock time.
Optimizing Rank-based Metrics with Blackbox Differentiation
Rolínek, Michal, Musil, Vít, Paulus, Anselm, Vlastelica, Marin, Michaelis, Claudio, Martius, Georg
Rank-based metrics are some of the most widely used criteria for performance evaluation of computer vision models. Despite years of effort, direct optimization for these metrics remains a challenge due to their non-differentiable and non-decomposable nature. We present an efficient, theoretically sound, and general method for differentiating rank-based metrics with mini-batch gradient descent. In addition, we address optimization instability and sparsity of the supervision signal that both arise from using rank-based metrics as optimization targets. Resulting losses based on recall and Average Precision are applied to image retrieval and object detection tasks. We obtain performance that is competitive with state-of-the-art on standard image retrieval datasets and consistently improve performance of near state-of-the-art object detectors.
A Hybrid Approach Towards Two Stage Bengali Question Classification Utilizing Smart Data Balancing Technique
Rahman, Md. Hasibur, Rahman, Chowdhury Rafeed, Amin, Ruhul, Sifat, Md. Habibur Rahman, Anika, Afra
Question classification (QC) is the primary step of the Question Answering (QA) system. Question Classification (QC) system classifies the questions in particular classes so that Question Answering (QA) System can provide correct answers for the questions. Our system categorizes the factoid type questions asked in natural language after extracting features of the questions. We present a two stage QC system for Bengali. It utilizes one dimensional convolutional neural network for classifying questions into coarse classes in the first stage. Word2vec representation of existing words of the question corpus have been constructed and used for assisting 1D CNN. A smart data balancing technique has been employed for giving data hungry convolutional neural network the advantage of a greater number of effective samples to learn from. For each coarse class, a separate Stochastic Gradient Descent (SGD) based classifier has been used in order to differentiate among the finer classes within that coarse class. TF-IDF representation of each word has been used as feature for the SGD classifiers implemented as part of second stage classification. Experiments show the effectiveness of our proposed method for Bengali question classification.
A Quasi-Newton Method Based Vertical Federated Learning Framework for Logistic Regression
Data privacy and security becomes a major concern in building machine learning models from different data providers. Federated learning shows promise by leaving data at providers locally and exchanging encrypted information. This paper studies the vertical federated learning structure for logistic regression where the data sets at two parties have the same sample IDs but own disjoint subsets of features. Existing frameworks adopt the first-order stochastic gradient descent algorithm, which requires large number of communication rounds. To address the communication challenge, we propose a quasi-Newton method based vertical federated learning framework for logistic regression under the additively homomorphic encryption scheme.