Country
Loop estimator for discounted values in Markov reward processes
Dai, Falcon Z., Walter, Matthew R.
At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of state with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called \emph{loop estimator} that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$ unlike TD (with $O(S)$) or model-based methods (with $O(S^2)$). Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\widetilde{O}(\sqrt{\tau_s/T})$ over steps $T$ on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
Extreme Classification via Adversarial Softmax Approximation
Bamler, Robert, Mandt, Stephan
Training a classifier over a large number of classes, known as 'extreme classification', has become a topic of major interest with applications in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost proportional to the number of classes $C$, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in $C$, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.
Let Me At Least Learn What You Really Like: Dealing With Noisy Humans When Learning Preferences
Gopalakrishnan, Sriram, Soni, Utkarsh
Learning the preferences of a human improves the quality of the interaction with the human. The number of queries available to learn preferences maybe limited especially when interacting with a human, and so active learning is a must. One approach to active learning is to use uncertainty sampling to decide the informativeness of a query. In this paper, we propose a modification to uncertainty sampling which uses the expected output value to help speed up learning of preferences. We compare our approach with the uncertainty sampling baseline, as well as conduct an ablation study to test the validity of each component of our approach.
Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling
Xiong, Huaqing, Xu, Tengyu, Liang, Yingbin, Zhang, Wei
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative of Adam in theoretical analysis), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(1/T)$ (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of $\mathcal{O}(\log^2 T/\sqrt{T})$. Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(1/T)$, and with a diminishing stepsize converges exactly to the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$. Our study develops new techniques for analyzing the Adam-type RL algorithms under Markovian sampling.
Higher order co-occurrence tensors for hypergraphs via face-splitting
A popular trick for computing a pairwise co-occurrence matrix is the product of an incidence matrix and its transpose. We present an analog for higher order tuple co-occurrences using the face-splitting product, or alternately known as the transpose Khatri-Rao product. These higher order co-occurrences encode the commonality of tokens in the company of other tokens, and thus generalize the mutual information commonly studied. We demonstrate this tensor's use via a popular NLP model, and hypergraph models of similarity. Studying an implicit meaning of a collection of things via their relationships with other things of the same type is a popular technique.
Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? -- A Neural Tangent Kernel Perspective
Huang, Kaixuan, Wang, Yuqing, Tao, Molei, Zhao, Tuo
Deep residual networks (ResNets) have demonstrated better generalization performance than deep feedforward networks (FFNets). However, the theory behind such a phenomenon is still largely unknown. This paper studies this fundamental problem in deep learning from a so-called "neural tangent kernel" perspective. Specifically, we first show that under proper conditions, as the width goes to infinity, training deep ResNets can be viewed as learning reproducing kernel functions with some kernel function. We then compare the kernel of deep ResNets with that of deep FFNets and discover that the class of functions induced by the kernel of FFNets is asymptotically not learnable, as the depth goes to infinity. In contrast, the class of functions induced by the kernel of ResNets does not exhibit such degeneracy. Our discovery partially justifies the advantages of deep ResNets over deep FFNets in generalization abilities. Numerical results are provided to support our claim.
Robust Policies For Proactive ICU Transfers
Grand-Clement, Julien, Chan, Carri W., Goyal, Vineet, Escobar, Gabriel
Patients whose transfer to the Intensive Care Unit (ICU) is unplanned are prone to higher mortality rates than those who were admitted directly to the ICU. Recent advances in machine learning to predict patient deterioration have introduced the possibility of \emph{proactive transfer} from the ward to the ICU. In this work, we study the problem of finding \emph{robust} patient transfer policies which account for uncertainty in statistical estimates due to data limitations when optimizing to improve overall patient care. We propose a Markov Decision Process model to capture the evolution of patient health, where the states represent a measure of patient severity. Under fairly general assumptions, we show that an optimal transfer policy has a threshold structure, i.e., that it transfers all patients above a certain severity level to the ICU (subject to available capacity). As model parameters are typically determined based on statistical estimations from real-world data, they are inherently subject to misspecification and estimation errors. We account for this parameter uncertainty by deriving a robust policy that optimizes the worst-case reward across all plausible values of the model parameters. We show that the robust policy also has a threshold structure under fairly general assumptions. Moreover, it is more aggressive in transferring patients than the optimal nominal policy, which does not take into account parameter uncertainty. We present computational experiments using a dataset of hospitalizations at 21 KNPC hospitals, and present empirical evidence of the sensitivity of various hospital metrics (mortality, length-of-stay, average ICU occupancy) to small changes in the parameters. Our work provides useful insights into the impact of parameter uncertainty on deriving simple policies for proactive ICU transfer that have strong empirical performance and theoretical guarantees.
Wind speed prediction using a hybrid model of the multi-layer perceptron and whale optimization algorithm
Samadianfard, Saeed, Hashemi, Sajjad, Kargar, Katayoun, Izadyar, Mojtaba, Mostafaeipour, Ali, Mosavi, Amir, Nabipour, Narjes, Shamshirband, Shahaboddin
Wind power as a renewable source of energy, has numerous economic, environmental and social benefits. In order to enhance and control renewable wind power, it is vital to utilize models that predict wind speed with high accuracy. Due to neglecting of requirement and significance of data preprocessing and disregarding the inadequacy of using a single predicting model, many traditional models have poor performance in wind speed prediction. In the current study, for predicting wind speed at target stations in the north of Iran, the combination of a multi-layer perceptron model (MLP) with the Whale Optimization Algorithm (WOA) used to build new method (MLP-WOA) with a limited set of data (2004-2014). Then, the MLP-WOA model was utilized at each of the ten target stations, with the nine stations for training and tenth station for testing (namely: Astara, Bandar-E-Anzali, Rasht, Manjil, Jirandeh, Talesh, Kiyashahr, Lahijan, Masuleh, and Deylaman) to increase the accuracy of the subsequent hybrid model. The capability of the hybrid model in wind speed forecasting at each target station was compared with the MLP model without the WOA optimizer. To determine definite results, numerous statistical performances were utilized. For all ten target stations, the MLP-WOA model had precise outcomes than the standalone MLP model. The hybrid model had acceptable performances with lower amounts of the RMSE, SI and RE parameters and higher values of NSE, WI, and KGE parameters. It was concluded that the WOA optimization algorithm can improve the prediction accuracy of MLP model and may be recommended for accurate wind speed prediction.
Top-K Training of GANs: Improving Generators by Making Critics Less Critical
Sinha, Samarth, Goyal, Anirudh, Raffel, Colin, Odena, Augustus
We introduce a simple (one line of code) modification to the Generative Adversarial Network (GAN) training algorithm that materially improves results with no increase in computational cost: When updating the generator parameters, we simply zero out the gradient contributions from the elements of the batch that the critic scores as `least realistic'. Through experiments on many different GAN variants, we show that this `top-k update' procedure is a generally applicable improvement. In order to understand the nature of the improvement, we conduct extensive analysis on a simple mixture-of-Gaussians dataset and discover several interesting phenomena. Among these is that, when gradient updates are computed using the worst-scoring batch elements, samples can actually be pushed further away from the their nearest mode.
Electricity Theft Detection with self-attention
Finardi, Paulo, Campiotti, Israel, Plensack, Gustavo, de Souza, Rafael Derradi, Nogueira, Rodrigo, Pinheiro, Gustavo, Lotufo, Roberto
In this work we propose a novel self-attention mechanism model to address electricity theft detection on an imbalanced realistic dataset that presents a daily electricity consumption provided by State Grid Corporation of China. Our key contribution is the introduction of a multi-head self-attention mechanism concatenated with dilated convolutions and unified by a convolution of kernel size $1$. Moreover, we introduce a binary input channel (Binary Mask) to identify the position of the missing values, allowing the network to learn how to deal with these values. Our model achieves an AUC of $0.926$ which is an improvement in more than $17\%$ with respect to previous baseline work. The code is available on GitHub at https://github.com/neuralmind-ai/electricity-theft-detection-with-self-attention.