In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.
In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score.
In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known standard Shannon-Gibbs (SG) entropy, using an additional real-valued parameter, called an entropic index. By controlling the entropic index, we can generate various types of entropy, including the SG entropy, and a different entropy results in a different class of the optimal policy in Tsallis MDPs. We also provide a full mathematical analysis of Tsallis MDPs, including the optimality condition, performance error bounds, and convergence. Our theoretical result enables us to use any positive entropic index in RL. To handle complex and large-scale problems, we propose a model-free actor-critic RL method using Tsallis entropy maximization. We evaluate the regularization effect of the Tsallis entropy with various values of entropic indices and show that the entropic index controls the exploration tendency of the proposed method. For a different type of RL problems, we find that a different value of the entropic index is desirable. The proposed method is evaluated using the MuJoCo simulator and achieves the state-of-the-art performance.
In density estimation task, maximum entropy model (Maxent) can effectively use reliable prior information via certain constraints, i.e., linear constraints without empirical parameters. However, reliable prior information is often insufficient, and the selection of uncertain constraints becomes necessary but poses considerable implementation complexity. Improper setting of uncertain constraints can result in overfitting or underfitting. To solve this problem, a generalization of Maxent, under Tsallis entropy framework, is proposed. The proposed method introduces a convex quadratic constraint for the correction of (expected) Tsallis entropy bias (TEB). Specifically, we demonstrate that the expected Tsallis entropy of sampling distributions is smaller than the Tsallis entropy of the underlying real distribution. This expected entropy reduction is exactly the (expected) TEB, which can be expressed by a closed-form formula and act as a consistent and unbiased correction. TEB indicates that the entropy of a specific sampling distribution should be increased accordingly. This entails a quantitative re-interpretation of the Maxent principle. By compensating TEB and meanwhile forcing the resulting distribution to be close to the sampling distribution, our generalized TEBC Maxent can be expected to alleviate the overfitting and underfitting. We also present a connection between TEB and Lidstone estimator. As a result, TEB-Lidstone estimator is developed by analytically identifying the rate of probability correction in Lidstone. Extensive empirical evaluation shows promising performance of both TEBC Maxent and TEB-Lidstone in comparison with various state-of-the-art density estimation methods.
Beginning with the seminal work of Hannan , researchers have been interested in algorithms that use random perturbations to generate a distribution over available actions. Kalai and Vempala  showed that the perturbation idealeads to efficient algorithms for many online learning problems with large action sets. Due to the Gumbel lemma [Hazan et al., 2017], the well known exponential weights algorithm [Freund and Schapire, 1997] also has an interpretation as a perturbation based algorithm that uses Gumbel distributed perturbations. There have been several attempts to analyze the regret of perturbation based algorithms with specific distributions such as Uniform, Double-exponential, dropout and random walk (see, e.g., [Kalai and Vempala, 2005, Kujala and Elomaa, 2005, Devroye et al., 2013, Van Erven et al., 2014]). These works provided rigorous guarantees but the techniques they used did not generalize to general perturbations.