to

### Entropy in Soft Actor-Critic (Part 1)

In probability theory, two principles are associated with entropy: the principle of maximum entropy and the principle of minimum cross-entropy. From the very beginning we notice that there are two types of entropy; however, there are more in store. First of all, let us emphasize that neither the principle of maximum entropy nor the principle of minimum cross-entropy is a theorem; they are only principles of statistical inference. In this they are very similar to philosophical doctrines. However, these doctrines certainly have mathematical implications. So we have two different types of entropy: entropy and cross-entropy.
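To make the two quantities concrete, here is a minimal sketch for discrete distributions. It assumes the common machine-learning definitions, $H(p) = -\sum_i p_i \log p_i$ and $H(p, q) = -\sum_i p_i \log q_i$; note that some statements of the minimum cross-entropy principle instead minimize the relative entropy (KL divergence), which is the gap between the two. The function names and example distributions are illustrative, not taken from the article.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log q_i, in nats.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions (hypothetical values).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

h_p = entropy(p)            # entropy of p
h_pq = cross_entropy(p, q)  # cross-entropy of p relative to q
kl = h_pq - h_p             # relative entropy KL(p || q), never negative

print(f"H(p) = {h_p:.4f}, H(p, q) = {h_pq:.4f}, KL(p || q) = {kl:.4f}")
```

The cross-entropy equals the entropy exactly when `q` matches `p`, which is why minimizing it over `q` pulls the model toward the reference distribution.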

### The Unbearable Shallowness of "Deep AI"

Since people invented writing, communications technology has become steadily more high-bandwidth, pervasive and persuasive, taking a commensurate toll on human attention and cognition. In that bandwidth war between machines and humans, the machines' latest weapon is a class of statistical algorithms dubbed "deep AI." This computational engine has already, at a stroke, conquered both humankind's most cherished mind-game (Go) and our unconscious spending decisions (online). This month, finally, we can read how it happened, and clearly enough to do something. But I'm not just writing a book review, because the interaction of math with brains has been my career and my passion. Plus, I know the author. So, after praising the book, I append an intellectual digest, debunking the hype in favor of undisputed mathematical principles governing both machine and biological information-processing systems. That makes this article unique but long. "Genius Makers: The Mavericks Who Brought AI to Google, Facebook, and the World" is the first book to chronicle the rise of savant-like artificial intelligence (AI), and the last we'll ever need. Investigative journalist Cade Metz lays out the history and the math through the machines' human inventors. The title, "Genius Makers," refers both to the genius-like brilliance of the human makers of AI and to the genius-like brilliance of the AI programs they create. Of all possible AIs, the particular flavor in the book is a class of data-digestion algorithms called deep learning. Metz's book is a ripping good read, paced like a page-turner prodding a reader to discover which of the many genius AI creators will outflank or outthink the others, and how. Together, in collaboration and competition, the computer scientists Metz portrays are inventing and deploying the fastest and most human-impacting revolution in technology to date, the apparently inexorable replacement of human sensation and choice by machine sensation and choice.
This is the story of the people designing the bots that do so many things better than us.

### This Tenet Shows Time Travel May Be Possible - Issue 98: Mind

Time travel has been a beloved science-fiction idea at least since H.G. Wells wrote The Time Machine in 1895. The concept continues to fascinate and fictional approaches keep coming, prodding us to wonder whether time travel is physically possible and, for that matter, makes logical sense in the face of its inscrutable paradoxes. Remarkably, last year saw both a science-fiction film that illuminates these questions, and a real scientific result, spelled out in the journal Classical and Quantum Gravity [1], that may point to answers. The film is writer-director Christopher Nolan's attention-getting Tenet. Like other time travel stories, Tenet uses a time machine.

### Inductive Mutual Information Estimation: A Convex Maximum-Entropy Copula Approach

We propose a novel estimator of the mutual information between two ordinal vectors $x$ and $y$. Our approach is inductive (as opposed to deductive) in that it depends on the data generating distribution solely through some nonparametric properties revealing associations in the data, and does not require having enough data to fully characterize the true joint distributions $P_{x, y}$. Specifically, our approach consists of (i) noting that $I\left(y; x\right) = I\left(u_y; u_x\right)$ where $u_y$ and $u_x$ are the copula-uniform dual representations of $y$ and $x$ (i.e. their images under the probability integral transform), and (ii) estimating the copula entropies $h\left(u_y\right)$, $h\left(u_x\right)$ and $h\left(u_y, u_x\right)$ by solving a maximum-entropy problem over the space of copula densities under a constraint of the type $\alpha_m = E\left[\phi_m(u_y, u_x)\right]$. We prove that, so long as the constraint is feasible, this problem admits a unique solution, it is in the exponential family, and it can be learned by solving a convex optimization problem. The resulting estimator, which we denote MIND, is marginal-invariant, always non-negative, unbounded for any sample size $n$, consistent, has MSE rate $O(1/n)$, and is more data-efficient than competing approaches. Beyond mutual information estimation, we illustrate that our approach may be used to mitigate mode collapse in GANs by maximizing the entropy of the copula of fake samples, a model we refer to as Copula Entropy Regularized GAN (CER-GAN).
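Step (i) above rests on the fact that the copula-uniform representation depends only on ranks, so it is unchanged by strictly increasing transforms of each marginal. The sketch below illustrates only that marginal-invariance property with an empirical probability integral transform; it is not the MIND estimator, and the helper name `copula_uniform` and the `rank / (n + 1)` scaling are assumptions made for illustration.

```python
import math
import random

def copula_uniform(values):
    """Empirical probability integral transform.

    Maps each observation to rank / (n + 1), which lies strictly in (0, 1).
    Assumes no ties among the observations.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]

# A strictly increasing map changes the marginal distribution but not the ranks.
x_transformed = [math.exp(3.0 * v) for v in x]

u_x = copula_uniform(x)
u_x_transformed = copula_uniform(x_transformed)

print(u_x == u_x_transformed)
```

Because both inputs yield the same uniform representation, any quantity computed from it, including a mutual information estimate, inherits the invariance.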

### Deep Hedging: Learning Risk-Neutral Implied Volatility Dynamics

We present a numerically efficient approach for learning a risk-neutral measure for paths of simulated spot and option prices up to a finite horizon under convex transaction costs and convex trading constraints. This approach can then be used to implement a stochastic implied volatility model in the following two steps: 1. Train a market simulator for option prices, as discussed for example in our recent work Bai et al. (2019); 2. Find a risk-neutral density, specifically the minimal entropy martingale measure. The resulting model can be used for risk-neutral pricing, or for Deep Hedging (Buehler et al., 2019) in the case of transaction costs or trading constraints. To motivate the proposed approach, we also show that market dynamics are free from "statistical arbitrage" in the absence of transaction costs if and only if they follow a risk-neutral measure. We additionally provide a more general characterization in the presence of convex transaction costs and trading constraints. These results can be seen as an analogue of the fundamental theorem of asset pricing for statistical arbitrage under trading frictions and are of independent interest.
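As a toy illustration of the second step, consider a single period, one asset, no transaction costs, and a finite set of equally likely simulated returns. Under those simplifying assumptions (which are much narrower than the setting of the paper), the measure closest to the simulated one in relative entropy with zero expected return is an exponential tilting, $q_i \propto \exp(-\lambda r_i)$. The function name, the bisection bounds, and the sample returns below are hypothetical.

```python
import math

def min_entropy_martingale_weights(returns, lo=-50.0, hi=50.0, iters=200):
    """Reweight equally likely scenarios so the expected return is zero.

    Toy one-period, frictionless sketch: weights are q_i ~ exp(-lam * r_i),
    with lam found by bisection. Assumes returns contain both signs and that
    the root lies inside [lo, hi].
    """
    def tilted_mean(lam):
        w = [math.exp(-lam * r) for r in returns]
        z = sum(w)
        return sum(wi * r for wi, r in zip(w, returns)) / z

    # The tilted mean is decreasing in lam, so bisection finds its root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * r) for r in returns]
    z = sum(w)
    return [wi / z for wi in w]

# Hypothetical simulated one-period returns with a positive drift.
simulated_returns = [0.10, 0.05, 0.02, -0.03, -0.08]
q = min_entropy_martingale_weights(simulated_returns)
risk_neutral_mean = sum(qi * r for qi, r in zip(q, simulated_returns))

print([round(qi, 4) for qi in q], risk_neutral_mean)
```

The tilting shifts weight toward the negative-return scenarios just enough to remove the drift, so no static position in the asset has positive expected gain under `q`.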

### Towards interpretability of Mixtures of Hidden Markov Models

Mixtures of Hidden Markov Models (MHMMs) are frequently used for clustering of sequential data. An important aspect of MHMMs, as of any clustering approach, is that they can be interpretable, allowing for novel insights to be gained from the data. However, without a proper way of measuring interpretability, the evaluation of novel contributions is difficult and it becomes practically impossible to devise techniques that directly optimize this property. In this work, an information-theoretic measure (entropy) is proposed for interpretability of MHMMs, and based on that, a novel approach to improve model interpretability is introduced, i.e., an entropy-regularized Expectation Maximization (EM) algorithm. The new approach aims to reduce the entropy of the Markov chains (involving state transition matrices) within an MHMM, i.e., to assign higher weights to common state transitions during clustering. It is argued that this entropy reduction, in general, leads to improved interpretability since the most influential and important state transitions of the clusters can be more easily identified. An empirical investigation shows that it is possible to improve the interpretability of MHMMs, as measured by entropy, without sacrificing (but rather improving) clustering performance and computational costs, as measured by the v-measure and number of EM iterations, respectively.
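To give a feel for what "entropy of a transition matrix" can mean, the sketch below computes the mean Shannon entropy of the rows of a row-stochastic matrix. The abstract does not spell out the exact measure, so the unweighted row average, the function names, and the example matrices are all assumptions for illustration.

```python
import math

def row_entropy(row):
    """Shannon entropy (nats) of one row of transition probabilities."""
    return -sum(p * math.log(p) for p in row if p > 0)

def transition_entropy(matrix):
    """Unweighted mean of row entropies of a row-stochastic matrix.

    Assumption: the paper's measure may weight rows differently.
    """
    return sum(row_entropy(row) for row in matrix) / len(matrix)

# Hypothetical 3-state chains: one diffuse, one dominated by a few transitions.
diffuse = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
peaked = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]

print(transition_entropy(diffuse), transition_entropy(peaked))
```

The peaked chain scores lower, matching the intuition that a few dominant transitions are easier to read off than a uniform spread.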

### Maximum Entropy Reinforcement Learning with Mixture Policies

Mixture models are an expressive hypothesis class that can approximate a rich set of policies. However, using mixture policies in the Maximum Entropy (MaxEnt) framework is not straightforward. The entropy of a mixture model is not equal to the sum of the entropies of its components, nor does it have a closed-form expression in most cases. Using such policies in MaxEnt algorithms, therefore, requires constructing a tractable approximation of the mixture entropy. In this paper, we derive a simple, low-variance mixture-entropy estimator. We show that it is closely related to the sum of marginal entropies. Equipped with our entropy estimator, we derive an algorithmic variant of Soft Actor-Critic (SAC) for the mixture policy case and evaluate it on a series of continuous control tasks.
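The gap between a mixture's entropy and its components' entropies can be seen numerically. The sketch below uses a plain Monte Carlo estimate for a one-dimensional Gaussian mixture; it is not the low-variance estimator derived in the paper, and the mixture parameters and function names are hypothetical. It relies on the standard bounds $\sum_k w_k H(p_k) \le H(\text{mixture}) \le \sum_k w_k H(p_k) + H(w)$.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gaussian_entropy(sigma):
    """Closed-form differential entropy (nats) of a 1-D Gaussian."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def mixture_entropy_mc(weights, mus, sigmas, n_samples=5000, seed=0):
    """Naive Monte Carlo estimate of -E[log p(x)] under the mixture p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(mus[k], sigmas[k])
        density = sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
        total -= math.log(density)
    return total / n_samples

# Hypothetical two-component mixture with overlapping components.
weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

estimate = mixture_entropy_mc(weights, mus, sigmas)
lower = sum(w * gaussian_entropy(s) for w, s in zip(weights, sigmas))
upper = lower - sum(w * math.log(w) for w in weights)

print(f"{lower:.3f} <= {estimate:.3f} <= {upper:.3f}")
```

The estimate falls strictly between the weighted component entropies and that quantity plus the entropy of the mixing weights, so neither closed form can stand in for the mixture entropy.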

### Understanding the origin of information-seeking exploration in probabilistic objectives for control

The exploration-exploitation trade-off is central to the description of adaptive behaviour in fields ranging from machine learning, to biology, to economics. While many approaches have been taken, one approach to solving this trade-off has been to equip agents with, or propose that agents possess, an intrinsic 'exploratory drive', which is often implemented in terms of maximizing the agent's information gain about the world, an approach which has been widely studied in machine learning and cognitive science. In this paper we mathematically investigate the nature and meaning of such approaches and demonstrate that this combination of utility maximizing and information-seeking behaviour arises from the minimization of an entirely different class of objectives we call divergence objectives. We propose a dichotomy in the objective functions underlying adaptive behaviour between *evidence* objectives, which correspond to well-known reward or utility maximizing objectives in the literature, and *divergence* objectives, which instead seek to minimize the divergence between the agent's expected and desired futures, and argue that this new class of divergence objectives could form the mathematical foundation for a much richer understanding of the exploratory components of adaptive and intelligent action, beyond simply greedy utility maximization.

### Function approximation by deep neural networks with parameters $\{0,\pm \frac{1}{2}, \pm 1, 2\}$

In this paper it is shown that $C_\beta$-smooth functions can be approximated by neural networks with parameters $\{0,\pm \frac{1}{2}, \pm 1, 2\}$. The depth, width and the number of active parameters of the constructed networks have, up to a logarithmic factor, the same dependence on the approximation error as the networks with parameters in $[-1,1]$. In particular, this means that nonparametric regression estimation with the constructed networks attains the same convergence rate as with sparse networks with parameters in $[-1,1]$.

### The Relationship Between Perplexity And Entropy In NLP

Perplexity is a common metric to use when evaluating language models. For example, scikit-learn's implementation of Latent Dirichlet Allocation (a topic-modeling algorithm) includes perplexity as a built-in metric. In this post, I will define perplexity and then discuss entropy, the relation between the two, and how it arises naturally in natural language processing applications. A quite general setup in many natural language tasks is that you have a language L and want to build a model M for the language. The "language" could be a specific genre or corpus like "English Wikipedia", "Nigerian Twitter", or "Shakespeare", or (conceptually at least) just a generic language like "French." Specifically, by a language L we mean a process for generating text.
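The excerpt stops before the definition, so the sketch below assumes the standard one: perplexity is the exponential of the average negative log-probability that the model M assigns to each token of a text sample from L, i.e. the exponentiated empirical cross-entropy. The function name and the probability lists are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-probability per token).

    token_probs holds the probability the model assigned to each observed
    token; every entry must be strictly positive.
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over a 50-word vocabulary
# is as uncertain as a fair 50-sided die on every token.
uniform_ppl = perplexity([1 / 50] * 20)

# A model that is more confident on the observed tokens scores lower.
confident_ppl = perplexity([0.5, 0.25, 0.5, 0.25])

print(uniform_ppl, confident_ppl)
```

Reading perplexity as an effective branching factor makes the link to entropy direct: a lower cross-entropy per token means fewer equally likely choices per step.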