hidden unit
A Supplementary Material A.1 Proof Theorem 1. Assuming a family
The classifier for density ratio estimation is based on deep neural networks. Our algorithm and baselines are based on the identical outcome prediction network architecture. The results are reported in Table 4. A.5 Results of synthetic experiments under misspecification of the dimension of latent space The dimension of latent factors is hardly known in many scenarios.
Hybrid Bernstein Normalizing Flows for Flexible Multivariate Density Regression with Interpretable Marginals
Arpogaus, Marcel, Kneib, Thomas, Nagler, Thomas, Rรผgamer, David
Density regression models allow a comprehensive understanding of data by modeling the complete conditional probability distribution. While flexible estimation approaches such as normalizing flows (NF) work particularly well in multiple dimensions, interpreting the input-output relationship of such models is often difficult, due to the black-box character of deep learning models. In contrast, existing statistical methods for multivariate outcomes such as multivariate conditional transformation models (MCTM) are restricted in flexibility and are often not expressive enough to represent complex multivariate probability distributions. In this paper, we combine MCTM with state-of-the-art and autoregressive NF to leverage the transparency of MCTM for modeling interpretable feature effects on the marginal distributions in the first step and the flexibility of neural-network-based NF techniques to account for complex and non-linear relationships in the joint data distribution. We demonstrate our method's versatility in various numerical experiments and compare it with MCTM and other NF models on both simulated and real-world data.
Vesicoureteral Reflux Detection with Reliable Probabilistic Outputs
Papadopoulos, Harris, Anastassopoulos, George
Vesicoureteral Reflux (VUR) is a pediatric disorder in which urine flows backwards from the bladder to the upper urinary tract. Its detection is of great importance as it increases the risk of a Urinary Tract Infection, which can then lead to a kidney infection since bacteria may have direct access to the kidneys. Unfortunately the detection of VUR requires a rather painful medical examination, called voiding cysteourethrogram (VCUG), that exposes the child to radiation. In an effort to avoid the exposure to radiation required by VCUG some recent studies examined the use of machine learning techniques for the detection of VUR based on data that can be obtained without exposing the child to radiation. This work takes one step further by proposing an approach that provides lower and upper bounds for the conditional probability of a given child having VUR. The important property of these bounds is that they are guaranteed (up to statistical fluctuations) to contain well-calibrated probabilities with the only requirement that observations are independent and identically distributed (i.i.d.). Therefore they are much more informative and reliable than the plain yes/no answers provided by other techniques.
A Back-Propagation Algorithm with Optimal Use of Hidden Units
This paper presents a variation of the back-propagation algo(cid:173) rithm that makes optimal use of a network hidden units by de(cid:173) cr asing an "energy" term written as a function of the squared activations of these hidden units. The algorithm can automati(cid:173) cally find optimal or nearly optimal architectures necessary to solve known Boolean functions, facilitate the interpretation of the activation of the remaining hidden units and automatically estimate the complexity of architectures appropriate for phonetic labeling problems. The general principle of the algorithm can also be adapted to different tasks: for example, it can be used to eliminate the [0, 0] local minimum of the [-1.
Preferential Temporal Difference Learning
Anand, Nishanth, Precup, Doina
Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
Understanding the Role of Adversarial Regularization in Supervised Learning
Despite numerous attempts sought to provide empirical evidence of adversarial regularization outperforming sole supervision, the theoretical understanding of such phenomena remains elusive. In this study, we aim to resolve whether adversarial regularization indeed performs better than sole supervision at a fundamental level. To bring this insight into fruition, we study vanishing gradient issue, asymptotic iteration complexity, gradient flow and provable convergence in the context of sole supervision and adversarial regularization. The key ingredient is a theoretical justification supported by empirical evidence of adversarial acceleration in gradient descent. In addition, motivated by a recently introduced unit-wise capacity based generalization bound, we analyze the generalization error in adversarial framework. Guided by our observation, we cast doubts on the ability of this measure to explain generalization. We therefore leave as open questions to explore new measures that can explain generalization behavior in adversarial learning. Furthermore, we observe an intriguing phenomenon in the neural embedded vector space while contrasting adversarial learning with sole supervision.
Improving Performance in Reinforcement Learning by Breaking Generalization in Neural Networks
Ghiassian, Sina, Rafiee, Banafsheh, Lo, Yat Long, White, Adam
Reinforcement learning systems require good representations to work well. For decades practical success in reinforcement learning was limited to small domains. Deep reinforcement learning systems, on the other hand, are scalable, not dependent on domain specific prior knowledge and have been successfully used to play Atari, in 3D navigation from pixels, and to control high degree of freedom robots. Unfortunately, the performance of deep reinforcement learning systems is sensitive to hyper-parameter settings and architecture choices. Even well tuned systems exhibit significant instability both within a trial and across experiment replications. In practice, significant expertise and trial and error are usually required to achieve good performance. One potential source of the problem is known as catastrophic interference: when later training decreases performance by overriding previous learning. Interestingly, the powerful generalization that makes Neural Networks (NN) so effective in batch supervised learning might explain the challenges when applying them in reinforcement learning tasks. In this paper, we explore how online NN training and interference interact in reinforcement learning. We find that simply re-mapping the input observations to a high-dimensional space improves learning speed and parameter sensitivity. We also show this preprocessing reduces interference in prediction tasks. More practically, we provide a simple approach to NN training that is easy to implement, and requires little additional computation. We demonstrate that our approach improves performance in both prediction and control with an extensive batch of experiments in classic control domains.
The Goal-Gradient Hypothesis in Stack Overflow
Hoernle, Nicholas, Kehne, Gregory, Procaccia, Ariel D., Gal, Kobi
According to the goal-gradient hypothesis, people increase their efforts toward a reward as they close in on the reward. This hypothesis has recently been used to explain users' behavior in online communities that use badges as rewards for completing specific activities. In such settings, users exhibit a "steering effect," a dramatic increase in activity as the users approach a badge threshold, thereby following the predictions made by the goal-gradient hypothesis. This paper provides a new probabilistic model of users' behavior, which captures users who exhibit different levels of steering. We apply this model to data from the popular Q&A site, Stack Overflow, and study users who achieve one of the badges available on this platform. Our results show that only a fraction (20%) of all users strongly experience steering, whereas the activity of more than 40% of badge achievers appears not to be affected by the badge. In particular, we find that for some of the population, an increased activity in and around the badge acquisition date may reflect a statistical artifact rather than steering, as was previously thought in prior work. These results are important for system designers who hope to motivate and guide their users towards certain actions. We have highlighted the need for further studies which investigate what motivations drive the non-steered users to contribute to online communities.