importance function
Multiple importance sampling for stochastic gradient estimation
Salaün, Corentin, Huang, Xingchang, Georgiev, Iliyan, Mitra, Niloy J., Singh, Gurprit
We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework involves optimally weighting data contribution across multiple distributions. This adapted combination of multiple importance yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks like classification and regression on both image and point cloud datasets.
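The idea of combining several sampling distributions can be illustrated with a minimal sketch. It uses the classical balance heuristic for multiple importance sampling, which is a simpler stand-in and not the optimal weighting the abstract describes. It is also scalar-valued, whereas the paper targets vector-valued gradients. The function name `mis_balance_estimate` and the toy values are assumptions made for illustration.

```python
import random


def mis_balance_estimate(f, dists, counts, rng):
    """Estimate sum(f) from samples drawn from several discrete distributions.

    f      : list of values f[i], one per data point
    dists  : list of probability vectors p_j over the data points
    counts : number of samples n_j drawn from each p_j
    """
    indices = range(len(f))
    estimate = 0.0
    for p_j, n_j in zip(dists, counts):
        for i in rng.choices(indices, weights=p_j, k=n_j):
            # Balance heuristic: the weight n_j * p_j(i) / sum_l n_l * p_l(i),
            # combined with the factor 1 / (n_j * p_j(i)), collapses to
            # 1 / sum_l n_l * p_l(i).
            denom = sum(n_l * p_l[i] for p_l, n_l in zip(dists, counts))
            estimate += f[i] / denom
    return estimate


rng = random.Random(0)
f = [1.0, 4.0, 0.5, 2.5]
p1 = [0.7, 0.1, 0.1, 0.1]  # hypothetical distribution favouring point 0
p2 = [0.1, 0.6, 0.1, 0.2]  # hypothetical distribution favouring point 1
runs = [mis_balance_estimate(f, [p1, p2], [4, 4], rng) for _ in range(4000)]
mean_estimate = sum(runs) / len(runs)
```

Averaged over many runs, `mean_estimate` should approach `sum(f)`, since the balance-heuristic estimator is unbiased whenever every point with a nonzero value can be drawn by at least one distribution.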
State Space Models are Comparable to Transformers in Estimating Functions with Dynamic Smoothness
Nishikawa, Naoki, Suzuki, Taiji
Deep neural networks based on state space models (SSMs) are attracting much attention in sequence modeling since their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been primarily investigated through experimental comparisons, theoretical understanding of SSMs is still limited. In particular, there is a lack of statistical and quantitative evaluation of whether SSMs can replace Transformers. In this paper, we theoretically explore in which tasks SSMs can be alternatives to Transformers from the perspective of estimating sequence-to-sequence functions. We consider the setting where the target function has direction-dependent smoothness and prove that SSMs can estimate such functions with the same convergence rate as Transformers. Additionally, we prove that SSMs can estimate the target function as well as Transformers can, even if the smoothness changes depending on the input sequence. Our results show the possibility that SSMs can replace Transformers when estimating the functions in certain classes that appear in practice.
Adaptive Testing Environment Generation for Connected and Automated Vehicles with Dense Reinforcement Learning
Yang, Jingxuan, Bai, Ruoxuan, Ji, Haoyuan, Zhang, Yi, Hu, Jianming, Feng, Shuo
The assessment of safety performance plays a pivotal role in the development and deployment of connected and automated vehicles (CAVs). A common approach involves designing testing scenarios based on prior knowledge of CAVs (e.g., surrogate models), conducting tests in these scenarios, and subsequently evaluating CAVs' safety performances. However, substantial differences between CAVs and the prior knowledge can significantly diminish the evaluation efficiency. In response to this issue, existing studies predominantly concentrate on the adaptive design of testing scenarios during the CAV testing process. Yet, these methods have limitations in their applicability to high-dimensional scenarios. To overcome this challenge, we develop an adaptive testing environment that bolsters evaluation robustness by incorporating multiple surrogate models and optimizing the combination coefficients of these surrogate models to enhance evaluation efficiency. We formulate the optimization problem as a regression task utilizing quadratic programming. To efficiently obtain the regression target via reinforcement learning, we propose the dense reinforcement learning method and devise a new adaptive policy with high sample efficiency. Essentially, our approach centers on learning the values of critical scenes displaying substantial surrogate-to-real gaps. The effectiveness of our method is validated in high-dimensional overtaking scenarios, demonstrating that our approach achieves notable evaluation efficiency.
SurvBeNIM: The Beran-Based Neural Importance Model for Explaining the Survival Models
Utkin, Lev V., Eremenko, Danila Y., Konstantinov, Andrei V.
One of the important types of data in several applications is censored survival data processed in the framework of survival analysis [1, 2]. This type of data can be found in applications where objects are characterized by times to some events of interest, for example, by times to failure in reliability, times to recovery or times to death in medicine, times to bankruptcy of a bank or times to an economic crisis in economics. The important peculiarity of survival data is that the corresponding event does not necessarily occur during its observation period. In this case, the data are called censored or right-censored [3]. There are many machine learning models dealing with survival data, including models based on applying and extending the Cox proportional hazard model [4], for example, models presented in [5, 6], models based on a survival modification of random forests, called random survival forests (RSF) [7, 8, 9, 10, 11], and models extending neural networks [6, 12, 13, 14]. These models have gained considerable attention for their ability to analyze time-to-event data and to predict survival outcomes accurately. However, most models are perceived as black boxes, lacking interpretability.
Inherent Inconsistencies of Feature Importance
Harel, Nimrod, Obolski, Uri, Gilad-Bachrach, Ran
The rapid advancement and widespread adoption of machine learning-driven technologies have underscored the practical and ethical need for creating interpretable artificial intelligence systems. Feature importance, a method that assigns scores to the contribution of individual features on prediction outcomes, seeks to bridge this gap as a tool for enhancing human comprehension of these systems. Feature importance serves as an explanation of predictions in diverse contexts, whether by providing a global interpretation of a phenomenon across the entire dataset or by offering a localized explanation for the outcome of a specific data point. Furthermore, feature importance is being used both for explaining models and for identifying plausible causal relations in the data, independently from the model. However, it is worth noting that these various contexts have traditionally been explored in isolation, with limited theoretical foundations. This paper presents an axiomatic framework designed to establish coherent relationships among the different contexts of feature importance scores. Notably, our work unveils a surprising conclusion: when we combine the proposed properties with those previously outlined in the literature, we demonstrate the existence of an inconsistency. This inconsistency highlights that certain essential properties of feature importance scores cannot coexist harmoniously within a single framework.
Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling
Salaün, Corentin, Huang, Xingchang, Georgiev, Iliyan, Mitra, Niloy J., Singh, Gurprit
Machine learning problems rely heavily on stochastic gradient descent (SGD) for optimization. The effectiveness of SGD is contingent upon accurately estimating gradients from a mini-batch of data samples. Instead of the commonly used uniform sampling, adaptive or importance sampling reduces noise in gradient estimation by forming mini-batches that prioritize crucial data points. Previous research has suggested that data points should be selected with probabilities proportional to their gradient norm. Nevertheless, existing algorithms have struggled to efficiently integrate importance sampling into machine learning frameworks. In this work, we make two contributions. First, we present an algorithm that can incorporate existing importance functions into our framework. Second, we propose a simplified importance function that relies solely on the loss gradient of the output layer. By leveraging our proposed gradient estimation techniques, we observe improved convergence in classification and regression tasks with minimal computational overhead. Stochastic gradient descent (SGD) combined with back-propagation and efficient gradient techniques--such as Adam [12]--has unlocked a realm of possibilities.
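As a minimal sketch of the gradient-norm sampling rule mentioned in the abstract, the following draws a mini-batch with probability proportional to each sample's gradient magnitude and reweights every draw by 1 / (n * p_i), which keeps the estimate unbiased. The one-parameter least-squares model, the function names, and the toy data are assumptions for illustration. The sketch is not the authors' algorithm and does not use their output-layer importance function.

```python
import random


def per_sample_grad(w, x, y):
    # Gradient of the loss 0.5 * (w * x - y) ** 2 with respect to the scalar weight w.
    return (w * x - y) * x


def full_gradient(w, data):
    # Exact gradient: the average of all per-sample gradients.
    return sum(per_sample_grad(w, x, y) for x, y in data) / len(data)


def importance_sampled_gradient(w, data, batch_size, rng):
    n = len(data)
    grads = [per_sample_grad(w, x, y) for x, y in data]
    # Importance proportional to gradient magnitude; the small floor keeps
    # every sample reachable so the estimator stays unbiased.
    scores = [abs(g) + 1e-8 for g in grads]
    total = sum(scores)
    probs = [s / total for s in scores]
    batch = rng.choices(range(n), weights=probs, k=batch_size)
    # Dividing each drawn gradient by n * p_i undoes the sampling bias.
    return sum(grads[i] / (n * probs[i]) for i in batch) / batch_size


rng = random.Random(0)
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (0.5, -1.0)]  # toy (x, y) pairs
estimates = [importance_sampled_gradient(1.0, data, 8, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
exact = full_gradient(1.0, data)
```

Note that this sketch computes every per-sample gradient just to build the sampling probabilities, which is too costly at scale. That cost is the motivation for cheaper importance functions such as the output-layer one proposed in the abstract.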
Machine Learning Explainability
One simple method is Permutation Feature Importance. It is a model inspection technique that can be used for any fitted estimator when the data is tabular. The permutation feature importance is defined as the decrease in a model score when the values of a single feature are randomly shuffled. This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on the feature. When two features are correlated, shuffling one of them still leaves the model access to the same information through the other, which can understate the importance of both. A good practice is therefore to drop one of the correlated features based on domain understanding and then apply the Permutation Feature Importance algorithm, which gives a clearer picture of each feature. Let's discuss another method to interpret black-box models.
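The procedure described above can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (rows stored as lists, R² as the model score, a hypothetical toy model). Libraries such as scikit-learn ship their own version in `sklearn.inspection.permutation_importance`.

```python
import random


def r2_score(y_true, y_pred):
    # Coefficient of determination, used here as the model score.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_feature_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical fitted model that uses feature 0 only and ignores feature 1.
def model(row):
    return 3.0 * row[0]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]
scores = permutation_feature_importance(model, X, y)
```

Here `scores[0]` should be clearly positive because the model relies on feature 0, while `scores[1]` is zero because shuffling an ignored feature cannot change the predictions.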
Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling
Vogel, Robin, Achab, Mastane, Clémençon, Stéphan, Tillier, Charles
ABSTRACT We consider statistical learning problems in which the distribution $P'$ of the training observations $Z'_1, \ldots, Z'_n$ differs from the distribution $P$ involved in the risk one seeks to minimize (referred to as the test distribution) but is still defined on the same measurable space as $P$ and dominates it. In the unrealistic case where the likelihood ratio $\Phi(z) = dP/dP'(z)$ is known, one may straightforwardly extend the Empirical Risk Minimization (ERM) approach to this specific transfer learning setup using the same idea as that behind Importance Sampling, by minimizing a weighted version of the empirical risk functional computed from the 'biased' training data $Z'_i$ with weights $\Phi(Z'_i)$. Although the importance function $\Phi(z)$ is generally unknown in practice, we show that, in various situations frequently encountered in practice, it takes a simple form and can be directly estimated from the $Z'_i$'s and some auxiliary information on the statistical population $P$. By means of linearization techniques, we then prove that the generalization capacity of the aforementioned approach is preserved when plugging the resulting estimates of the $\Phi(Z'_i)$'s into the weighted empirical risk. Beyond these theoretical guarantees, numerical results provide strong empirical evidence of the relevance of the approach promoted in this article. Keywords: Statistical Learning Theory, Importance Sampling, Transfer Learning. 1 Introduction Prediction problems are of major importance in statistical learning. The main paradigm of predictive learning is Empirical Risk Minimization (ERM in abbreviated form), see e.g. In the standard setup, $Z$ is a random variable (r.v. in short) that takes its values in a feature space $\mathcal{Z}$ with distribution $P$, $\Theta$ is a parameter space and $\ell : \Theta \times \mathcal{Z} \to \mathbb{R}$ is a (measurable) loss function. The risk is then defined by: $\forall \theta \in \Theta,\ R_P(\theta) = \mathbb{E}_P[\ell(\theta, Z)]$, (1) and more generally for any measure $Q$ on $\mathcal{Z}$: $R_Q(\theta) = \int_{\mathcal{Z}} \ell(\theta, z)\, dQ(z)$.
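A small numerical sketch may help make the weighted empirical risk concrete. It assumes a toy setting that is not taken from the paper: two strata whose proportions differ between training and test, so that the importance function Φ is the ratio of test to training proportions for each observation's stratum. The loss is the squared loss, and all names and values are hypothetical.

```python
def weighted_empirical_risk(theta, values, weights, loss):
    # (1/n) * sum_i Phi(Z'_i) * loss(theta, Z'_i)
    n = len(values)
    return sum(w * loss(theta, z) for z, w in zip(values, weights)) / n


def squared_loss(theta, z):
    return (theta - z) ** 2


def weighted_mean(values, weights):
    # Closed-form minimizer of the weighted empirical risk under squared loss.
    return sum(w * z for z, w in zip(values, weights)) / sum(weights)


# Biased training sample: stratum "a" is over-represented (8 of 10 observations).
strata = ["a"] * 8 + ["b"] * 2
values = [1.0] * 8 + [5.0] * 2

train_props = {"a": 0.8, "b": 0.2}  # proportions under the training distribution P'
test_props = {"a": 0.5, "b": 0.5}   # assumed known proportions under the test distribution P

# Importance function: Phi(z) = dP/dP'(z), constant within each stratum here.
phi = [test_props[s] / train_props[s] for s in strata]

theta_unweighted = weighted_mean(values, [1.0] * len(values))
theta_weighted = weighted_mean(values, phi)
```

In this toy case the unweighted ERM solution is the biased training mean, while the weighted solution recovers the mean under the test proportions, which is the behaviour the importance weights are meant to produce.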
Confidence Inference in Bayesian Networks
Cheng, Jian, Druzdzel, Marek J.
We present two sampling algorithms for probabilistic confidence inference in Bayesian networks. These two algorithms (we call them AIS-BN-mu and AIS-BN-sigma algorithms) guarantee that estimates of posterior probabilities are with a given probability within a desired precision bound. Our algorithms are based on recent advances in sampling algorithms for (1) estimating the mean of bounded random variables and (2) adaptive importance sampling in Bayesian networks. In addition to a simple stopping rule for sampling that they provide, the AIS-BN-mu and AIS-BN-sigma algorithms are capable of guiding the learning process in the AIS-BN algorithm. An empirical evaluation of the proposed algorithms shows excellent performance, even for very unlikely evidence.
An Importance Sampling Algorithm Based on Evidence Pre-propagation
Yuan, Changhe, Druzdzel, Marek J.
Precision achieved by stochastic sampling algorithms for Bayesian networks typically deteriorates in the face of extremely unlikely evidence. To address this problem, we propose the Evidence Pre-propagation Importance Sampling algorithm (EPIS-BN), an importance sampling algorithm that computes an approximate importance function using two heuristic methods: loopy belief propagation and ε-cutoff. We tested the performance of ε-cutoff on three large real Bayesian networks: ANDES, CPCS, and PATHFINDER. We observed that on each of these networks the EPIS-BN algorithm gives us a considerable improvement over the current state-of-the-art algorithm, the AIS-BN algorithm. In addition, it avoids the costly learning stage of the AIS-BN algorithm.