
Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation

Zhou, Hongyi, Hanna, Josiah P., Zhu, Jin, Yang, Ying, Shi, Chengchun

arXiv.org Machine Learning

This paper studies off-policy evaluation (OPE) in reinforcement learning, focusing on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, why conditioning on history should lower the MSE has remained an open question. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, we show that the variance decreases consistently as the estimated behavior policy conditions on a longer history. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator, and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
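The variance effect described above can be seen in a minimal bandit-style sketch (a toy stand-in, not the paper's sequential setting; the setup and all names are illustrative assumptions): with an estimated behavior policy, the ordinary IS weights cancel the sampling noise exactly, while weights computed from the true behavior policy retain it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one state, two actions, known target policy pi_e.
# The true (Markovian) behavior policy pi_b is treated as unknown.
pi_e = np.array([0.7, 0.3])        # target policy over actions {0, 1}
pi_b = np.array([0.5, 0.5])        # true behavior policy
rewards = np.array([1.0, 0.0])     # deterministic reward per action

# Collect behavior data.
n = 10_000
actions = rng.choice(2, size=n, p=pi_b)
returns = rewards[actions]

# Estimate the behavior policy from the data (maximum likelihood).
pi_b_hat = np.bincount(actions, minlength=2) / n

# Ordinary IS with the *estimated* behavior policy: the empirical
# action frequencies cancel exactly, so the estimate is noise-free here.
v_hat = np.mean(pi_e[actions] / pi_b_hat[actions] * returns)

# Ordinary IS with the *true* behavior policy keeps Monte Carlo noise.
v_hat_true_pib = np.mean(pi_e[actions] / pi_b[actions] * returns)

true_value = float(pi_e @ rewards)  # 0.7 in this toy setup
```

In this degenerate one-state case the estimated-policy weights make the estimator exact, an extreme form of the variance reduction the paper analyzes; the finite-sample bias the paper identifies only shows up in richer sequential settings.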


Demystifying the Optimal Performance of Multi-Class Classification

Neural Information Processing Systems

Classification is a fundamental task in science and engineering on which machine learning methods have shown outstanding performance. However, it is challenging to determine whether such methods have achieved the Bayes error rate, that is, the lowest error rate attainable by any classifier. This is mainly because the Bayes error rate is not known in general, and hence effectively estimating it is paramount. Inspired by the work of Ishida et al. (2023), we propose an estimator for the Bayes error rate of supervised multi-class classification problems. We analyze several theoretical aspects of this estimator, including its consistency, unbiasedness, convergence rate, variance, and robustness.
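For intuition about the quantity being estimated, here is a hedged Monte Carlo sketch of the Bayes error rate for a case where the class-conditional densities are known, using its definition E_x[1 - max_k p(k|x)]. This is not the estimator the abstract proposes (which does not assume known densities); the two-Gaussian setup is an illustrative assumption chosen because it has a closed-form answer, Phi(-1) ≈ 0.159.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Two equiprobable unit-variance Gaussian classes with means 0 and 2.
mu = np.array([0.0, 2.0])

# Sample from the mixture.
n = 200_000
labels = rng.integers(0, 2, size=n)
x = rng.normal(mu[labels], 1.0)

# Posteriors p(k | x); shared priors and normalizing constants cancel.
dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
post = dens / dens.sum(axis=1, keepdims=True)

# Monte Carlo estimate of the Bayes error: E_x[1 - max_k p(k | x)].
bayes_err_hat = float(np.mean(1.0 - post.max(axis=1)))

# Closed form for this setup: Phi(-|mu_1 - mu_0| / 2) = Phi(-1).
bayes_err_true = 0.5 * (1.0 + erf(-1.0 / sqrt(2.0)))
```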


Demystifying the Hypercomplex: Inductive Biases in Hypercomplex Deep Learning

Comminiello, Danilo, Grassucci, Eleonora, Mandic, Danilo P., Uncini, Aurelio

arXiv.org Artificial Intelligence

Hypercomplex algebras have recently been gaining prominence in the field of deep learning owing to the advantages of their division algebras over real vector spaces and their superior results when dealing with multidimensional signals in real-world 3D and 4D paradigms. This paper provides a foundational framework that serves as a roadmap for understanding why hypercomplex deep learning methods are so successful and how their potential can be exploited. Such a theoretical framework is described in terms of inductive bias, i.e., a collection of assumptions, properties, and constraints that are built into training algorithms to guide their learning process toward more efficient and accurate solutions. We show that it is possible to derive specific inductive biases in the hypercomplex domains, which extend complex numbers to encompass diverse numbers and data structures. These biases prove effective in managing the distinctive properties of these domains, as well as the complex structures of multidimensional and multimodal signals. This novel perspective promises both to demystify this class of methods and to clarify its potential under a unifying framework, thereby promoting hypercomplex models as viable alternatives to traditional real-valued deep learning for multidimensional signal processing.
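As a concrete example of a hypercomplex inductive bias, a quaternion product couples four real components through shared weights: a quaternion "linear layer" built from the Hamilton product reuses 4 real parameters where an unconstrained real layer would use 16. The sketch below is a generic illustration of this weight-sharing pattern, not any specific model from the paper.

```python
import numpy as np

def hamilton_product(p, q):
    """Quaternion product p * q, with components ordered (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def left_mult_matrix(p):
    """The same product expressed as a structured 4x4 matrix acting on q.

    A quaternion 'linear layer' is constrained to this weight-sharing
    pattern: 4 free parameters instead of 16.
    """
    w, x, y, z = p
    return np.array([
        [w, -x, -y, -z],
        [x,  w, -z,  y],
        [y,  z,  w, -x],
        [z, -y,  x,  w],
    ])

p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([5.0, 6.0, 7.0, 8.0])
pq = hamilton_product(p, q)
```

Because quaternions form a division algebra, the product is norm-preserving (|pq| = |p||q|), one of the algebraic properties the paper frames as an inductive bias.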


Demystifying the Myths and Legends of Nonconvex Convergence of SGD

Dutta, Aritra, Bergou, El Houcine, Boucherouite, Soumia, Werge, Nicklas, Kandemir, Melih, Li, Xin

arXiv.org Artificial Intelligence

Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems with nonconvex objective functions. Although the convergence of SGD in the (strongly) convex case is well understood, its convergence for nonconvex functions stands on weaker mathematical foundations. Most existing studies of the nonconvex convergence of SGD show complexity results based on either the minimum of the expected gradient norm or the functional sub-optimality gap (for functions with extra structural properties) by searching the entire range of iterates. Hence the final iterates of SGD do not necessarily satisfy the same complexity guarantee. This paper shows that an $\epsilon$-stationary point exists in the final iterates of SGD, given a large enough total iteration budget $T$, not just anywhere in the entire range of iterates -- a much stronger result than the existing ones. Additionally, our analyses allow us to measure the density of $\epsilon$-stationary points in the final iterates of SGD, and we recover the classical $O(\frac{1}{\sqrt{T}})$ asymptotic rate under various existing assumptions on the objective function and the bounds on the stochastic gradient. As a result of our analyses, we address certain myths and legends related to the nonconvex convergence of SGD and pose some thought-provoking questions that could set new directions for research.
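The claim about final iterates can be illustrated on a toy nonconvex problem (an assumption-laden sketch, not the paper's analysis): run SGD with a 1/sqrt(T) step size and check that an epsilon-stationary point appears among the last 10% of iterates, not merely somewhere along the run.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonconvex objective f(x) = (x^2 - 1)^2 with a noisy gradient oracle.
grad = lambda x: 4.0 * x * (x * x - 1.0)

T = 5_000
lr = 0.5 / np.sqrt(T)              # classical 1/sqrt(T) step size
x = 2.0
grad_norms = np.empty(T)
for t in range(T):
    grad_norms[t] = abs(grad(x))                # true gradient norm at this iterate
    x -= lr * (grad(x) + rng.normal(0.0, 0.1))  # stochastic gradient step

# The point being illustrated: an epsilon-stationary point shows up
# among the *final* iterates, not only somewhere over the whole run.
final_min = grad_norms[-T // 10:].min()
```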


Demystifying the Magic: The Importance of Machine Learning Explainability

#artificialintelligence

Machine learning explainability refers to the ability to understand and interpret the reasoning behind the predictions made by a machine learning model. It is important for ensuring transparency and accountability in the decision-making process. Explainable AI techniques, such as feature importance analysis and model interpretability methods, provide insight into how a model arrives at its output and why it makes certain predictions or decisions. This can help detect and prevent bias, increase trust in AI systems, and facilitate regulatory compliance.
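As a hedged sketch of one such technique, permutation feature importance measures how much a fitted model's accuracy drops when a single feature column is shuffled; the toy "model" below is a stand-in threshold rule, and all names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: feature 0 drives the label, feature 1 is pure noise.
n = 2_000
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)

# A stand-in for any fitted classifier: a simple threshold rule.
predict = lambda X: (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, rng):
    """Accuracy drop when each feature column is shuffled in isolation."""
    base = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base - np.mean(predict(Xp) == y))
    return np.array(drops)

imp = permutation_importance(predict, X, y, rng)
```

The informative feature shows a large accuracy drop while the noise feature shows none, which is exactly the kind of model insight the passage describes.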


Demystifying the Random Forest. Deconstructing and Understanding this…

#artificialintelligence

In classical machine learning, Random Forests have been a silver-bullet type of model. In this post, I want to better understand the components that make up a Random Forest. To accomplish this, I am going to deconstruct the Random Forest into its most basic components and explain what is going on at each level of computation. By the end, we will have a much deeper understanding of how Random Forests work and more intuition for working with them. The examples will focus on classification, but many of the principles apply to regression scenarios as well. Let's start by invoking a classic Random Forest pattern.
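In the spirit of that deconstruction, the sketch below assembles two of the most basic components, bootstrap resampling and majority voting, over depth-1 trees ("stumps"). It is a simplified illustration (a full Random Forest also grows deeper trees and subsamples features at each split), and all names and data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: two features, the label is decided by feature 0 alone.
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)

def fit_stump(X, y):
    """Depth-1 'tree': best (feature, threshold, polarity) by accuracy."""
    best_acc, best = -1.0, None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            acc = np.mean((X[:, j] > t) == y)
            for flip, a in ((False, acc), (True, 1.0 - acc)):
                if a > best_acc:
                    best_acc, best = a, (j, t, flip)
    return best

def fit_forest(X, y, n_trees, rng):
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        stumps.append(fit_stump(X[idx], y[idx]))
    return stumps

def predict_forest(stumps, X):
    votes = np.stack([(X[:, j] > t) != flip for j, t, flip in stumps])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote

stumps = fit_forest(X, y, n_trees=25, rng=rng)
acc = np.mean(predict_forest(stumps, X) == y)
```

Each stump is weak on its own, but the bootstrap-plus-vote combination already captures the core mechanism that makes the full ensemble work.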


Demystifying the five 'sights' of artificial intelligence

#artificialintelligence

Artificial Intelligence as a tech category has become so broad that it's nearly lost all meaning. It encompasses everything from chatbots to autonomous vehicles to scenes from Terminator 2. This ambiguity impacts AI's adoption across many businesses and increases the desire for data privacy protections and greater accountability in AI. Recently, President Biden unveiled an AI Bill of Rights designed to set an AI framework and new standards for AI in government. So how can agency leaders move AI forward responsibly and with confidence? Demystifying and defining what AI can do, or can't do, is the first step in the process.


Demystifying the Modern AI Stack

#artificialintelligence

The rise of artificial intelligence (AI) means that it is more important than ever for developers and engineers to deploy AI projects more quickly and at greater scale across an organization. At the same time, there has been a boom of AI tools and services designed for different purposes, which has made it challenging to evaluate all of them in the quickly evolving environment.


Demystifying the Adversarial Robustness of Random Transformation Defenses

Sitawarin, Chawin, Golan-Strieb, Zachary, Wagner, David

arXiv.org Artificial Intelligence

Neural networks' lack of robustness against attacks raises concerns in security-sensitive settings such as autonomous vehicles. While many countermeasures may look promising, only a few withstand rigorous evaluation. Defenses using random transformations (RT) have shown impressive results, particularly BaRT (Raff et al., 2019) on ImageNet. However, this type of defense has not been rigorously evaluated, leaving its robustness properties poorly understood. The stochasticity of these defenses makes evaluation more challenging and renders many attacks designed for deterministic models inapplicable. First, we show that the BPDA attack (Athalye et al., 2018a) used in BaRT's evaluation is ineffective and likely overestimates its robustness. We then attempt to construct the strongest possible RT defense through informed selection of transformations and Bayesian optimization to tune their parameters. Furthermore, we create the strongest possible attack with which to evaluate our RT defense. Our new attack vastly outperforms the baseline, reducing accuracy by 83% compared to the 19% reduction from the commonly used EoT attack (a $4.3\times$ improvement). Our results indicate that the RT defense on the Imagenette dataset (a ten-class subset of ImageNet) is not robust against adversarial examples. Extending the study further, we use our new attack to adversarially train the RT defense (AdvRT), resulting in a large robustness gain. Code is available at https://github.com/wagner-group/demystify-random-transform.
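For readers unfamiliar with EoT (Expectation over Transformation, Athalye et al., 2018), the idea is to average input gradients over sampled transformations and attack the expected loss. The sketch below applies this to a toy linear model with additive-noise "transformations"; it illustrates the principle only, not the paper's attack, and every name in it is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy linear classifier score w . x; the 'defense' adds random noise to
# the input before scoring (a stand-in for random-transformation defenses).
d = 16
w = rng.normal(size=d)
x = rng.normal(size=d)
y = 1.0                                  # true label in {-1, +1}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def input_grad(x, delta):
    """Gradient w.r.t. x of the logistic loss at the transformed input."""
    z = y * np.dot(w, x + delta)
    return -y * sigmoid(-z) * w

# EoT: average input gradients over many sampled transformations, then
# take a signed (FGSM-style) step that increases the *expected* loss.
k = 128
g = np.mean([input_grad(x, rng.normal(0.0, 0.5, size=d)) for _ in range(k)],
            axis=0)
x_adv = x + 0.2 * np.sign(g)
```

A single sampled gradient would be noisy; averaging over transformations is what lets the attack see through the defense's randomness.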


Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions

He, Peng

arXiv.org Machine Learning

This theoretical paper is devoted to developing a rigorous theory for demystifying the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets on very high dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained analysis of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization and the angular distribution in high-dimensional data space. This angle-based detailed analysis leads to asymptotic characterizations of the gradient norm and directional curvature of the objective function at each gradient descent iteration, revealing that the empirical loss function enjoys nice geometrical properties in the over-parameterized setting. Along the way, we significantly improve existing theoretical bounds on both the over-parameterization condition and the learning rate under very mild assumptions for learning very high dimensional data. Moreover, we uncover the role of the geometrical and spectral properties of the input data in determining the required over-parameterization size and the global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space $\mapsto$ spectra of over-parameterized activation matrices $\mapsto$ favorable geometrical properties of the empirical loss landscape $\mapsto$ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms have much stronger statistical guarantees, under a much milder over-parameterization condition, than existing theory states for learning very high dimensional data, a regime that has rarely been explored so far.