 Accuracy


Decoupling Shrinkage and Selection for the Bayesian Quantile Regression

arXiv.org Machine Learning

While modern-day economics, and social science research more broadly, is often faced with high dimensional estimation problems in which the number of potential explanatory variables is large, often larger than the number of sample observations, the extant literature on high dimensional methods has focused mainly on conditional mean models. Moving beyond the conditional mean by estimating quantile regressions, on the other hand, allows one to gauge potentially heterogeneous effects of variables directly across the conditional response distribution. While highly influential in the risk-management and finance literature in calculating risk measures such as VaR (i.e., the loss a portfolio's value incurs at a specific probability level), quantile regression has experienced a recent surge in popularity within the macroeconomic literature to quantify risks and vulnerabilities of output growth in response to summary measures of financial health, aptly named growth-at-risk (GaR) (Adrian et al., 2019; Figueres and Jarociński, 2020; Adams et al., 2020). As an important distinction from the literature that focuses on forecasting crisis periods directly, such as through Markov-switching models (Hubrich and Tetlow, 2015; Guérin and Marcellino, 2013) or probit models (McCracken et al., 2021), GaR instead gives information about the accumulation of risks facing an economy. Since sources of risk can be numerous, high dimensional quantile problems are becoming ever more pertinent to policy makers and practitioners alike, which has spurred methods that deal with variable selection and shrinkage for the quantile regression problem (Chernozhukov et al., 2010; Kohns and Szendrei, 2020; Hasenzagl et al., 2020).
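To make the contrast with conditional mean models concrete, here is a minimal sketch of the check (pinball) loss that quantile regression minimizes. It is not taken from the paper: the function name `pinball_loss` and the toy data are assumptions for illustration, and the "model" is just a constant prediction, with no covariates, shrinkage, or Bayesian prior.

```python
def pinball_loss(y, q, tau):
    """Average check (pinball) loss of predicting the constant q at quantile level tau."""
    total = 0.0
    for v in y:
        u = v - q
        # Under-predictions (u >= 0) are weighted by tau, over-predictions by 1 - tau.
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(y)


# Hypothetical toy sample with one extreme value.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Search over the observed values for the constant that minimizes the loss.
best_median = min(y, key=lambda q: pinball_loss(y, q, 0.5))
best_upper = min(y, key=lambda q: pinball_loss(y, q, 0.9))

print(best_median, best_upper)
```

At tau = 0.5 the minimizer is the sample median, which is insensitive to the extreme value, whereas at tau = 0.9 the minimizer moves into the upper tail. This is the sense in which different quantile levels can reveal effects that a single mean estimate averages away.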


Naive Bayes Classifier Tutorial in Python and Scikit-Learn

#artificialintelligence

Naive Bayes Classifier is a simple model that's usually used in classification problems. Despite being simple, it has shown very good results, outperforming by far other, more complicated models. This is the second article in a series of two about the Naive Bayes Classifier and it will deal with the implementation of the model in Scikit-Learn with Python. For a detailed overview of the math and the principles behind the model, please check the other article: Naive Bayes Classifier Explained. In the previous article linked above, I introduced a table of some data that we can train our classifier on.


20 Data Science Interview Questions for a Beginner

#artificialintelligence

Success is a process not an event. Data Science is growing rapidly in all sectors. With the availability of so many technologies within the Data Science domain, it becomes tricky to crack any Data Science interview. In this article, we have tried to cover the most common Data Science interview questions asked by recruiters. Answer: The question can also be phrased as to why linear regression is not a very effective algorithm.


Why we moved away from Conversion Rate as a primary metric

#artificialintelligence

Changing the primary metric for any company is not a straightforward thing to do. This shortlist alone would cause anyone to break out into a sweat! So imagine my surprise when, in my first month at Gousto, the Head of Product for our Menu Tribe suggested that our primary metric, Menu Conversion Rate (MCVR), was likely to be the wrong metric for our experiments because it violated assumptions of statistical independence. When you use conversion as a metric for your experiments, you do so, either consciously or subconsciously, under the assumption that you have a sample of independent observations. This means that the outcome of one event has no impact on the outcome of another event.


Active learning for online training in imbalanced data streams under cold start

arXiv.org Machine Learning

Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive or ii) subject to long delays, if relying on victims filing complaints. The latter may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where it is used as warm-up. Then, we perform empirical studies in four real world datasets, with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies. Its observed gains over random sampling can reach 80%, and it can be competitive with policies with an unlimited annotation budget or additional historical data (with 1/10 to 1/50 of the labels).


Mastering XGBoost

#artificialintelligence

In the case of XGBoost, it is more useful to discuss hyperparameter tuning than the underlying mathematics because hyperparameter tuning is unusually complex, time-consuming, and necessary for deployment, whereas the mathematics are already embedded in the code libraries. While manual hyperparameter tuning is essential and time-consuming in many machine learning algorithms or models, it is especially so in XGBoost. Therefore, while this section focuses on identifying a key element to deploying XGBoost -- in our case study and example here to predict new fashions ("fast fashion") to gain competitive advantage in online apparel sales -- these hyperparameter tuning lessons are valid for all applications of XGBoost, and for many of the other machine learning model applications herein. Understanding the distinction between parameters and hyperparameters, and the role of each, is critical to affordable, timely, and accurate machine learning deployments. A core benefit of machine learning is its ability to discover and identify patterns and regularities in Big Data by automatically tuning many thousands or millions of "learnable" parameters. For example, in tree-based models like XGBoost (and decision trees and random forests), these learnable parameters are how many decision variables are at each node.


Confronting Abusive Language Online: A Survey from the Ethical and Human Rights Perspective

Journal of Artificial Intelligence Research

The pervasiveness of abusive content on the internet can lead to severe psychological and physical harm. Significant effort in Natural Language Processing (NLP) research has been devoted to addressing this problem through abusive content detection and related sub-areas, such as the detection of hate speech, toxicity, cyberbullying, etc. Although current technologies achieve high classification performance in research studies, it has been observed that the real-life application of this technology can cause unintended harms, such as the silencing of under-represented groups. We review a large body of NLP research on automatic abuse detection with a new focus on ethical challenges, organized around eight established ethical principles: privacy, accountability, safety and security, transparency and explainability, fairness and non-discrimination, human control of technology, professional responsibility, and promotion of human values. In many cases, these principles relate not only to situational ethical codes, which may be context-dependent, but are in fact connected to universal human rights, such as the right to privacy, freedom from discrimination, and freedom of expression. We highlight the need to examine the broad social impacts of this technology, and to bring ethical and human rights considerations to every stage of the application life-cycle, from task formulation and dataset design, to model training and evaluation, to application deployment. Guided by these principles, we identify several opportunities for rights-respecting, socio-technical solutions to detect and confront online abuse, including ‘nudging’, ‘quarantining’, value sensitive design, counter-narratives, style transfer, and AI-driven public education applications.


Adversarial Attack for Uncertainty Estimation: Identifying Critical Regions in Neural Networks

arXiv.org Machine Learning

We propose a novel method to capture data points near the decision boundary of a neural network, which are often associated with a specific type of uncertainty. Our approach performs uncertainty estimation based on the idea of adversarial attack methods. In this paper, uncertainty estimates are derived from input perturbations, unlike previous studies that perturb the model's parameters, as in the Bayesian approach. We are able to produce uncertainty estimates with only a couple of perturbations on the inputs. We apply the proposed method to datasets derived from blockchain. We compare its performance with that of the most recent uncertainty methods. We show that the proposed method significantly outperforms other methods and carries less risk when capturing model uncertainty in machine learning.


A unified framework for bandit multiple testing

arXiv.org Machine Learning

In bandit multiple hypothesis testing, each arm corresponds to a different null hypothesis that we wish to test, and the goal is to design adaptive algorithms that correctly identify a large set of interesting arms (true discoveries), while only mistakenly identifying a few uninteresting ones (false discoveries). One common metric in non-bandit multiple testing is the false discovery rate (FDR). We propose a unified, modular framework for bandit FDR control that emphasizes the decoupling of exploration and summarization of evidence. We utilize the powerful martingale-based concept of "e-processes" to ensure FDR control for arbitrary composite nulls, exploration rules and stopping times in generic problem settings. In particular, valid FDR control holds even if the reward distributions of the arms could be dependent, multiple arms may be queried simultaneously, and multiple (cooperating or competing) agents may be querying arms, covering combinatorial semi-bandit type settings as well. Prior work has considered in great detail the setting where each arm's reward distribution is independent and sub-Gaussian, and a single arm is queried at each step. Our framework recovers matching sample complexity guarantees in this special case, and performs comparably or better in practice. For other settings, sample complexities will depend on the finer details of the problem (composite nulls being tested, exploration algorithm, data dependence structure, stopping rule) and we do not explore these; our contribution is to show that the FDR guarantee is clean and entirely agnostic to these details.
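For readers unfamiliar with FDR control in the non-bandit setting mentioned above, the following is a minimal sketch of the classical Benjamini-Hochberg step-up procedure. It is not the e-process framework proposed in the paper; it assumes a fixed batch of p-values (independent, or positively dependent), and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values for eight null hypotheses.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p, alpha=0.05)
print(rejected)
```

Note that several p-values below 0.05 are not rejected here: the thresholds scale with rank, which is what keeps the expected proportion of false discoveries at or below alpha under the stated assumptions.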


Innovations Autoencoder and its Application in One-class Anomalous Sequence Detection

arXiv.org Machine Learning

An innovations sequence of a time series is a sequence of independent and identically distributed random variables with which the original time series has a causal representation. The innovation at a time is statistically independent of the history of the time series. As such, it represents the new information contained at present but not in the past. Because of its simple probability structure, an innovations sequence is the most efficient signature of the original. Unlike the principal or independent component analysis representations, an innovations sequence preserves not only the complete statistical properties but also the temporal order of the original time series. A long-standing open problem is to find a computationally tractable way to extract an innovations sequence of non-Gaussian processes. This paper presents a deep learning approach, referred to as Innovations Autoencoder (IAE), that extracts innovations sequences using a causal convolutional neural network. An application of IAE to the one-class anomalous sequence detection problem with unknown anomaly and anomaly-free models is also presented.