Regression
Differentially Private Estimation via Statistical Depth
Constructing a differentially private (DP) estimator requires deriving the maximum influence of an observation, which can be difficult in the absence of exogenous bounds on the input data or the estimator, especially in high dimensional settings. This paper shows that standard notions of statistical depth, i.e., halfspace depth and regression depth, are particularly advantageous in this regard, both in the sense that the maximum influence of a single observation is easy to analyze and that this value is typically low. This is used to motivate new approximate DP location and regression estimators using the maximizers of these two notions of statistical depth. A more computationally efficient variant of the approximate DP regression estimator is also provided. Also, to avoid requiring that users specify a priori bounds on the estimates and/or the observations, variants of these DP mechanisms are described that satisfy random differential privacy (RDP), which is a relaxation of differential privacy provided by Hall, Wasserman, and Rinaldo (2013). We also provide simulations of the two DP regression methods proposed here. The proposed estimators appear to perform favorably relative to the existing DP regression methods we consider in these simulations when either the sample size is at least 100-200 or the privacy-loss budget is sufficiently high.
UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu
Amjad, Maaz, Sidorov, Grigori, Zhila, Alisa, Gelbukh, Alexander, Rosso, Paolo
This paper gives the overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language. This is a binary classification task in which the goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing. The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. 42 teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task. 9 teams submitted their experimental results. The participants used various machine learning methods ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score value of 0.90, showing that the BERT-based approach outperforms other machine learning classifiers.
YouTube Spam Comment Prediction - Projects Based Learning
Process Comma-separated values file (ie file with .csv Convert String data to Numeric format so we can process the data in Apache Spark ML Library. Welcome to this project on creating prediction model to Identify spam comment in Apache Spark Machine Learning using Databricks platform community edition server which allows you to execute your spark code, free of cost on their server just by registering through email id. In this project we explore Apache Spark and Machine Learning on the Databricks platform. I am a firm believer that the best way to learn is by doing.
Testing the Robustness of Learned Index Structures
Bachfischer, Matthias, Borovica-Gajic, Renata, Rubinstein, Benjamin I. P.
While early empirical evidence has supported the case for learned index structures as having favourable average-case performance, little is known about their worst-case performance. By contrast, classical structures are known to achieve optimal worst-case behaviour. This work evaluates the robustness of learned index structures in the presence of adversarial workloads. To simulate adversarial workloads, we carry out a data poisoning attack on linear regression models that manipulates the cumulative distribution function (CDF) on which the learned index model is trained. The attack deteriorates the fit of the underlying ML model by injecting a set of poisoning keys into the training dataset, which leads to an increase in the prediction error of the model and thus deteriorates the overall performance of the learned index structure. We assess the performance of various regression methods and the learned index implementations ALEX and PGM-Index. We show that learned index structures can suffer from a significant performance deterioration of up to 20% when evaluated on poisoned vs. non-poisoned datasets.
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU.
5 Minutes Cheat Sheet Explaining all Machine Learning Models
Many times, it happens that you have an interview in a few days, and your schedule is jam-packed to prepare for it. Or maybe you are in revision mode and want to look at all the basic popular machine learning models. If that is the case, you have come to the right place. In this blog, I will briefly explain some of the most commonly asked machine learning models in interviews. I will also list important parameters related to each model and a source to find a detailed explanation of the same topic, so you can dig deeper if and when required.
Correntropy-Based Logistic Regression with Automatic Relevance Determination for Robust Sparse Brain Activity Decoding
Li, Yuanhao, Chen, Badong, Shi, Yuxi, Yoshimura, Natsue, Koike, Yasuharu
Recent studies have utilized sparse classifications to predict categorical variables from high-dimensional brain activity signals to expose human's intentions and mental states, selecting the relevant features automatically in the model training process. However, existing sparse classification models will likely be prone to the performance degradation which is caused by noise inherent in the brain recordings. To address this issue, we aim to propose a new robust and sparse classification algorithm in this study. To this end, we introduce the correntropy learning framework into the automatic relevance determination based sparse classification model, proposing a new correntropy-based robust sparse logistic regression algorithm. To demonstrate the superior brain activity decoding performance of the proposed algorithm, we evaluate it on a synthetic dataset, an electroencephalogram (EEG) dataset, and a functional magnetic resonance imaging (fMRI) dataset. The extensive experimental results confirm that not only the proposed method can achieve higher classification accuracy in a noisy and high-dimensional classification task, but also it would select those more informative features for the decoding scenarios. Integrating the correntropy learning approach with the automatic relevance determination technique will significantly improve the robustness with respect to the noise, leading to more adequate robust sparse brain decoding algorithm. It provides a more powerful approach in the real-world brain activity decoding and the brain-computer interfaces.
Provably tuning the ElasticNet across instances
Balcan, Maria-Florina, Khodak, Mikhail, Sharma, Dravyansh, Talwalkar, Ameet
An important unresolved challenge in the theory of regularization is to set the regularization coefficients of popular techniques like the ElasticNet with general provable guarantees. We consider the problem of tuning the regularization parameters of Ridge regression, LASSO, and the ElasticNet across multiple problem instances, a setting that encompasses both cross-validation and multi-task hyperparameter optimization. We obtain a novel structural result for the ElasticNet which characterizes the loss as a function of the tuning parameters as a piecewise-rational function with algebraic boundaries. We use this to bound the structural complexity of the regularized loss functions and show generalization guarantees for tuning the ElasticNet regression coefficients in the statistical setting. We also consider the more challenging online learning setting, where we show vanishing average expected regret relative to the optimal parameter pair. We further extend our results to tuning classification algorithms obtained by thresholding regression fits regularized by Ridge, LASSO, or ElasticNet. Our results are the first general learning-theoretic guarantees for this important class of problems that avoid strong assumptions on the data distribution. Furthermore, our guarantees hold for both validation and popular information criterion objectives.
Data-Centric Epidemic Forecasting: A Survey
Rodrรญguez, Alexander, Kamarthi, Harshavardhan, Agarwal, Pulak, Ho, Javen, Patel, Mira, Sapre, Suchet, Prakash, B. Aditya
The COVID-19 pandemic has brought forth the importance of epidemic forecasting for decision makers in multiple domains, ranging from public health to the economy as a whole. While forecasting epidemic progression is frequently conceptualized as being analogous to weather forecasting, however it has some key differences and remains a non-trivial task. The spread of diseases is subject to multiple confounding factors spanning human behavior, pathogen dynamics, weather and environmental conditions. Research interest has been fueled by the increased availability of rich data sources capturing previously unobservable facets and also due to initiatives from government public health and funding agencies. This has resulted, in particular, in a spate of work on 'data-centered' solutions which have shown potential in enhancing our forecasting capabilities by leveraging non-traditional data sources as well as recent innovations in AI and machine learning. This survey delves into various data-driven methodological and practical advancements and introduces a conceptual framework to navigate through them. First, we enumerate the large number of epidemiological datasets and novel data streams that are relevant to epidemic forecasting, capturing various factors like symptomatic online surveys, retail and commerce, mobility, genomics data and more. Next, we discuss methods and modeling paradigms focusing on the recent data-driven statistical and deep-learning based methods as well as on the novel class of hybrid models that combine domain knowledge of mechanistic models with the effectiveness and flexibility of statistical approaches. We also discuss experiences and challenges that arise in real-world deployment of these forecasting systems including decision-making informed by forecasts. Finally, we highlight some challenges and open problems found across the forecasting pipeline.
Quantum Algorithms and Lower Bounds for Linear Regression with Norm Constraints
Lasso and Ridge are important minimization problems in machine learning and statistics. They are versions of linear regression with squared loss where the vector $\theta\in\mathbb{R}^d$ of coefficients is constrained in either $\ell_1$-norm (for Lasso) or in $\ell_2$-norm (for Ridge). We study the complexity of quantum algorithms for finding $\varepsilon$-minimizers for these minimization problems. We show that for Lasso we can get a quadratic quantum speedup in terms of $d$ by speeding up the cost-per-iteration of the Frank-Wolfe algorithm, while for Ridge the best quantum algorithms are linear in $d$, as are the best classical algorithms. As a byproduct of our quantum lower bound for Lasso, we also prove the first classical lower bound for Lasso that is tight up to polylog-factors.