Regression
The Challenge of Differentially Private Screening Rules
Khanna, Amol, Lu, Fred, Raff, Edward
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data analysis, especially in information retrieval problems where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need of privacy-preserving models in information retrieval, to the best of our knoweledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.
An End to End Machine Learning Model Development Guide Using BigQuery ML
Now, depending on your use case, You can identify the type of machine learning model you need (We will not be talking about the core ML concepts and how to choose the ML model for the use case in this article.). Once we have the type of ML model that we need, we can use the GoogleSQL queries to create the model. The syntax for creating the model can be found here. For example, to create a logistic regression model using the Google Analytics sample dataset for BigQuery. The following GoogleSQL query is used to create the model you use to predict whether a website visitor will make a transaction.
Trustera: A Live Conversation Redaction System
Gouvêa, Evandro, Dadgar, Ali, Jalalvand, Shahab, Chengalvarayan, Rathi, Jayakumar, Badrinath, Price, Ryan, Ruiz, Nicholas, McGovern, Jennifer, Bangalore, Srinivas, Stern, Ben
Trustera, the first functional system that redacts personally identifiable information (PII) in real-time spoken conversations to remove agents' need to hear sensitive information while preserving the naturalness of live customer-agent conversations. As opposed to post-call redaction, audio masking starts as soon as the customer begins speaking to a PII entity. This significantly reduces the risk of PII being intercepted or stored in insecure data storage. Trustera's architecture consists of a pipeline of automatic speech recognition, natural language understanding, and a live audio redactor module. The system's goal is three-fold: redact entities that are PII, mask the audio that goes to the agent, and at the same time capture the entity, so that the captured PII can be used for a payment transaction or caller identification. Trustera is currently being used by thousands of agents to secure customers' sensitive information.
Controlled Descent Training
Andersson, Viktor, Varga, Balázs, Szolnoky, Vincent, Syrén, Andreas, Jörnsten, Rebecka, Kulcsár, Balázs
In this work, a novel and model-based artificial neural network (ANN) training method is developed supported by optimal control theory. The method augments training labels in order to robustly guarantee training loss convergence and improve training convergence rate. Dynamic label augmentation is proposed within the framework of gradient descent training where the convergence of training loss is controlled. First, we capture the training behavior with the help of empirical Neural Tangent Kernels (NTK) and borrow tools from systems and control theory to analyze both the local and global training dynamics (e.g. stability, reachability). Second, we propose to dynamically alter the gradient descent training mechanism via fictitious labels as control inputs and an optimal state feedback policy. In this way, we enforce locally $\mathcal{H}_2$ optimal and convergent training behavior. The novel algorithm, \textit{Controlled Descent Training} (CDT), guarantees local convergence. CDT unleashes new potentials in the analysis, interpretation, and design of ANN architectures. The applicability of the method is demonstrated on standard regression and classification problems.
Easy Differentially Private Linear Regression
Amin, Kareem, Joseph, Matthew, Ribero, Mónica, Vassilvitskii, Sergei
Linear regression is a fundamental tool for statistical analysis. This has motivated the development of linear regression methods that also satisfy differential privacy and thus guarantee that the learned model reveals little about any one data point used to construct it. However, existing differentially private solutions assume that the end user can easily specify good data bounds and hyperparameters. Both present significant practical obstacles. In this paper, we study an algorithm which uses the exponential mechanism to select a model with high Tukey depth from a collection of non-private regression models. Given $n$ samples of $d$-dimensional data used to train $m$ models, we construct an efficient analogue using an approximate Tukey depth that runs in time $O(d^2n + dm\log(m))$. We find that this algorithm obtains strong empirical performance in the data-rich setting with no data bounds or hyperparameter selection required.
Machine learning based biomedical image processing for echocardiographic images
Heena, Ayesha, Biradar, Nagashettappa, Maroof, Najmuddin M., Bhatia, Surbhi, Agarwal, Rashmi, Prasad, Kanta
The popularity of Artificial intelligence and machine learning have prompted researchers to use it in the recent researches. The proposed method uses K-Nearest Neighbor (KNN) algorithm for segmentation of medical images, extracting of image features for analysis by classifying the data based on the neural networks. Classification of the images in medical imaging is very important, KNN is one suitable algorithm which is simple, conceptual and computational, which provides very good accuracy in results. KNN algorithm is a unique user-friendly approach with wide range of applications in machine learning algorithms which are majorly used for the various image processing applications including classification, segmentation and regression issues of the image processing. The proposed system uses gray level co-occurrence matrix features. The trained neural network has been tested successfully on a group of echocardiographic images, errors were compared using regression plot. The results of the algorithm are tested using various quantitative as well as qualitative metrics and proven to exhibit better performance in terms of both quantitative and qualitative metrics in terms of current state -of-the-art methods in the related area. To compare the performance of trained neural network the regression analysis performed showed a good correlation.
Human-AI Collaboration: The Effect of AI Delegation on Human Task Performance and Task Satisfaction
Hemmer, Patrick, Westphal, Monika, Schemmer, Max, Vetter, Sebastian, Vössing, Michael, Satzger, Gerhard
Recent work has proposed artificial intelligence (AI) models that can learn to decide whether to make a prediction for an instance of a task or to delegate it to a human by considering both parties' capabilities. In simulations with synthetically generated or context-independent human predictions, delegation can help improve the performance of human-AI teams -- compared to humans or the AI model completing the task alone. However, so far, it remains unclear how humans perform and how they perceive the task when they are aware that an AI model delegated task instances to them. In an experimental study with 196 participants, we show that task performance and task satisfaction improve through AI delegation, regardless of whether humans are aware of the delegation. Additionally, we identify humans' increased levels of self-efficacy as the underlying mechanism for these improvements in performance and satisfaction. Our findings provide initial evidence that allowing AI models to take over more management responsibilities can be an effective form of human-AI collaboration in workplaces.
Inadmissibility of the corrected Akaike information criterion
For the multivariate linear regression model with unknown covariance, the corrected Akaike information criterion is the minimum variance unbiased estimator of the expected Kullback--Leibler discrepancy. In this study, based on the loss estimation framework, we show its inadmissibility as an estimator of the Kullback--Leibler discrepancy itself, instead of the expected Kullback--Leibler discrepancy. We provide improved estimators of the Kullback--Leibler discrepancy that work well in reduced-rank situations and examine their performance numerically.
Forecasting Particle Accelerator Interruptions Using Logistic LASSO Regression
Li, Sichen, Snuverink, Jochem, Perez-Cruz, Fernando, Adelmann, Andreas
Unforeseen particle accelerator interruptions, also known as interlocks, lead to abrupt operational changes despite being necessary safety measures. These may result in substantial loss of beam time and perhaps even equipment damage. We propose a simple yet powerful binary classification model aiming to forecast such interruptions, in the case of the High Intensity Proton Accelerator complex at the Paul Scherrer Institut. The model is formulated as logistic regression penalized by least absolute shrinkage and selection operator, based on a statistical two sample test to distinguish between unstable and stable states of the accelerator. The primary objective for receiving alarms prior to interlocks is to allow for countermeasures and reduce beam time loss. Hence, a continuous evaluation metric is developed to measure the saved beam time in any period, given the assumption that interlocks could be circumvented by reducing the beam current. The best-performing interlock-to-stable classifier can potentially increase the beam time by around 5 min in a day. Possible instrumentation for fast adjustment of the beam current is also listed and discussed.
Optimal Sampling Designs for Multi-dimensional Streaming Time Series with Application to Power Grid Sensor Data
Xie, Rui, Bai, Shuyang, Ma, Ping
The Internet of Things (IoT) system generates massive high-speed temporally correlated streaming data and is often connected with online inference tasks under computational or energy constraints. Online analysis of these streaming time series data often faces a trade-off between statistical efficiency and computational cost. One important approach to balance this trade-off is sampling, where only a small portion of the sample is selected for the model fitting and update. Motivated by the demands of dynamic relationship analysis of IoT system, we study the data-dependent sample selection and online inference problem for a multi-dimensional streaming time series, aiming to provide low-cost real-time analysis of high-speed power grid electricity consumption data. Inspired by D-optimality criterion in design of experiments, we propose a class of online data reduction methods that achieve an optimal sampling criterion and improve the computational efficiency of the online analysis. We show that the optimal solution amounts to a strategy that is a mixture of Bernoulli sampling and leverage score sampling. The leverage score sampling involves auxiliary estimations that have a computational advantage over recursive least squares updates. Theoretical properties of the auxiliary estimations involved are also discussed. When applied to European power grid consumption data, the proposed leverage score based sampling methods outperform the benchmark sampling method in online estimation and prediction. The general applicability of the sampling-assisted online estimation method is assessed via simulation studies.