 Regression


dpart: Differentially Private Autoregressive Tabular, a General Framework for Synthetic Data Generation

arXiv.org Artificial Intelligence

We propose dpart, a general, flexible, and scalable framework and open source Python library for differentially private synthetic data generation. Central to the approach is autoregressive modelling: breaking the joint data distribution into a sequence of lower-dimensional conditional distributions, captured by various methods such as machine learning models (logistic/linear regression, decision trees, etc.), simple histogram counts, or custom techniques. The library has been created to serve as a quick and accessible baseline and to accommodate a wide audience of users, from those taking their first steps in synthetic data generation to more experienced ones with domain expertise who can configure different aspects of the modelling and contribute new methods/mechanisms. Specific instances of dpart include Independent, an optimized version of PrivBayes, and a newly proposed model, dp-synthpop. Code: https://github.com/hazy/dpart
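The autoregressive idea described above can be illustrated with a minimal sketch. This is not the dpart API: the function names (`fit`, `sample`, `noisy_distribution`), the even split of the privacy budget across columns, and the use of Laplace-noised histogram counts for every conditional are all assumptions made for illustration, and the code has not been vetted as a privacy mechanism.

```python
import itertools
import random
from collections import Counter, defaultdict


def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_distribution(counts, domain, epsilon, rng):
    """Add Laplace noise to histogram counts, clip at zero, and normalise."""
    noisy = [max(counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng), 0.0)
             for v in domain]
    total = sum(noisy)
    if total == 0.0:
        return [1.0 / len(domain)] * len(domain)  # fall back to uniform
    return [c / total for c in noisy]


def fit(rows, domains, epsilon, rng):
    """Estimate p(x0), p(x1 | x0), p(x2 | x0, x1), ... from noisy counts."""
    eps_col = epsilon / len(domains)  # assumption: naive even budget split
    models = []
    for j, domain in enumerate(domains):
        grouped = defaultdict(Counter)
        for row in rows:
            grouped[tuple(row[:j])][row[j]] += 1
        # Enumerate every possible parent combination, seen or not.
        parents = itertools.product(*domains[:j])
        models.append({p: noisy_distribution(grouped[p], domain, eps_col, rng)
                       for p in parents})
    return models


def sample(models, domains, n, rng):
    """Draw synthetic rows column by column, conditioning on earlier columns."""
    synthetic = []
    for _ in range(n):
        row = []
        for j, domain in enumerate(domains):
            weights = models[j][tuple(row)]
            row.append(rng.choices(domain, weights=weights)[0])
        synthetic.append(tuple(row))
    return synthetic


if __name__ == "__main__":
    rng = random.Random(0)
    domains = [["young", "old"], ["low", "high"]]
    rows = [("young", "low")] * 40 + [("old", "high")] * 40
    models = fit(rows, domains, epsilon=2.0, rng=rng)
    print(sample(models, domains, 5, rng))
```

Enumerating every parent combination keeps the sketch simple but grows exponentially with the number of columns, which is one reason a real system would restrict each column's parents or swap in a fitted model for the conditional.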


(Nearly) Optimal Private Linear Regression via Adaptive Clipping

arXiv.org Artificial Intelligence

We study the problem of differentially private linear regression where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD) where points in each iteration are sampled without replacement. Noise is added for DP, but the noise standard deviation is estimated online. Compared to existing $(\epsilon, \delta)$-DP techniques, which have sub-optimal error bounds, DP-AMBSSGD is able to provide nearly optimal error bounds in terms of key parameters like dimensionality $d$, number of points $N$, and the standard deviation $\sigma$ of the noise in observations. For example, when the $d$-dimensional covariates are sampled i.i.d. from the normal distribution, the excess error of DP-AMBSSGD due to privacy is $\frac{\sigma^2 d}{N}(1+\frac{d}{\epsilon^2 N})$, i.e., the error is meaningful when the number of samples $N= \Omega(d \log d)$, which is the standard operative regime for linear regression. In contrast, error bounds for existing efficient methods in this setting are $\mathcal{O}\big(\frac{d^3}{\epsilon^2 N^2}\big)$, even for $\sigma=0$. That is, for constant $\epsilon$, the existing techniques require $N=\Omega(d\sqrt{d})$ to provide a non-trivial result.
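The $N=\Omega(d\sqrt{d})$ requirement quoted above can be recovered with a short calculation. The following is a sketch, not taken from the paper: it treats "non-trivial" as "the bound is at most some constant $c$" (an assumption) and ignores constants and logarithmic factors. The second line simply expands the DP-AMBSSGD expression to show when its $\epsilon$-dependent term is no larger than the first term.

```latex
% Existing efficient methods: bound of order d^3 / (eps^2 N^2).
\frac{d^3}{\epsilon^2 N^2} \le c
\;\Longleftrightarrow\;
N \ge \frac{d^{3/2}}{\epsilon \sqrt{c}}
\quad\Longrightarrow\quad
N = \Omega\big(d\sqrt{d}\big) \ \text{for constant } \epsilon .

% DP-AMBSSGD: expand the stated expression into its two terms.
\frac{\sigma^2 d}{N}\Big(1 + \frac{d}{\epsilon^2 N}\Big)
= \frac{\sigma^2 d}{N} + \frac{\sigma^2 d^2}{\epsilon^2 N^2},
\qquad
\frac{\sigma^2 d^2}{\epsilon^2 N^2} \le \frac{\sigma^2 d}{N}
\;\Longleftrightarrow\;
N \ge \frac{d}{\epsilon^2} .
```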


Machine Learning Assisted Approach for Security-Constrained Unit Commitment

arXiv.org Artificial Intelligence

Security-constrained unit commitment (SCUC) is solved for power system day-ahead generation scheduling; it is a large-scale mixed-integer linear programming problem and is very computationally intensive. Model reduction of SCUC may bring significant time savings. In this work, a novel approach is proposed to effectively utilize machine learning (ML) to reduce the problem size of SCUC. An ML model using the logistic regression (LR) algorithm is proposed and trained with historical nodal demand profiles and the respective commitment schedules. The ML outputs are processed and analyzed to reduce variables and constraints in SCUC. The proposed approach is validated on several standard test systems, including the IEEE 24-bus system, the IEEE 73-bus system, the IEEE 118-bus system, the synthetic South Carolina 500-bus system, and the Polish 2383-bus system. Simulation results demonstrate that using the predictions of the proposed LR model in SCUC model reduction can substantially reduce the computing time while maintaining solution quality.


NumS: Scalable Array Programming for the Cloud

arXiv.org Artificial Intelligence

Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.


Most Famous Supervised Learning Algorithms

#artificialintelligence

Supervised learning is the type of machine learning in which machines are trained on well-"labeled" training data and, on the basis of that data, predict the output. Labeled data means that some input data is already tagged with the correct output. The aim of a supervised learning algorithm is to find a mapping function from the input variable (x) to the output variable (y). Supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc. Regression algorithms are used when there is a relationship between the input variable and a continuous output variable. Classification algorithms are used when the output variable is categorical, which means it falls into discrete classes such as Yes-No, Male-Female, True-False, etc. Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable.
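To make the mapping from x to y concrete, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares formulas. The function name `fit_line` and the toy data are illustrative assumptions, not part of the original article.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled training data: each input x is tagged with its correct output y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the dependent variable at x = 5
print(slope, intercept, prediction)
```

Here y is the dependent variable and x is the variable used to predict it.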


Detecting People Interested in Non-Suicidal Self-Injury on Social Media

arXiv.org Artificial Intelligence

Non-Suicidal Self-Injury (NSSI) is the intentional destruction of body tissue without the intent to commit suicide [1]. It is particularly prevalent among adolescents and young adults as a means of emotional control and release. Typical NSSI activities include skin cutting, banging or hitting oneself, and burns. Recent prevalence estimates suggest that 14%-21% of adolescents and 17%-25% of young adults have engaged in NSSI at some point in their lives. NSSI is repeatedly found to be associated with significant emotional and behavioral dysfunction (such as eating disorders and suicide).


A novel evaluation methodology for supervised Feature Ranking algorithms

arXiv.org Artificial Intelligence

In the domains of both Feature Selection and Interpretable AI, there exists a desire to 'rank' features based on their importance. Such feature importance rankings can then be used either to (1) reduce the dataset size or (2) interpret the Machine Learning model. In the literature, however, such Feature Rankers are not evaluated in a systematic, consistent way. Many papers have a different way of arguing which feature importance ranker works best. This paper fills this gap by proposing a new evaluation methodology. By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation. To facilitate large-scale experimentation using the new methodology, a benchmarking framework called fseval was built in Python. The framework allows running experiments in parallel and distributed over machines on HPC systems. By integrating with an online platform called Weights and Biases, charts can be interactively explored on a live dashboard. The software was released as open-source software and is published as a package on the PyPI platform. The research concludes by exploring one such large-scale experiment to find the strengths and weaknesses of the participating algorithms on many fronts.
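The core idea, that a synthetic dataset makes the true feature importances known in advance, can be shown with a toy sketch. This does not use fseval or reproduce the paper's methodology: the linear data generator, the correlation-based ranker, and all function names are assumptions chosen for illustration.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def make_dataset(n, weights, rng):
    """Synthetic data: y is a noisy linear function of the features."""
    X = [[rng.gauss(0.0, 1.0) for _ in weights] for _ in range(n)]
    y = [sum(w * x for w, x in zip(weights, row)) + rng.gauss(0.0, 0.5)
         for row in X]
    return X, y


def rank_by_correlation(X, y):
    """A deliberately simple feature ranker: sort by |corr(feature, y)|."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])


rng = random.Random(0)
true_weights = [3.0, 1.0, 0.0]  # ground-truth importance, known by design
X, y = make_dataset(500, true_weights, rng)

predicted = rank_by_correlation(X, y)
expected = sorted(range(len(true_weights)), key=lambda j: -abs(true_weights[j]))
print("ranker output:", predicted, "ground truth:", expected)
```

Because the generating weights are chosen by hand, any ranker's output can be scored directly against `expected` instead of being argued for case by case.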


CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Journal of Artificial Intelligence Research

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model's performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.


Discussing a multiple regression model

#artificialintelligence

In this part we consider a curious example. As our reference for this case study puts it [7, p. 74]: "If we were the only ones in the world with access to this info, we could be the best Boston real-estate investors in 1978! Unless, somehow, someone were able to build an even more accurate estimate ..." This is the Boston House problem. It is widely used as a benchmark for machine learning, often in competitions. The task is "to estimate the median value of the house prices in a neighborhood (MEDV) given all the input features from the neighborhood." This problem differs from the previous one only in that we have several inputs instead of just one. It is closer to reality, since most problems, at least the useful ones, demand more than humans can do with simple models or in their heads, and machine learning is good at exactly that. Given enough computing power and time, these models solve such problems with ease. One interesting question to reflect on is how to interpret their inner workings, beyond just prediction. Prediction is the process by which we estimate what comes next in time for a system (e.g., the stock market or demand at a company). "Is there any way to peek inside the model to see how it understands the data?….


Linear Regression vs. Logistic Regression

#artificialintelligence

I am writing this article to build a deeper understanding of the similarities and differences between the linear and logistic regression algorithms, and of how they work, with the help of their code. Linear regression is a supervised machine learning algorithm and a statistical method used to study the relationship between two continuous variables, i.e., a dependent and an independent variable. It predicts continuous values and finds the best-fitting line that describes the variables. Logistic regression is used to predict categorical data. It is another supervised machine learning algorithm, used to statistically analyze a dataset in which one or more independent variables determine an outcome.
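The contrast can be seen in a minimal logistic regression sketch: where linear regression outputs a value on a continuous scale, logistic regression passes a linear score through a sigmoid to get a class probability. The code below is an illustrative assumption rather than the article's own code. It uses plain gradient descent on the log-loss, and the toy data (hours studied versus pass/fail) is hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, steps=2000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - t for x, t in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: hours studied (input) and pass/fail (categorical output).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
passed = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for h in (1.0, 3.5, 6.0):
    p = sigmoid(w * h + b)
    print(f"hours={h}: p(pass)={p:.2f}, predicted class={int(p >= 0.5)}")
```

The model is still linear in its inputs; only the output differs, being a probability that is thresholded (here at 0.5) to produce a category.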