 Regression


Reviews: Unbiased estimates for linear regression via volume sampling

Neural Information Processing Systems

I could go either way on this paper, though I am slightly positive. The short summary is that the submission gives elegant expectation bounds with non-trivial arguments, but if one wants constant-factor approximations (or (1 + eps)-approximations), then existing algorithms are faster and read fewer labels. So it's unclear to me whether there is a solid application of the results in the paper. In more detail: On the positive side, it's great to see an unbiased estimator of the pseudoinverse by volume sampling, which by linearity gives an unbiased estimator of the least squares solution vector. I haven't seen such a statement before. It's also nice to see an unbiased estimator of the least squares loss function when exactly d samples are taken.
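
The unbiasedness statement can be checked numerically on a tiny problem. The following is a minimal sketch, not code from the paper: it assumes size-d volume sampling means picking a row subset S of size d with probability proportional to det(X_S)^2, and it enumerates every subset to compute the exact expectation rather than drawing samples. The function name `volume_sampling_expectation` and the toy data are made up for illustration.

```python
import itertools
import numpy as np

def volume_sampling_expectation(X, y):
    """Exact expectation of the subset solution X_S^{-1} y_S when a size-d
    subset S is drawn with probability proportional to det(X_S)^2."""
    n, d = X.shape
    weighted_sum = np.zeros(d)
    total_weight = 0.0
    for subset in itertools.combinations(range(n), d):
        idx = list(subset)
        X_s = X[idx]
        weight = np.linalg.det(X_s) ** 2
        if weight < 1e-12:
            # Singular subsets have zero sampling probability; skip them.
            continue
        weighted_sum += weight * np.linalg.solve(X_s, y[idx])
        total_weight += weight
    return weighted_sum / total_weight, total_weight

# Toy data (assumption: small Gaussian design so enumeration is cheap).
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))
y = rng.normal(size=7)

expected_w, total_weight = volume_sampling_expectation(X, y)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(expected_w, w_star))
```

The normalizing constant returned as `total_weight` equals det(X^T X) by the Cauchy-Binet formula, which is why the sampling distribution is well defined whenever X has full column rank.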


Reviews: Boosted Sparse and Low-Rank Tensor Regression

Neural Information Processing Systems

This paper examines the problem of tensor regression and proposes a boosted sparse low-rank model that produces interpretable results. In their low-rank tensor regression model, the unit-rank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This assumption allows for an interpretable model where the outcome is related to only a subset of features. For model estimation, the authors use a divide-and-conquer strategy to learn the sparse CP decomposition, based on an existing sequential extraction method, where sparse unit-rank problems are sequentially solved. Instead of using an alternating convex search (ACS) approach, the authors use a stage-wise unit-rank tensor factorization algorithm to learn the model.


Reviews: Sparse PCA from Sparse Linear Regression

Neural Information Processing Systems

The paper proposes an approach to reduce solving a special sparse PCA to a sparse linear regression (SLR) problem (treated as a black-box solution). It uses the spiked covariance model [17] and assumes that the number of nonzero components of the direction (u) is known, plus some technical conditions such as a restricted eigenvalue property. The authors propose algorithms for both hypothesis testing and support recovery, as well as provide theoretical performance guarantees for them. Finally, the paper argues that the approach is robust to rescaling and presents some numerical experiments comparing two variants of the method (based on SLR methods FoBa and LASSO) with two alternatives (diagonal thresholding and covariance thresholding). Strengths: - The addressed problem (sparse PCA) is interesting and important.


Reviews: Hunting for Discriminatory Proxies in Linear Regression Models

Neural Information Processing Systems

Summary: This paper describes a framework for detecting proxy variables in linear regression models. It poses the problem as two optimization problems and presents (with proofs only in the supplemental material) theorems that relate the solutions of the two optimization problems to cases of proxy existence. The paper also describes the incorporation of an exempt variable, a proxy that is deemed acceptable for use for one reason or another. The paper leverages prior work that defines a proxy in a classification framework as a variable that is associated with a sensitive attribute and causally influential on the decision of the system. The paper describes how to reformulate this definition for the case of linear regression.


Reviews: On Coresets for Logistic Regression

Neural Information Processing Systems

The goal of this paper is to speed up logistic regression using a coreset based approach. The key idea is to "compress" the data set into a small fake set of points (called coreset) and to then train on that small set. The authors first show that, in general, no sublinear size coreset can exist. Then, they provide an algorithm that provides small summaries for certain data sets that satisfy a complexity assumption. Finally, they empirically compare that algorithm to two competing methods.
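
To make the "compress, then train on the small set" idea concrete, here is a minimal sketch of a weighted subsample standing in for the full data in the logistic loss. It deliberately uses plain uniform sampling with weights n/m as a stand-in; this is not the algorithm from the paper under review, and the names `logistic_loss` and `uniform_subsample` are hypothetical.

```python
import numpy as np

def logistic_loss(X, y, beta, weights=None):
    """Weighted logistic loss sum_i w_i * log(1 + exp(-y_i * x_i^T beta)),
    with labels y_i in {-1, +1}. Unit weights give the full-data loss."""
    if weights is None:
        weights = np.ones(len(y))
    margins = y * (X @ beta)
    return float(np.sum(weights * np.logaddexp(0.0, -margins)))

def uniform_subsample(n, m, rng):
    """Pick m of n points uniformly without replacement; each kept point
    gets weight n/m so the weighted loss estimates the full loss."""
    idx = rng.choice(n, size=m, replace=False)
    weights = np.full(m, n / m)
    return idx, weights

# Toy data (assumption: synthetic Gaussian features, random labels).
rng = np.random.default_rng(0)
n, d, m = 2000, 5, 200
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
beta = rng.normal(size=d)

idx, w = uniform_subsample(n, m, rng)
full = logistic_loss(X, y, beta)
approx = logistic_loss(X[idx], y[idx], beta, w)
print(full, approx)
```

A coreset in the sense of the review must approximate the loss for every parameter vector simultaneously, which is a much stronger requirement than the single-`beta` comparison printed here.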


Reviews: Leveraged volume sampling for linear regression

Neural Information Processing Systems

This paper studies deficiencies of volume sampling, and proposes a modification based on leverage scores, or renormalizing the current ellipse before performing volume rejection sampling. It improves the number of unbiased samples required to guarantee 1 \pm \epsilon accuracy by a factor of \epsilon^{-1}, and also demonstrates the good empirical performance of its routines on datasets from LibSVM (in Supplementary materials E). Both linear regression and volume sampling are well-studied topics, and the observations made in this paper are quite surprising. The paper clearly outlines a class of matrices that are problematic for volume sampling, and then proves the properties of the revised methods. The proposed methods also exhibit significant empirical gains over other methods in the small sample size regime, which is arguably the more important case. I believe these contributions are of significant interest to the study of both randomized sampling and randomized numerical linear algebra.
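
For readers unfamiliar with leverage scores, the quantity itself is easy to compute. The sketch below shows only the standard definition (squared row norms of an orthonormal basis for the column space of X) and the resulting sampling distribution; it does not implement the paper's leveraged volume sampling or its rejection step. The function name `leverage_scores` is made up, and X is assumed to have full column rank.

```python
import numpy as np

def leverage_scores(X):
    """Leverage score of row i is the squared norm of row i of Q, where
    X = QR is a reduced QR factorization (X assumed full column rank)."""
    Q, _ = np.linalg.qr(X)  # Q has shape (n, d) in the default reduced mode
    return np.sum(Q ** 2, axis=1)

# Toy data (assumption: small Gaussian design matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

scores = leverage_scores(X)
# Scores sum to d, so dividing by d gives a probability distribution over rows.
probs = scores / X.shape[1]
print(scores.sum(), probs.sum())
```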


Reviews: Scalable Hyperparameter Transfer Learning

Neural Information Processing Systems

This paper proposes a novel Bayesian Optimization approach that is able to do transfer learning across tasks while remaining scalable. Originality: This is very original work. Bayesian Optimization can work with any probabilistic regression algorithm, so the use of Bayesian linear regression to make it more scalable is well-known, as are its limitations (e.g. it doesn't extrapolate well). The main novelty here lies in the extension to multi-task learning, which allows it to benefit from prior evaluations on previous tasks. When such evaluations are available, this can provide a significant advantage.


Reviews: Analytic solution and stationary phase approximation for the Bayesian lasso and elastic net

Neural Information Processing Systems

Summary: An approximation to the posterior distribution from a Bayesian lasso or Bayesian elastic net prior is developed. The method uses a saddle-point approximation to the partition function. It is developed by writing the posterior distribution in terms of tau = n / sigma^2 and using an approximation for large tau. The results are illustrated on three data sets: diabetes (n = 442, p = 10), leukaemia (n = 72, p = 3571) and Cancer Cell Line Encyclopedia (n = 474, p = 1000). These demonstrate some of the performance characteristics of the approximation.


Tourism destination events classifier based on artificial intelligence techniques

arXiv.org Artificial Intelligence

Identifying client needs to provide optimal services is crucial in tourist destination management. The events held in tourist destinations may help to meet those needs and thus contribute to tourist satisfaction. As with product management, the creation of hierarchical catalogs to classify those events can aid event management. The events that can be found on the internet are listed in dispersed, heterogeneous sources, which makes direct classification a difficult, time-consuming task. The main aim of this work is to create a novel process for automatically classifying an eclectic variety of tourist events using a hierarchical taxonomy, which can be applied to support tourist destination management. Leveraging data science methods such as CRISP-DM, supervised machine learning, and natural language processing techniques, the automatic classification process proposed here allows the creation of a normalized catalog across very different geographical regions. Therefore, we can build catalogs with consistent filters, allowing users to find events regardless of the event categories assigned at source, if any. This is very valuable for companies that offer this kind of information across multiple regions, such as airlines, travel agencies or hotel chains. Ultimately, this tool has the potential to revolutionize the way companies and end users interact with tourist events information.


Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

arXiv.org Artificial Intelligence

As machine learning models continue to advance swiftly, calibrating their performance has become a major concern prior to practical and widespread implementation. Existing calibration methods often negatively impact model accuracy due to the lack of diversity in validation data, resulting in reduced generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive the expected calibration error (ECE) bound using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, are utilized as a synthetic data generation strategy to lower the ECE bound and improve model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed average improvements of up to a 34% increase in accuracy and a 33% decrease in ECE.
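
For reference, the ECE metric mentioned in the abstract is commonly computed with equal-width confidence bins. The sketch below shows that standard binned estimator under the assumption of 10 bins and top-label confidences; it is not taken from the paper, whose exact binning and evaluation setup are not given here. The function name `expected_calibration_error` is illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average over bins of |accuracy - mean confidence|,
    with weights equal to the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a right-closed bin (lo, hi]; indices lie in 0..n_bins-1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Five predictions made with confidence 0.9, of which four are correct:
# accuracy is 0.8, so the calibration gap in that bin is 0.1.
ece_example = expected_calibration_error([0.9] * 5, [1, 1, 1, 1, 0])
print(ece_example)
```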