Goto

Collaborating Authors

 Regression


Improved identification accuracy in equation learning via comprehensive $\boldsymbol{R^2}$-elimination and Bayesian model selection

arXiv.org Machine Learning

In the field of equation learning, exhaustively considering all possible equations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of multicollinearity poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude terms of the true equation, leading to reduced identification accuracy. In this article, we present an approach that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(\boldsymbol y|\mathcal M)$, in a novel way. Our procedure is characterized by a comprehensive search with just a minor reduction of the model space at each iteration step. With two flavors of our approach and the adoption of $p(\boldsymbol y|\mathcal M)$ for bi-directional stepwise regression, we present a total of three new avenues for equation learning. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our approach against four state-of-the-art methods and two standard approaches. The results demonstrate that our comprehensive search approach surpasses all other methods in terms of identification accuracy. In particular, the second flavor of our approach establishes an efficient overfitting penalty solely based on $R^2$, which achieves highest rates of exact equation recovery.


Anchor Data Augmentation

arXiv.org Machine Learning

We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality and extends the recently proposed Anchor regression (AR) method for data augmentation, which is in contrast to the current state-of-the-art domain-agnostic solutions that rely on the Mixup literature. Our Anchor Data Augmentation (ADA) uses several replicas of the modified samples in AR to provide more training examples, leading to more robust regression predictions. We apply ADA to linear and nonlinear regression problems using neural networks. ADA is competitive with state-of-the-art C-Mixup solutions.


Constrained Optimization of Rank-One Functions with Indicator Variables

arXiv.org Machine Learning

Optimization problems involving minimization of a rank-one convex function over constraints modeling restrictions on the support of the decision variables emerge in various machine learning applications. These problems are often modeled with indicator variables for identifying the support of the continuous variables. In this paper we investigate compact extended formulations for such problems through perspective reformulation techniques. In contrast to the majority of previous work that relies on support function arguments and disjunctive programming techniques to provide convex hull results, we propose a constructive approach that exploits a hidden conic structure induced by perspective functions. To this end, we first establish a convex hull result for a general conic mixed-binary set in which each conic constraint involves a linear function of independent continuous variables and a set of binary variables. We then demonstrate that extended representations of sets associated with epigraphs of rank-one convex functions over constraints modeling indicator relations naturally admit such a conic representation. This enables us to systematically give perspective formulations for the convex hull descriptions of these sets with nonlinear separable or non-separable objective functions, sign constraints on continuous variables, and combinatorial constraints on indicator variables. We illustrate the efficacy of our results on sparse nonnegative logistic regression problems.


Dimensionality Reduction and Wasserstein Stability for Kernel Regression

arXiv.org Machine Learning

In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable with kernel regression. In order to analyze the resulting regression errors, a novel stability result for kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit the regression function. We apply the general stability result to principal component analysis (PCA). Exploiting known estimates from the literature on both principal component analysis and kernel regression, we deduce convergence rates for the two-step procedure. The latter turns out to be particularly useful in a semi-supervised setting.


Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

arXiv.org Artificial Intelligence

There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question-answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well-studied in prior works, where the first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem. In contrast to previous works, our first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the loss function for the Hessian matrix is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.


AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

arXiv.org Artificial Intelligence

Predicting opinion trends on a range of social issues, from climate change to gay marriage, is crucial for making informed decisions, tracking social changes, and understanding the dynamics of opinion formation (Brooks and Manza, 2006; Burstein, 2003). Recently, numerous breakthroughs have been made to infer and predict people's opinions and preferences from their written records, such as books in the past (e.g., Google Ngram), internet search patterns (e.g., Google Trend), and public sentiments in social media (e.g., Twitter, Facebook, YouTube) (Beauchamp, 2017; Grimmer et al., 2022; Moore et al., 2019; O'Connor et al., 2010; Stephens-Davidowitz, 2017). However, using digital trace data for predicting public opinion presents a substantial challenge, as these "proxy" measures cannot be deemed reliable without validating them against other "ground truth" benchmarks, like surveys (Beauchamp, 2017; Ferraro and Farmer, 1999). Even if digital trace data can closely track public opinion trends, its unobtrusive and anonymous nature prompts questions about its ability to truly represent the diverse voices of the population, particularly considering the skewed representation of demographic groups in digital traces (Cesare et al., 2018). The reliance on digital trace data, despite covering a broad spectrum of opinions, makes it hard to evenly represent the real voice of the entire population.


Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions

arXiv.org Artificial Intelligence

We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions, as well as approximations with tree tensor networks and more general tensor networks. For tensor train decomposition, we give a bicriteria $(1 + \eps)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \nnz(A))$ running time, up to lower order terms, which improves over the additive error algorithm of \cite{huber2017randomized}. We also show how to convert the algorithm of \cite{huber2017randomized} into a relative error algorithm, but their algorithm necessarily has a running time of $O(qr^2 \cdot \nnz(A)) + n \cdot \poly(qk/\eps)$ when converted to a $(1 + \eps)$-approximation algorithm with bicriteria rank $r$. To the best of our knowledge, our work is the first to achieve polynomial time relative error approximation for tensor train decomposition. Our key technique is a method for obtaining subspace embeddings with a number of rows polynomial in $q$ for a matrix which is the flattening of a tensor train of $q$ tensors. We extend our algorithm to tree tensor networks. In addition, we extend our algorithm to tensor networks with arbitrary graphs (which we refer to as general tensor networks), by using a result of \cite{ms08_simulating_quantum_tensor_contraction} and showing that a general tensor network of rank $k$ can be contracted to a binary tree network of rank $k^{O(\deg(G)\tw(G))}$, allowing us to reduce to the case of tree tensor networks. Finally, we give new fixed-parameter tractable algorithms for the tensor train, Tucker, and CP decompositions, which are simpler than those of \cite{swz19_tensor_low_rank} since they do not make use of polynomial system solvers. Our technique of Gaussian subspace embeddings with exactly $k$ rows (and thus exponentially small success probability) may be of independent interest.


Modelling wildland fire burn severity in California using a spatial Super Learner approach

arXiv.org Artificial Intelligence

Given the increasing prevalence of wildland fires in the Western US, there is a critical need to develop tools to understand and accurately predict burn severity. We develop a machine learning model to predict post-fire burn severity using pre-fire remotely sensed data. Hydrological, ecological, and topographical variables collected from four regions of California - the sites of the Kincade fire (2019), the CZU Lightning Complex fire (2020), the Windy fire (2021), and the KNP Fire (2021) - are used as predictors of the difference normalized burn ratio. We hypothesize that a Super Learner (SL) algorithm that accounts for spatial autocorrelation using Vecchia's Gaussian approximation will accurately model burn severity. In all combinations of test and training sets explored, the results of our model showed the SL algorithm outperformed standard Linear Regression methods. After fitting and verifying the performance of the SL model, we use interpretable machine learning tools to determine the main drivers of severe burn damage, including greenness, elevation and fire weather variables. These findings provide actionable insights that enable communities to strategize interventions, such as early fire detection systems, pre-fire season vegetation clearing activities, and resource allocation during emergency responses. When implemented, this model has the potential to minimize the loss of human life, property, resources, and ecosystems in California.


Environment Invariant Linear Least Squares

arXiv.org Machine Learning

This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we further show that the $\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.


Disruption Prediction in Fusion Devices through Feature Extraction and Logistic Regression

arXiv.org Artificial Intelligence

This document describes an approach used in the Multi-Machine Disruption Prediction Challenge for Fusion Energy by ITU, a data science competition which ran from September to November 2023, on the online platform Zindi. The competition involved data from three fusion devices - C-Mod, HL-2A, and J-TEXT - with most of the training data coming from the last two, and the test data coming from the first one. Each device has multiple diagnostics and signals, and it turns out that a critical issue in this competition was to identify which signals, and especially which features from those signals, were most relevant to achieve accurate predictions. The approach described here is based on extracting features from signals, and then applying logistic regression on top of those features. Each signal is treated as a separate predictor and, in the end, a combination of such predictors achieved the first place on the leaderboard.