Regression
Efficient adjustment for complex covariates: Gaining efficiency with DOPE
Christgau, Alexander Mangulad, Hansen, Niels Richard
Covariate adjustment is a ubiquitous method used to estimate the average treatment effect (ATE) from observational data. Assuming a known graphical structure of the data generating model, recent results give graphical criteria for optimal adjustment, which enables efficient estimation of the ATE. However, graphical approaches are challenging for high-dimensional and complex data, and it is not straightforward to specify a meaningful graphical model of non-Euclidean data such as texts. We propose an general framework that accommodates adjustment for any subset of information expressed by the covariates. We generalize prior works and leverage these results to identify the optimal covariate information for efficient adjustment. This information is minimally sufficient for prediction of the outcome conditionally on treatment. Based on our theoretical results, we propose the Debiased Outcome-adapted Propensity Estimator (DOPE) for efficient estimation of the ATE, and we provide asymptotic results for the DOPE under general conditions. Compared to the augmented inverse propensity weighted (AIPW) estimator, the DOPE can retain its efficiency even when the covariates are highly predictive of treatment. We illustrate this with a single-index model, and with an implementation of the DOPE based on neural networks, we demonstrate its performance on simulated and real data. Our results show that the DOPE provides an efficient and robust methodology for ATE estimation in various observational settings.
Vehicle-group-based Crash Risk Formation and Propagation Analysis for Expressways
Zhu, Tianheng, Wang, Ling, Feng, Yiheng, Ma, Wanjing, Abdel-Aty, Mohamed
Previous studies in predicting crash risk primarily associated the number or likelihood of crashes on a road segment with traffic parameters or geometric characteristics of the segment, usually neglecting the impact of vehicles' continuous movement and interactions with nearby vehicles. Advancements in communication technologies have empowered driving information collected from surrounding vehicles, enabling the study of group-based crash risks. Based on high-resolution vehicle trajectory data, this research focused on vehicle groups as the subject of analysis and explored risk formation and propagation mechanisms considering features of vehicle groups and road segments. Several key factors contributing to crash risks were identified, including past high-risk vehicle-group states, complex vehicle behaviors, high percentage of large vehicles, frequent lane changes within a vehicle group, and specific road geometries. A multinomial logistic regression model was developed to analyze the spatial risk propagation patterns, which were classified based on the trend of high-risk occurrences within vehicle groups. The results indicated that extended periods of high-risk states, increase in vehicle-group size, and frequent lane changes are associated with adverse risk propagation patterns. Conversely, smoother traffic flow and high initial crash risk values are linked to risk dissipation. Furthermore, the study conducted sensitivity analysis on different types of classifiers, prediction time intervalsss and adaptive TTC thresholds. The highest AUC value for vehicle-group risk prediction surpassed 0.93. The findings provide valuable insights to researchers and practitioners in understanding and prediction of vehicle-group safety, ultimately improving active traffic safety management and operations of Connected and Autonomous Vehicles.
Coresets for Relational Data and The Applications Jiaxiang Chen 1 Qingyuan Yang Ruomin Huang
A coreset is a small set that can approximately preserve the structure of the original input data set. Therefore we can run our algorithm on a coreset so as to reduce the total computational complexity. Conventional coreset techniques assume that the input data set is available to process explicitly. However, this assumption may not hold in real-world scenarios. In this paper, we consider the problem of coresets construction over relational data. Namely, the data is decoupled into several relational tables, and it could be very expensive to directly materialize the data matrix by joining the tables. We propose a novel approach called "aggregation tree with pseudo-cube" that can build a coreset from bottom to up. Moreover, our approach can neatly circumvent several troublesome issues of relational learning problems [Khamis et al., PODS 2019]. Under some mild assumptions, we show that our coreset approach can be applied for the machine learning tasks, such as clustering, logistic regression and SVM.
Adaptive Split Balancing for Optimal Random Forest
Zhang, Yuqian, Ji, Weijie, Bradic, Jelena
While random forests are commonly used for regression problems, existing methods often lack adaptability in complex situations or lose optimality under simple, smooth scenarios. In this study, we introduce the adaptive split balancing forest (ASBF), capable of learning tree representations from data while simultaneously achieving minimax optimality under the Lipschitz class. To exploit higher-order smoothness levels, we further propose a localized version that attains the minimax rate under the H\"older class $\mathcal{H}^{q,\beta}$ for any $q\in\mathbb{N}$ and $\beta\in(0,1]$. Rather than relying on the widely-used random feature selection, we consider a balanced modification to existing approaches. Our results indicate that an over-reliance on auxiliary randomness may compromise the approximation power of tree models, leading to suboptimal results. Conversely, a less random, more balanced approach demonstrates optimality. Additionally, we establish uniform upper bounds and explore the application of random forests in average treatment effect estimation problems. Through simulation studies and real-data applications, we demonstrate the superior empirical performance of the proposed methods over existing random forests.
Model Complexity, Goodness of Fit and Diminishing Returns
We consider modeling the data set D using models indexed by a complexity index k, 1:::; k:::; kmax • For example, the models could be finite mixture probability density functions (PDFs) for vector Xi'S where model complexity is indexed by the number of components k in the mixture. Alternatively, the modeling task could be to fit a conditional regression model y g(Zk) e, where now y is one of the variables in the vector X and Z is some subset of size k of the remaining components in the X vector. Such learning tasks can typically be characterized by the existence of a model and a loss function. A fitted model of complexity k is a function of the data points D and depends on a specific set of fitted parameters B. The loss function (goodness(cid:173) of-fit) is a functional of the model and maps each specific model to a scalar used to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for regression. Figure 1 illustrates a typical empirical curve for loss function versus complexity, for mixtures of Markov models fitted to a large data set of 900,000 sequences. The complexity k is the number of Markov models being used in the mixture (see Cadez et al. (2000) for further details on the model and the data set). The empirical curve has a distinctly concave appearance, with large relative gains in fit for low complexity models and much more modest relative gains for high complexity models.
High-Dimensional Graphical Model Selection Using \ell_1 -Regularized Logistic Regression
We focus on the problem of estimating the graph structure associated with a discrete Markov random field. We describe a method based on 1- regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an1-constraint. Our framework applies to the high-dimensional setting, in which both the number of nodes p and maximum neighborhood sizes d are allowed to grow as a function of the number of observations n. Our main result is to estab- lish sufficient conditions on the triple (n, p, d) for the method to succeed in consistently estimating the neighborhood of every node in the graph simul- taneously. Under certain mutual incoherence conditions analogous to those imposed in previous work on linear regression, we prove that consistent neighborhood selection can be obtained as long as the number of observa- tions n grows more quickly than 6d6 log d 2d5 log p, thereby establishing that logarithmic growth in the number of samples n relative to graph size p is sufficient to achieve neighborhood consistency.
Phase transitions for high-dimensional joint support recovery
We consider the following instance of transfer learning: given a pair of regression problems, suppose that the regression coefficients share a partially common support, parameterized by the overlap fraction \overlap between the two supports. This set-up suggests the use of 1, \infty -regularized linear regression for recovering the support sets of both regression vectors. Our main contribution is to provide a sharp characterization of the sample complexity of this 1,\infty relaxation, exactly pinning down the minimal sample size n required for joint support recovery as a function of the model dimension \pdim, support size \spindex and overlap \overlap \in [0,1] . For measurement matrices drawn from standard Gaussian ensembles, we prove that the joint 1,\infty -regularized method undergoes a phase transition characterized by order parameter \orpar( umobs, \pdim, \spindex, \overlap) umobs{(4 - 3 \overlap) s \log(p-(2-\overlap)s)} . More precisely, the probability of successfully recovering both supports converges to 1 for scalings such that \orpar 1, and converges to 0 to scalings for which \orpar 1 .
Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
This paper uses information-theoretic techniques to determine minimax rates for estimating nonparametric sparse additive regression models under high-dimensional scaling. The first term reflects the difficulty of performing \emph{subset selection} and is independent of the Hilbert space \Hilb; the second term \LowerRateSq is an \emph{\s-dimensional estimation} term, depending only on the low dimension \s but not the ambient dimension \pdim, that captures the difficulty of estimating a sum of \s univariate functions in the Hilbert space \Hilb . The minimax rates are compared with rates achieved by an \ell_1 -penalty based approach, it can be shown that a certain \ell_1 -based approach achieves the minimax optimal rate.
A Family of Penalty Functions for Structured Sparsity
We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the \ell_1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions.
Sparse Features for PCA-Like Linear Regression
Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X \in \mathbb{R} {n \times d}, whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called \textit{eigenfeatures}), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a \textit{small} number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while \emph{provably} achieving in-sample performance comparable to regularized linear regression.