Goto

Collaborating Authors

 Regression


PCM Selector: Penalized Covariate-Mediator Selection Operator for Evaluating Linear Causal Effects

arXiv.org Artificial Intelligence

For a data-generating process for random variables that can be described with a linear structural equation model, we consider a situation in which (i) a set of covariates satisfying the back-door criterion cannot be observed or (ii) such a set can be observed, but standard statistical estimation methods cannot be applied to estimate causal effects because of multicollinearity/high-dimensional data problems. We propose a novel two-stage penalized regression approach, the penalized covariate-mediator selection operator (PCM Selector), to estimate the causal effects in such scenarios. Unlike existing penalized regression analyses, when a set of intermediate variables is available, PCM Selector provides a consistent or less biased estimator of the causal effect. In addition, PCM Selector provides a variable selection procedure for intermediate variables to obtain better estimation accuracy of the causal effects than does the back-door criterion.


HNCI: High-Dimensional Network Causal Inference

arXiv.org Machine Learning

The problem of evaluating the effectiveness of a treatment or policy commonly appears in causal inference applications under network interference. In this paper, we suggest the new method of high-dimensional network causal inference (HNCI) that provides both valid confidence interval on the average direct treatment effect on the treated (ADET) and valid confidence set for the neighborhood size for interference effect. We exploit the model setting in Belloni et al. (2022) and allow certain type of heterogeneity in node interference neighborhood sizes. We propose a linear regression formulation of potential outcomes, where the regression coefficients correspond to the underlying true interference function values of nodes and exhibit a latent homogeneous structure. Such a formulation allows us to leverage existing literature from linear regression and homogeneity pursuit to conduct valid statistical inferences with theoretical guarantees. The resulting confidence intervals for the ADET are formally justified through asymptotic normalities with estimable variances. We further provide the confidence set for the neighborhood size with theoretical guarantees exploiting the repro samples approach. The practical utilities of the newly suggested methods are demonstrated through simulation and real data examples.


Fr\'echet regression for multi-label feature selection with implicit regularization

arXiv.org Machine Learning

Fréchet regression, an extension of classical linear regression to general metric spaces, offers a robust framework for modeling complex relationships between variables when the responses lie outside of Euclidean spaces. This approach is especially well suited to high-dimensional datasets, such as vector representations, with particular relevance to fields like imaging, where capturing nonlinear dependencies and the intrinsic data structure is critical for accurate modeling (Fréchet (1948), Petersen and Müller (2019), Bhattacharjee and Müller (2023), Qiu, Yu and Zhu (2024)). A significant consideration in Fréchet regression arises when predicting multiple responses simultaneously, as seen in multi-target or multidimensional problems (Zhang and Zhou (2007), Hyvönen, Jääsaari and Roos (2024)). Unlike traditional regression, where each observation corresponds to a single response, Fréchet regression can be extended to model complex interactions between multiple outputs. This ability to address complex relationships between several responses opens new avenues, particularly in fields such as bioinformatics (Huang et al. (2005)) and image analysis (Lathuilière et al. (2019)), where multidimensional data and interdependencies between responses require adaptive and specialized methodologies. However, to date, the handling of multilabel scenarios within the context of Fréchet regression remains relatively unexplored in the literature, despite its potential significance in addressing complex, multidimensional applications. In this paper, we present an extension of the Global Fréchet regression model, a specific variant of Fréchet regression that generalizes classical multiple linear regression by modeling responses as random objects. This extension enables the explicit modeling of relationships between input variables and multiple responses, thereby addressing the multi-label setting. Our second contribution in this paper addresses the dimensionality challenge in the context of the proposed Fréchet regression extension.


Bivariate Matrix-valued Linear Regression (BMLR): Finite-sample performance under Identifiability and Sparsity Assumptions

arXiv.org Machine Learning

This study explores the estimation of parameters in a matrix-valued linear regression model, where the $T$ responses $(Y_t)_{t=1}^T \in \mathbb{R}^{n \times p}$ and predictors $(X_t)_{t=1}^T \in \mathbb{R}^{m \times q}$ satisfy the relationship $Y_t = A^* X_t B^* + E_t$ for all $t = 1, \ldots, T$. In this model, $A^* \in \mathbb{R}_+^{n \times m}$ has $L_1$-normalized rows, $B^* \in \mathbb{R}^{q \times p}$, and $(E_t)_{t=1}^T$ are independent noise matrices following a matrix Gaussian distribution. The primary objective is to estimate the unknown parameters $A^*$ and $B^*$ efficiently. We propose explicit optimization-free estimators and establish non-asymptotic convergence rates to quantify their performance. Additionally, we extend our analysis to scenarios where $A^*$ and $B^*$ exhibit sparse structures. To support our theoretical findings, we conduct numerical simulations that confirm the behavior of the estimators, particularly with respect to the impact of the dimensions $n, m, p, q$, and the sample size $T$ on finite-sample performances. We complete the simulations by investigating the denoising performances of our estimators on noisy real-world images.


Heterogeneous transfer learning for high dimensional regression with feature mismatch

arXiv.org Machine Learning

We consider the problem of transferring knowledge from a source, or proxy, domain to a new target domain for learning a high-dimensional regression model with possibly different features. Recently, the statistical properties of homogeneous transfer learning have been investigated. However, most homogeneous transfer and multi-task learning methods assume that the target and proxy domains have the same feature space, limiting their practical applicability. In applications, target and proxy feature spaces are frequently inherently different, for example, due to the inability to measure some variables in the target data-poor environments. Conversely, existing heterogeneous transfer learning methods do not provide statistical error guarantees, limiting their utility for scientific discovery. We propose a two-stage method that involves learning the relationship between the missing and observed features through a projection step in the proxy data and then solving a joint penalized regression optimization problem in the target data. We develop an upper bound on the method's parameter estimation risk and prediction risk, assuming that the proxy and the target domain parameters are sparsely different. Our results elucidate how estimation and prediction error depend on the complexity of the model, sample size, the extent of overlap, and correlation between matched and mismatched features.


Learning Randomized Reductions and Program Properties

arXiv.org Artificial Intelligence

The correctness of computations remains a significant challenge in computer science, with traditional approaches relying on automated testing or formal verification. Self-testing/correcting programs introduce an alternative paradigm, allowing a program to verify and correct its own outputs via randomized reductions, a concept that previously required manual derivation. In this paper, we present Bitween, a method and tool for automated learning of randomized (self)-reductions and program properties in numerical programs. Bitween combines symbolic analysis and machine learning, with a surprising finding: polynomial-time linear regression, a basic optimization method, is not only sufficient but also highly effective for deriving complex randomized self-reductions and program invariants, often outperforming sophisticated mixed-integer linear programming solvers. We establish a theoretical framework for learning these reductions and introduce RSR-Bench, a benchmark suite for evaluating Bitween's capabilities on scientific and machine learning functions. Our empirical results show that Bitween surpasses state-of-the-art tools in scalability, stability, and sample efficiency when evaluated on nonlinear invariant benchmarks like NLA-DigBench. Bitween is open-source as a Python package and accessible via a web interface that supports C language programs.


Learning from Summarized Data: Gaussian Process Regression with Sample Quasi-Likelihood

arXiv.org Machine Learning

Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using non-Gaussian likelihoods. To deal with various tasks in spatial modeling, we benefit from this development. Difficulties still arise when we can only access summarized data consisting of representative features, summary statistics, and data point counts. Such situations frequently occur primarily due to concerns about confidentiality and management costs associated with spatial data. This study tackles learning and inference using only summarized data within the framework of Gaussian process regression. To address this challenge, we analyze the approximation errors in the marginal likelihood and posterior distribution that arise from utilizing representative features. We also introduce the concept of sample quasi-likelihood, which facilitates learning and inference using only summarized data. Non-Gaussian likelihoods satisfying certain assumptions can be captured by specifying a variance function that characterizes a sample quasi-likelihood function. Theoretical and experimental results demonstrate that the approximation performance is influenced by the granularity of summarized data relative to the length scale of covariance functions. Experiments on a real-world dataset highlight the practicality of our method for spatial modeling.


An Instrumental Value for Data Production and its Application to Data Pricing

arXiv.org Artificial Intelligence

How much value does a dataset or a data production process have to an agent who wishes to use the data to assist decision-making? This is a fundamental question towards understanding the value of data as well as further pricing of data. This paper develops an approach for capturing the instrumental value of data production processes, which takes two key factors into account: (a) the context of the agent's decision-making problem; (b) prior data or information the agent already possesses. We ''micro-found'' our valuation concepts by showing how they connect to classic notions of information design and signals in information economics. When instantiated in the domain of Bayesian linear regression, our value naturally corresponds to information gain. Based on our designed data value, we then study a basic monopoly pricing setting with a buyer looking to purchase from a seller some labeled data of a certain feature direction in order to improve a Bayesian regression model. We show that when the seller has the ability to fully customize any data request, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If the seller can only sell data that are derived from an existing data pool, this limits her ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log (\kappa)$ less than the first-best revenue, where $\kappa$ is the condition number associated with the data matrix. A corollary of this result is that the seller can extract the first-best revenue in the multi-armed bandits special case.


An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

arXiv.org Artificial Intelligence

This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.


A Semi-supervised CART Model for Covariate Shift

arXiv.org Artificial Intelligence

Machine learning models used in medical applications often face challenges due to the covariate shift, which occurs when there are discrepancies between the distributions of training and target data. This can lead to decreased predictive accuracy, especially with unknown outcomes in the target data. This paper introduces a semi-supervised classification and regression tree (CART) that uses importance weighting to address these distribution discrepancies. Our method improves the predictive performance of the CART model by assigning greater weights to training samples that more accurately represent the target distribution, especially in cases of covariate shift without target outcomes. In addition to CART, we extend this weighted approach to generalized linear model trees and tree ensembles, creating a versatile framework for managing the covariate shift in complex datasets. Through simulation studies and applications to real-world medical data, we demonstrate significant improvements in predictive accuracy. These findings suggest that our weighted approach can enhance reliability in medical applications and other fields where the covariate shift poses challenges to model performance across various data distributions.