Collaborating Authors: Li, Runze


NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

arXiv.org Artificial Intelligence

Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) achieve the best accuracy on this task, high-quality datasets and models for analytical reasoning remain scarce. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning expressed in language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
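The headline result is a 14% reduction in average distance error, i.e. the mean geodesic distance between predicted and true coordinates. As a point of reference only (not the authors' evaluation code), a minimal sketch of how such a metric is commonly computed with the haversine formula:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def average_distance_error(predictions, ground_truth):
    """Mean geodesic error over paired (lat, lon) predictions and labels."""
    errors = [haversine_km(p[0], p[1], g[0], g[1])
              for p, g in zip(predictions, ground_truth)]
    return sum(errors) / len(errors)
```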


Post-hoc Interpretability Illumination for Scientific Interaction Discovery

arXiv.org Machine Learning

Model interpretability and explainability have garnered substantial attention in recent years, particularly in decision-making applications. However, existing interpretability tools often fall short in delivering satisfactory performance due to limited capabilities or efficiency issues. To address these challenges, we propose a novel post-hoc method: Iterative Kings' Forests (iKF), designed to uncover complex multi-order interactions among variables. iKF iteratively selects the next most important variable, the "King", and constructs King's Forests by placing it at the root node of each tree to identify variables that interact with the "King". It then generates ranked short lists of important variables and interactions of varying orders. Additionally, iKF provides inference metrics to analyze the patterns of the selected interactions and classify them into one of three interaction types: Accompanied Interaction, Synergistic Interaction, and Hierarchical Interaction. Extensive experiments demonstrate the strong interpretive power of our proposed iKF, highlighting its great potential for explainable modeling and scientific discovery across diverse scientific fields.
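The abstract describes the iteration concretely enough to sketch its control flow. The following is a hypothetical Python outline of that loop; the helper functions fit_forest_rooted_at, variable_importance, and interaction_candidates are placeholders for illustration, not the paper's implementation.

```python
def iterative_kings_forests(X, y, n_iterations, fit_forest_rooted_at,
                            variable_importance, interaction_candidates):
    """Hypothetical outline of the iKF loop described in the abstract.

    fit_forest_rooted_at(X, y, king): grows a forest whose trees all split
        on variable `king` at the root (placeholder for a King's Forest).
    variable_importance(X, y, j): importance score of variable j.
    interaction_candidates(forest, king): variables that co-occur with the
        King along the trees' paths, i.e. candidate interaction partners.
    """
    kings, interactions = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(n_iterations):
        # 1. Select the next most important remaining variable as the "King".
        king = max(remaining, key=lambda j: variable_importance(X, y, j))
        kings.append(king)
        remaining.remove(king)
        # 2. Build a King's Forest with the King fixed at every root node.
        forest = fit_forest_rooted_at(X, y, king)
        # 3. Record variables that appear to interact with the King.
        interactions.append((king, interaction_candidates(forest, king)))
    return kings, interactions
```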


Hypothesis Testing for High-Dimensional Matrix-Valued Data

arXiv.org Machine Learning

This paper addresses hypothesis testing for the mean of matrix-valued data in high-dimensional settings. We investigate the minimum discrepancy test, originally proposed by Cragg (1997), which serves as a rank test for lower-dimensional matrices. We evaluate the performance of this test as the matrix dimensions increase proportionally with the sample size, and identify its limitations when matrix dimensions significantly exceed the sample size. To address these challenges, we propose a new test statistic tailored for high-dimensional matrix rank testing. The oracle version of this statistic is analyzed to highlight its theoretical properties. Additionally, we develop a novel approach for constructing a sparse singular value decomposition (SVD) estimator for singular vectors, providing a comprehensive examination of its theoretical aspects. Using the sparse SVD estimator, we explore the properties of the sample version of our proposed statistic. The paper concludes with simulation studies and two case studies involving surveillance video data, demonstrating the practical utility of our proposed methods.
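For intuition about the sparse SVD ingredient, here is a generic illustration of sparsifying singular vectors by entrywise soft-thresholding; the estimator analyzed in the paper is constructed and tuned differently, so treat this purely as a rough sketch.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_svd_sketch(M, rank, tau_u, tau_v):
    """Illustrative sparse SVD: threshold the leading singular vectors of M.

    This is a generic sparsification heuristic, not the estimator whose
    theoretical properties are developed in the paper.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_s = soft_threshold(U[:, :rank], tau_u)
    V_s = soft_threshold(Vt[:rank].T, tau_v)
    # Renormalize nonzero columns so each singular vector has unit norm.
    for A in (U_s, V_s):
        norms = np.linalg.norm(A, axis=0)
        A[:, norms > 0] /= norms[norms > 0]
    return U_s, s[:rank], V_s
```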


Statistical Convergence Rates of Optimal Transport Map Estimation between General Distributions

arXiv.org Machine Learning

This paper studies the convergence rates of optimal transport (OT) map estimators, a topic of growing interest in statistics, machine learning, and various scientific fields. Despite recent advancements, existing results rely on regularity assumptions that are very restrictive in practice and much stricter than those in Brenier's Theorem, including the compactness and convexity of the probability support and the bi-Lipschitz property of the OT maps. We aim to broaden the scope of OT map estimation and fill this gap between theory and practice. Under a strong convexity assumption on Brenier's potential, we first establish non-asymptotic convergence rates for the original plug-in estimator without requiring restrictive assumptions on the probability measures. Additionally, we introduce a sieve plug-in estimator and establish its convergence rates without the strong convexity assumption on Brenier's potential, enabling widely used cases such as the rank functions of normal or t-distributions. We also establish new Poincaré-type inequalities, proved under sufficient conditions on the local boundedness of the probability density and mild topological conditions on the support; these new inequalities enable faster convergence rates for the Donsker function class. Moreover, we develop scalable algorithms to solve the OT map estimation efficiently using neural networks and present numerical experiments to demonstrate their effectiveness and robustness.
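For readers new to OT map estimation, a minimal generic sketch of the plug-in idea is given below: compute an entropic transport plan between empirical samples and take its barycentric projection as the estimated map. This illustrates only the notion of estimating a map from samples; it is not the specific plug-in or sieve estimator, nor the neural-network algorithm, studied in the paper.

```python
import numpy as np

def sinkhorn_plan(X, Y, eps=0.05, n_iter=500):
    """Entropic OT plan between the empirical measures on samples X and Y."""
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    K = np.exp(-C / eps)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):  # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan with marginals (a, b)

def barycentric_map(Y, plan):
    """Plug-in map estimate: each source point is sent to the plan-weighted
    average of the target points (the barycentric projection)."""
    return (plan @ Y) / plan.sum(axis=1, keepdims=True)
```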


The Effect of Personalization in FedProx: A Fine-grained Analysis on Statistical Accuracy and Communication Efficiency

arXiv.org Machine Learning

FedProx is a simple yet effective federated learning method that enables model personalization via regularization. Despite remarkable success in practice, a rigorous analysis of how such regularization provably improves the statistical accuracy of each client's local model has not been fully established. Setting the regularization strength heuristically presents a risk, as an inappropriate choice may even degrade accuracy. This work fills in the gap by analyzing the effect of regularization on statistical accuracy, thereby providing a theoretical guideline for setting the regularization strength to achieve personalization. We prove that by adaptively choosing the regularization strength under different levels of statistical heterogeneity, FedProx can consistently outperform pure local training and achieve a minimax-optimal statistical rate. In addition, to shed light on resource allocation, we design an algorithm and provably show that stronger personalization reduces communication complexity without increasing the computation overhead. Finally, our theory is validated on both synthetic and real-world datasets, and its generalizability is verified in a non-convex setting.

Federated Learning (FL) has emerged as an attractive framework for aggregating distributed data, enabling clients to collaboratively train a shared global model while preserving data privacy. In the currently prevalent paradigm (McMahan et al., 2017), FL is formulated as a finite-sum minimization problem focusing on a single shared model. Nevertheless, it has been well recognized that one of the key challenges in FL is the statistical heterogeneity of the client datasets. Since participants collect their own local data, the data often reflect client-specific characteristics and are not identically distributed. With high statistical heterogeneity, training a single model for all clients by minimizing their average in-sample loss becomes questionable. To address this challenge, one solution is to relax the common-model constraint and instead solve the following FedProx objective (Li et al., 2020a):
$$\min_{\{w_k\}_{k=1}^{m},\, \bar{w}} \;\sum_{k=1}^{m} p_k \left( F_k(w_k) + \frac{\lambda}{2}\, \lVert w_k - \bar{w} \rVert^2 \right),$$
where $F_k$ is the local empirical loss of client $k$, $w_k$ its personalized local model, $\bar{w}$ a shared reference model, and $p_k$ the aggregation weights. The smaller $\lambda$ is, the weaker the coupling among the local models enforced by the formulation, and hence the higher the degree of personalization.
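Reading off the objective above, each client's personalized model solves a proximally regularized local problem. Below is a minimal illustrative sketch of that local step using plain gradient descent; the function names and hyperparameters are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def fedprox_local_update(w_local, w_global, grad_local_loss, lam,
                         lr=0.1, n_steps=100):
    """Sketch of one client's update for F_k(w_k) + (lam/2) * ||w_k - w_bar||^2.

    grad_local_loss(w): gradient of the client's empirical loss F_k at w.
    lam: regularization strength; a smaller lam means weaker coupling to the
         shared model w_global, i.e. more personalization.
    """
    w = w_local.copy()
    for _ in range(n_steps):
        g = grad_local_loss(w) + lam * (w - w_global)  # loss + proximal term
        w -= lr * g
    return w
```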


Conditional score-based diffusion models for solving inverse problems in mechanics

arXiv.org Machine Learning

We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
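To make the sampling stage concrete, here is a minimal sketch of unadjusted Langevin dynamics driven by a learned conditional score; the score_net(x, y) interface and the constant step size are illustrative assumptions rather than the exact scheme used in the paper.

```python
import numpy as np

def langevin_posterior_sampler(score_net, measurement, x0, step_size=1e-4,
                               n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics targeting the posterior p(x | measurement).

    score_net(x, y): learned approximation of grad_x log p(x | y).
    x0: initial guess for the unknown field (e.g. a material property map).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        score = score_net(x, measurement)
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score + np.sqrt(2.0 * step_size) * noise
    return x
```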


TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression

arXiv.org Machine Learning

The main challenge that sets transfer learning apart from traditional supervised learning is distribution shift, which manifests both as a shift between the source and target models and as a shift between the marginal covariate distributions. In this work, we tackle model shifts in the presence of covariate shifts in the high-dimensional regression setting. Specifically, we propose a two-step method with a novel fused regularizer that effectively leverages samples from source tasks to improve learning performance on a target task with limited samples. A non-asymptotic bound is provided for the estimation error of the target model, showing the robustness of the proposed method to covariate shifts. We further establish conditions under which the estimator is minimax-optimal. Additionally, we extend the method to a distributed setting, allowing for a pretraining-finetuning strategy that requires just one round of communication while retaining the estimation rate of the centralized version. Numerical tests validate our theory, highlighting the method's robustness to covariate shifts.
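As a rough illustration of what a fused regularizer of this kind can look like (a generic form for intuition, not necessarily the paper's exact two-step objective), consider jointly estimating target coefficients $\beta^{(0)}$ and source coefficients $\beta^{(k)}$ via
$$\min_{\beta^{(0)},\beta^{(1)},\dots,\beta^{(K)}} \;\sum_{k=0}^{K} \frac{1}{2n_k}\bigl\|y^{(k)} - X^{(k)}\beta^{(k)}\bigr\|_2^2 + \lambda_0\bigl\|\beta^{(0)}\bigr\|_1 + \sum_{k=1}^{K} \lambda_k\bigl\|\beta^{(k)} - \beta^{(0)}\bigr\|_1,$$
where the fused $\ell_1$ terms shrink each source model toward the target, so that informative sources contribute to the target estimate while dissimilar ones are effectively discounted.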


AdaTrans: Feature-wise and Sample-wise Adaptive Transfer Learning for High-dimensional Regression

arXiv.org Machine Learning

We consider the transfer learning problem in the high-dimensional setting, where the feature dimension is larger than the sample size. To learn transferable information, which may vary across features or across source samples, we propose an adaptive transfer learning method that can detect and aggregate feature-wise (F-AdaTrans) or sample-wise (S-AdaTrans) transferable structures. We achieve this by employing a novel fused penalty, coupled with weights that adapt to the transferable structure. To choose the weights, we propose a theoretically informed, data-driven procedure that enables F-AdaTrans to selectively fuse the transferable signals with the target while filtering out non-transferable signals, and S-AdaTrans to obtain the optimal combination of information transferred from each source sample. Non-asymptotic rates are established, which recover existing near-minimax optimal rates in special cases. The effectiveness of the proposed method is validated using both synthetic and real data.
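For the feature-wise variant, the adaptive weights can be thought of as deciding, coordinate by coordinate, how strongly to fuse the target estimate toward a source estimate. Below is a hypothetical proximal-gradient sketch of such a weighted fused penalty; the weight choice and the actual F-AdaTrans/S-AdaTrans procedures in the paper are more involved.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_fused_regression_sketch(X, y, beta_src, weights, n_steps=500):
    """Illustrative feature-wise fused estimator (not the paper's algorithm).

    Solves  min_b  (1/2n) ||y - X b||^2 + sum_j weights_j * |b_j - beta_src_j|
    by proximal gradient descent. A large weights_j fuses coordinate j to the
    source estimate; a small weights_j lets it be fit freely from target data.
    """
    n = len(y)
    lr = 1.0 / (np.linalg.norm(X, ord=2) ** 2 / n)  # inverse Lipschitz constant
    beta = beta_src.copy()
    for _ in range(n_steps):
        grad = X.T @ (X @ beta - y) / n
        z = beta - lr * grad
        # Prox of the weighted fused penalty: soft-threshold toward beta_src.
        beta = beta_src + soft_threshold(z - beta_src, lr * weights)
    return beta
```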


Enhancing Robustness of Gradient-Boosted Decision Trees through One-Hot Encoding and Regularization

arXiv.org Artificial Intelligence

Gradient-boosted decision trees (GBDT) are a widely used and highly effective machine learning approach for tabular data modeling. However, their complex structure may lead to low robustness against small covariate perturbations in unseen data. In this study, we apply one-hot encoding to convert a GBDT model into a linear framework, by encoding each tree leaf as a dummy variable. This allows the use of linear regression techniques, as well as a novel risk decomposition for assessing the robustness of a GBDT model against covariate perturbations. We propose to enhance the robustness of GBDT models by refitting their linear regression forms with $L_1$ or $L_2$ regularization. Theoretical results are obtained on the effect of regularization on model performance and robustness. Numerical experiments demonstrate that the proposed regularization approach can enhance the robustness of one-hot-encoded GBDT models.
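The leaf-encoding idea maps cleanly onto standard tooling. A minimal scikit-learn sketch of the pipeline (an illustration with toy data and default regularization, not the authors' code): extract each sample's leaf indices, one-hot encode them, and refit a regularized linear model on the resulting dummy variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=500)

# 1. Fit the GBDT model.
gbdt = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# 2. One-hot encode the leaf assignment of every tree (one dummy per leaf).
leaves = gbdt.apply(X)                      # shape (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore")
Z = encoder.fit_transform(leaves)           # sparse dummy-variable design

# 3. Refit the linear form of the ensemble with L2 regularization.
linear_refit = Ridge(alpha=1.0).fit(Z, y)

# Predictions of the regularized, one-hot-encoded surrogate model:
y_hat = linear_refit.predict(encoder.transform(gbdt.apply(X)))
```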


Detection and Estimation of Structural Breaks in High-Dimensional Functional Time Series

arXiv.org Machine Learning

Modelling functional time series, that is, time series of random functions defined on a finite interval, has become one of the main frontiers of development in time series modelling. Various functional linear and nonlinear time series models have been proposed and extensively studied in the past two decades (e.g., Bosq, 2000; Hörmann and Kokoszka, 2010; Horváth and Kokoszka, 2012; Hörmann, Horváth and Reeder, 2013; Li, Robinson and Shang, 2020). These models, together with the relevant methodologies, have been applied to fields such as biology, demography, economics, environmental science and finance. However, the model frameworks and methodologies developed in the aforementioned literature rely heavily on the stationarity assumption, which is often rejected when testing functional time series data in practice. For example, Horváth, Kokoszka and Rice (2014) find evidence of nonstationarity for intraday price curves of some stocks collected in the US market; Aue, Rice and Sönmez (2018) reject the null hypothesis of stationarity for temperature curves collected in Australia; and Li, Robinson and Shang (2023) reveal evidence of nonstationarity in the functional time series constructed from age- and sex-specific life-table death counts. It thus becomes imperative to test whether collected functional time series are stationary. The primary interest of this paper is to test whether there exist structural breaks in the mean function over time and, if they exist, to subsequently estimate the break locations. There has been increasing interest in detecting and estimating structural breaks in functional time series. Broadly speaking, there are two types of detection techniques.
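To make the target problem concrete: for curves observed on a common grid, a textbook CUSUM-type scan for a break in the mean function compares partial-sample means against the overall mean at each candidate time. The sketch below is a generic illustration of that idea, not the detection or estimation procedure developed in the paper.

```python
import numpy as np

def cusum_mean_break(curves):
    """Generic CUSUM scan for a single break in the mean of functional data.

    curves: array of shape (T, p); each row is one curve observed on a common
    grid of p points. Returns a candidate break location and the scan statistic.
    """
    T = curves.shape[0]
    overall_mean = curves.mean(axis=0)
    stats = np.zeros(T)
    for k in range(1, T):
        # Squared L2 norm of the partial-sum (CUSUM) process at time k.
        partial_sum = curves[:k].sum(axis=0) - k * overall_mean
        stats[k] = np.sum(partial_sum ** 2) / T
    k_hat = int(np.argmax(stats))
    return k_hat, stats[k_hat]
```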