Dasarathy, Gautam
Advanced Tutorial: Label-Efficient Two-Sample Tests
Li, Weizhi, Berisha, Visar, Dasarathy, Gautam
Hypothesis testing is a statistical inference approach used to determine whether data supports a specific hypothesis. An important type is the two-sample test, which evaluates whether two sets of data points are from identical distributions. This test is widely used, such as by clinical researchers comparing treatment effectiveness. This tutorial explores two-sample testing in a context where an analyst has many features from two samples, but determining the sample membership (or labels) of these features is costly. In machine learning, a similar scenario is studied in active learning. This tutorial extends active learning concepts to two-sample testing within this \textit{label-costly} setting while maintaining statistical validity and high testing power. Additionally, the tutorial discusses practical applications of these label-efficient two-sample tests.
Structure Learning in Gaussian Graphical Models from Glauber Dynamics
Tirukkonda, Vignesh, Rayas, Anirudh, Dasarathy, Gautam
Gaussian graphical model selection is an important paradigm with numerous applications, including biological network modeling, financial network modeling, and social network analysis. Traditional approaches assume access to independent and identically distributed (i.i.d) samples, which is often impractical in real-world scenarios. In this paper, we address Gaussian graphical model selection under observations from a more realistic dependent stochastic process known as Glauber dynamics. Glauber dynamics, also called the Gibbs sampler, is a Markov chain that sequentially updates the variables of the underlying model based on the statistics of the remaining model. Such models, aside from frequently being employed to generate samples from complex multivariate distributions, naturally arise in various settings, such as opinion consensus in social networks and clearing/stock-price dynamics in financial networks. In contrast to the extensive body of existing work, we present the first algorithm for Gaussian graphical model selection when data are sampled according to the Glauber dynamics. We provide theoretical guarantees on the computational and statistical complexity of the proposed algorithm's structure learning performance. Additionally, we provide information-theoretic lower bounds on the statistical complexity and show that our algorithm is nearly minimax optimal for a broad class of problems.
Learning Networks from Wide-Sense Stationary Stochastic Processes
Rayas, Anirudh, Cheng, Jiajun, Anguluri, Rajasekhar, Deka, Deepjyoti, Dasarathy, Gautam
Complex networked systems driven by latent inputs are common in fields like neuroscience, finance, and engineering. A key inference problem here is to learn edge connectivity from node outputs (potentials). We focus on systems governed by steady-state linear conservation laws: $X_t = {L^{\ast}}Y_{t}$, where $X_t, Y_t \in \mathbb{R}^p$ denote inputs and potentials, respectively, and the sparsity pattern of the $p \times p$ Laplacian $L^{\ast}$ encodes the edge structure. Assuming $X_t$ to be a wide-sense stationary stochastic process with a known spectral density matrix, we learn the support of $L^{\ast}$ from temporally correlated samples of $Y_t$ via an $\ell_1$-regularized Whittle's maximum likelihood estimator (MLE). The regularization is particularly useful for learning large-scale networks in the high-dimensional setting where the network size $p$ significantly exceeds the number of samples $n$. We show that the MLE problem is strictly convex, admitting a unique solution. Under a novel mutual incoherence condition and certain sufficient conditions on $(n, p, d)$, we show that the ML estimate recovers the sparsity pattern of $L^\ast$ with high probability, where $d$ is the maximum degree of the graph underlying $L^{\ast}$. We provide recovery guarantees for $L^\ast$ in element-wise maximum, Frobenius, and operator norms. Finally, we complement our theoretical results with several simulation studies on synthetic and benchmark datasets, including engineered systems (power and water networks), and real-world datasets from neural systems (such as the human brain).
Communication-Efficient Federated Learning over Wireless Channels via Gradient Sketching
Gattani, Vineet Sunil, Zhang, Junshan, Dasarathy, Gautam
Large-scale federated learning (FL) over wireless multiple access channels (MACs) has emerged as a crucial learning paradigm with a wide range of applications. However, its widespread adoption is hindered by several major challenges, including limited bandwidth shared by many edge devices, noisy and erroneous wireless communications, and heterogeneous datasets with different distributions across edge devices. To overcome these fundamental challenges, we propose Federated Proximal Sketching (FPS), tailored towards band-limited wireless channels and handling data heterogeneity across edge devices. FPS uses a count sketch data structure to address the bandwidth bottleneck and enable efficient compression while maintaining accurate estimation of significant coordinates. Additionally, we modify the loss function in FPS such that it is equipped to deal with varying degrees of data heterogeneity. We establish the convergence of the FPS algorithm under mild technical conditions and characterize how the bias induced due to factors like data heterogeneity and noisy wireless channels play a role in the overall result. We complement the proposed theoretical framework with numerical experiments that demonstrate the stability, accuracy, and efficiency of FPS in comparison to state-of-the-art methods on both synthetic and real-world datasets. Overall, our results show that FPS is a promising solution to tackling the above challenges of FL over wireless MACs.
Unraveling overoptimism and publication bias in ML-driven science
Saidi, Pouria, Dasarathy, Gautam, Berisha, Visar
Machine Learning (ML) is increasingly used across many disciplines with impressive reported results. However, recent studies suggest published performance of ML models are often overoptimistic. Validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models, contrasting with the theory of learning curves where accuracy should improve or remain stable with increasing sample size. This paper investigates factors contributing to overoptimism in ML-driven science, focusing on overfitting and publication bias. We introduce a novel stochastic model for observed accuracy, integrating parametric learning curves and the aforementioned biases. We construct an estimator that corrects for these biases in observed data. Theoretical and empirical results show that our framework can estimate the underlying learning curve, providing realistic performance assessments from published results. Applying the model to meta-analyses of classifications of neurological conditions, we estimate the inherent limits of ML-based prediction in each domain.
Active Sequential Two-Sample Testing
Li, Weizhi, Ramamurthy, Karthikeyan Natesan, Kadambi, Prad, Saidi, Pouria, Dasarathy, Gautam, Berisha, Visar
Two-sample testing tests whether the distributions generating two samples are identical. We pose the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. We devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries} sample labels to address the problem. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is given by a classification model. The classification model is adaptively updated and then used to guide an active query scheme called bimodal query to label sample features in the regions with high dependency between the feature variables and the label variables. The theoretical contributions in the paper include proof that our framework produces an \emph{anytime-valid} $p$-value; and, under reachable conditions and a mild assumption, the framework asymptotically generates a minimum normalized log-likelihood ratio statistic that a passive query scheme can only achieve when the feature variable and the label variable have the highest dependence. Lastly, we provide a \emph{query-switching (QS)} algorithm to decide when to switch from passive query to active query and adapt bimodal query to increase the testing power of our test. Extensive experiments justify our theoretical contributions and the effectiveness of QS.
A Machine Learning Framework for Event Identification via Modal Analysis of PMU Data
Bazargani, Nima T., Dasarathy, Gautam, Sankar, Lalitha, Kosut, Oliver
Power systems are prone to a variety of events (e.g. line trips and generation loss) and real-time identification of such events is crucial in terms of situational awareness, reliability, and security. Using measurements from multiple synchrophasors, i.e., phasor measurement units (PMUs), we propose to identify events by extracting features based on modal dynamics. We combine such traditional physics-based feature extraction methods with machine learning to distinguish different event types. Including all measurement channels at each PMU allows exploiting diverse features but also requires learning classification models over a high-dimensional space. To address this issue, various feature selection methods are implemented to choose the best subset of features. Using the obtained subset of features, we investigate the performance of two well-known classification models, namely, logistic regression (LR) and support vector machines (SVM) to identify generation loss and line trip events in two datasets. The first dataset is obtained from simulated generation loss and line trip events in the Texas 2000-bus synthetic grid. The second is a proprietary dataset with labeled events obtained from a large utility in the USA involving measurements from nearly 500 PMUs. Our results indicate that the proposed framework is promising for identifying the two types of events.
Controllability of Coarsely Measured Networked Linear Dynamical Systems (Extended Version)
Ghoroghchian, Nafiseh, Anguluri, Rajasekhar, Dasarathy, Gautam, Draper, Stark C.
We consider the controllability of large-scale linear networked dynamical systems when complete knowledge of network structure is unavailable and knowledge is limited to coarse summaries. We provide conditions under which average controllability of the fine-scale system can be well approximated by average controllability of the (synthesized, reduced-order) coarse-scale system. To this end, we require knowledge of some inherent parametric structure of the fine-scale network that makes this type of approximation possible. Therefore, we assume that the underlying fine-scale network is generated by the stochastic block model (SBM) -- often studied in community detection. We then provide an algorithm that directly estimates the average controllability of the fine-scale system using a coarse summary of SBM. Our analysis indicates the necessity of underlying structure (e.g., in-built communities) to be able to quantify accurately the controllability from coarsely characterized networked dynamics. We also compare our method to that of the reduced-order method and highlight the regimes where both can outperform each other. Finally, we provide simulations to confirm our theoretical results for different scalings of network size and density, and the parameter that captures how much community-structure is retained in the coarse summary.
Learning the Structure of Large Networked Systems Obeying Conservation Laws
Rayas, Anirudh, Anguluri, Rajasekhar, Dasarathy, Gautam
Many networked systems such as electric networks, the brain, and social networks of opinion dynamics are known to obey conservation laws. Examples of this phenomenon include the Kirchoff laws in electric networks and opinion consensus in social networks. Conservation laws in networked systems may be modeled as balance equations of the form $X = B^{*} Y$, where the sparsity pattern of $B^{*}$ captures the connectivity of the network, and $Y, X \in \mathbb{R}^p$ are vectors of "potentials" and "injected flows" at the nodes respectively. The node potentials $Y$ cause flows across edges and the flows $X$ injected at the nodes are extraneous to the network dynamics. In several practical systems, the network structure is often unknown and needs to be estimated from data. Towards this, one has access to samples of the node potentials $Y$, but only the statistics of the node injections $X$. Motivated by this important problem, we study the estimation of the sparsity structure of the matrix $B^{*}$ from $n$ samples of $Y$ under the assumption that the node injections $X$ follow a Gaussian distribution with a known covariance $\Sigma_X$. We propose a new $\ell_{1}$-regularized maximum likelihood estimator for this problem in the high-dimensional regime where the size of the network $p$ is larger than sample size $n$. We show that this optimization problem is convex in the objective and admits a unique solution. Under a new mutual incoherence condition, we establish sufficient conditions on the triple $(n,p,d)$ for which exact sparsity recovery of $B^{*}$ is possible with high probability; $d$ is the degree of the graph. We also establish guarantees for the recovery of $B^{*}$ in the element-wise maximum, Frobenius, and operator norms. Finally, we complement these theoretical results with experimental validation of the performance of the proposed estimator on synthetic and real-world data.
A label efficient two-sample test
Li, Weizhi, Dasarathy, Gautam, Ramamurthy, Karthikeyan Natesan, Berisha, Visar
Two-sample tests evaluate whether two samples are realizations of the same distribution (the null hypothesis) or two different distributions (the alternative hypothesis). In the traditional formulation of this problem, the statistician has access to both the measurements (feature variables) and the group variable (label variable). However, in several important applications, feature variables can be easily measured but the binary label variable is unknown and costly to obtain. In this paper, we consider this important variation on the classical two-sample test problem and pose it as a problem of obtaining the labels of only a small number of samples in service of performing a two-sample test. We devise a label efficient three-stage framework: firstly, a classifier is trained with samples uniformly labeled to model the posterior probabilities of the labels; secondly, a novel query scheme dubbed \emph{bimodal query} is used to query labels of samples from both classes with maximum posterior probabilities, and lastly, the classical Friedman-Rafsky (FR) two-sample test is performed on the queried samples. Our theoretical analysis shows that bimodal query is optimal for two-sample testing using the FR statistic under reasonable conditions and that the three-stage framework controls the Type I error. Extensive experiments performed on synthetic, benchmark, and application-specific datasets demonstrate that the three-stage framework has decreased Type II error over uniform querying and certainty-based querying with same number of labels while controlling the Type I error. Source code for our algorithms and experimental results is available at https://github.com/wayne0908/Label-Efficient-Two-Sample.