Mathematical & Statistical Methods
Planted Bipartite Graph Detection
Rotenberg, Asaf, Huleihel, Wasim, Shayevitz, Ofer
We consider the task of detecting a hidden bipartite subgraph in a given random graph. Specifically, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We derive asymptotically tight upper and lower bounds for this detection problem in both the dense regime, where $q,p = \Theta\left(1\right)$, and the sparse regime where $q,p = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. Moreover, we consider a variant of the above problem, where one can only observe a relatively small part of the graph, by using at most $\mathsf{Q}$ edge queries. For this problem, we derive upper and lower bounds in both the dense and sparse regimes.
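The two hypotheses above can be simulated with a minimal sketch. It assumes that the planted edges are those between two random disjoint vertex sets of sizes $k_{\mathsf{R}}$ and $k_{\mathsf{L}}$, and that every other pair (including pairs inside each set) keeps density $q$; the abstract does not spell this out. The names `sample_graph` and `edge_count_statistic` are illustrative, and the total-edge-count statistic is only a simple baseline, not necessarily the test analyzed by the authors.

```python
import numpy as np

def sample_graph(n, q, rng, k_r=0, k_l=0, p=None):
    """Sample a symmetric 0/1 adjacency matrix.

    With k_r == 0 or k_l == 0 this is the null model G(n, q). Otherwise edges
    between two random disjoint vertex sets of sizes k_r and k_l get density p
    (assumption: every other pair, including pairs inside each set, keeps q).
    """
    probs = np.full((n, n), float(q))
    if k_r > 0 and k_l > 0:
        verts = rng.permutation(n)
        right, left = verts[:k_r], verts[k_r:k_r + k_l]
        probs[np.ix_(right, left)] = p
        probs[np.ix_(left, right)] = p
    # One coin flip per unordered pair: keep the strict upper triangle only.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)

def edge_count_statistic(adj, q):
    """Total edge count, centered and scaled by its mean and std under the null."""
    n = adj.shape[0]
    pairs = n * (n - 1) / 2
    edges = adj.sum() / 2
    return (edges - q * pairs) / np.sqrt(pairs * q * (1 - q))

rng = np.random.default_rng(0)
null_graph = sample_graph(200, 0.1, rng)
planted_graph = sample_graph(200, 0.1, rng, k_r=40, k_l=40, p=0.5)
print(edge_count_statistic(null_graph, 0.1))
print(edge_count_statistic(planted_graph, 0.1))
```

Thresholding the statistic gives a crude detector; it ignores the structure of the planted set, so it should be expected to work only when the planted part is large.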
Root Laplacian Eigenmaps with their application in spectral embedding
The root Laplacian operator, or square root of the Laplacian, which can be obtained on complete Riemannian manifolds in the Gromov sense, has an analog in graph theory: the square root of the graph Laplacian. Some potential applications have been shown in geometric deep learning (spectral clustering) and graph signal processing.
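For a finite undirected graph, one way to form the square root of the graph Laplacian is through its eigendecomposition. The sketch below assumes a symmetric adjacency matrix and the combinatorial Laplacian $L = D - A$; the function name `root_laplacian` is illustrative and the abstract does not prescribe this construction.

```python
import numpy as np

def root_laplacian(adjacency):
    """Return the symmetric positive semidefinite square root of L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)       # L is symmetric, so eigh applies
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Path graph on four vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
R = root_laplacian(A)
print(np.allclose(R @ R, L))
```

The root shares its eigenvectors with $L$, so spectral embeddings built from either operator use the same coordinates and differ only in how the eigenvalues are weighted.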
Simulation algorithms for Markovian and non-Markovian epidemics
Researchers have employed stochastic simulations to determine the validity of their theoretical findings and to study analytically intractable spreading dynamics. In both cases, the correctness and efficiency of the simulation algorithm are of paramount importance. We prove in this article that the Next Reaction Method and the non-Markovian Gillespie algorithm, two algorithms for simulating non-Markovian epidemics, are statistically equivalent. We also study their performance and applicability under various circumstances through complexity analyses and numerical experiments. In our numerical simulations, we apply the Next Reaction Method and the Gillespie algorithm to epidemic simulations on time-varying networks and epidemic simulations with cooperative infections. Both tasks have previously been done only with the Gillespie algorithm; we show that the Next Reaction Method is a good alternative. We believe this article may also serve as a guide for choosing simulation algorithms that are both correct and efficient for researchers from epidemiology and beyond.
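As a point of reference, the classical Markovian Gillespie (direct) method can be sketched for a well-mixed SIR model. This is only the memoryless baseline, not the non-Markovian algorithms compared in the article; the model, the function name `gillespie_sir`, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def gillespie_sir(n, i0, beta, gamma, rng):
    """Simulate a well-mixed Markovian SIR epidemic until no one is infected.

    Returns a list of (time, S, I, R) tuples, one per event.
    """
    s, i, r = n - i0, i0, 0
    t = 0.0
    history = [(t, s, i, r)]
    while i > 0:
        rate_infection = beta * s * i / n
        rate_recovery = gamma * i
        total_rate = rate_infection + rate_recovery
        # Waiting time to the next event is exponential with the total rate.
        t += rng.exponential(1.0 / total_rate)
        # Choose which event fires, with probability proportional to its rate.
        if rng.random() * total_rate < rate_infection:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        history.append((t, s, i, r))
    return history

rng = np.random.default_rng(0)
trajectory = gillespie_sir(n=500, i0=5, beta=0.3, gamma=0.1, rng=rng)
print(len(trajectory), trajectory[-1])
```

In the non-Markovian setting the waiting times are no longer exponential, which is where the Next Reaction Method and the non-Markovian Gillespie algorithm depart from this sketch.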
Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs
Huang, Shenyang, Coulombe, Samy, Hitti, Yasmeen, Rabbany, Reihaneh, Rabusseau, Guillaume
Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real-world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: (i) how to compare graph snapshots across time, (ii) how to capture temporal dependencies, and (iii) how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD), which uses the spectrum of the graph Laplacian as the low-dimensional embedding of the graph structure at each snapshot. LAD explicitly models short-term and long-term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that (i) LAD and MultiLAD are accurate and outperform state-of-the-art baselines and their multi-view extensions by a large margin, (ii) MultiLAD's advantage over contenders significantly increases when additional views are available, and (iii) MultiLAD is highly robust to noise from individual views. In five real-world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies, such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.
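The ingredients named above (a spectral signature per snapshot, a comparison against a window of past snapshots, and a scalar power mean across views) can be sketched as follows. This is a simplified illustration and not the authors' implementation: the number of singular values `k`, the power `p`, and the use of a plain mean to summarize the window are all assumptions, and the helper names are hypothetical.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    return np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def spectrum_signature(A, k):
    """Top-k singular values of the normalized Laplacian, scaled to unit norm."""
    s = np.linalg.svd(normalized_laplacian(A), compute_uv=False)[:k]
    return s / np.linalg.norm(s)

def power_mean(values, p):
    """Element-wise scalar power mean over the first axis (one row per view)."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** p, axis=0) ** (1.0 / p)

def change_score(current, history):
    """One minus cosine similarity to the (assumed) mean signature of a window."""
    typical = np.mean(history, axis=0)
    typical = typical / np.linalg.norm(typical)
    return 1.0 - float(current @ typical)

rng = np.random.default_rng(0)

def random_graph(n, density):
    upper = np.triu(rng.random((n, n)) < density, k=1)
    return (upper | upper.T).astype(float)

window = np.array([spectrum_signature(random_graph(30, 0.2), k=6) for _ in range(5)])
score = change_score(spectrum_signature(random_graph(30, 0.8), k=6), window)
views = [spectrum_signature(random_graph(30, 0.2), k=6) for _ in range(2)]
multi_view_signature = power_mean(views, p=2)
print(score, multi_view_signature.shape)
```

Running `change_score` once against a short window and once against a long window gives two scores per snapshot, which is one way to read the two sliding windows mentioned above.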
Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19
Chang, Feier, Krishnan, Jay, Hurst, Jillian H, Yarrington, Michael E, Anderson, Deverick J, O'Brien, Emily C, Goldstein, Benjamin A
To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. We conducted a retrospective data analysis using chart review-based validation. Participants were 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.
A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee
Song, Zhao, Ye, Mingquan, Yin, Junze, Zhang, Lichen
Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such an $\ell_2$ regression problem is via sketching: pick a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ such that $SA$ can be quickly computed, then solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a number of rows super-linear in $d$, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$, is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.
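The sketch-and-solve recipe described above can be illustrated with a dense Gaussian sketching matrix. This is a stand-in chosen for simplicity, not the distribution constructed in the paper: a Gaussian $S$ is dense, but computing $SA$ costs $O(mnd)$ rather than $O(nd\log n)$. The function name `sketch_and_solve` and the problem sizes are assumptions.

```python
import numpy as np

def sketch_and_solve(A, b, m, rng):
    """Solve min_x ||S A x - S b||_2 for a dense m-by-n Gaussian sketch S."""
    n = A.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # dense sketch (assumption)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

rng = np.random.default_rng(0)
n, d, m = 2000, 10, 200
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # exact least-squares solution
x_sketch = sketch_and_solve(A, b, m, rng)

# Coordinate-wise error, the quantity bounded by the l_infinity guarantee.
linf_error = np.max(np.abs(x_sketch - x_star))
print(linf_error)
```

Comparing `linf_error` with $\frac{\epsilon}{\sqrt{d}}\,\|Ax^*-b\|_2\,\|A^\dagger\|$ for a chosen $\epsilon$ is one way to check the guarantee empirically on a given instance.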
Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model
Wang, Zhihai, Li, Xijun, Wang, Jie, Kuang, Yufei, Yuan, Mingxuan, Zeng, Jia, Zhang, Yongdong, Wu, Feng
Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection -- which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs -- heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred has a significant impact on the efficiency of solving MILPs as well. To address these challenges, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, and (2) a lower-level model -- that formulates the cut selection task as a sequence to sequence learning problem -- to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM generalizes well to MILPs that are significantly larger than those seen during training.
Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms
Behjoo, Hamidreza, Chertkov, Michael
Inference efforts -- required to compute the partition function, $Z$, of an Ising model over a graph of $N$ ``spins'' -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately by minimizing the respective (BP- or TRW-) free energy. We generalize the variational scheme by building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact'') $\lambda_*$ such that $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \cite{wainwright_tree-based_2002} and the loop series approach of \cite{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium and large sizes. The empirical study yields a number of interesting observations, such as (a) the ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.
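To make the exponential cost of exact inference concrete, the partition function of a very small Ising model can be computed by enumerating all $2^N$ spin configurations. The sketch assumes the convention $Z=\sum_{\sigma\in\{\pm 1\}^N}\exp\big(\sum_{i<j}J_{ij}\sigma_i\sigma_j+\sum_i h_i\sigma_i\big)$, which the abstract does not fix; the function name `ising_partition_function` is illustrative, and this brute-force sum is not the fractional scheme itself.

```python
import itertools
import numpy as np

def ising_partition_function(J, h):
    """Exact Z by brute force; J is a symmetric coupling matrix, h the fields."""
    J = np.asarray(J, dtype=float)
    h = np.asarray(h, dtype=float)
    upper = np.triu(J, k=1)            # count each pair (i, j) with i < j once
    total = 0.0
    for spins in itertools.product((-1.0, 1.0), repeat=len(h)):
        s = np.array(spins)
        total += np.exp(s @ upper @ s + h @ s)
    return total

# Two spins joined by one attractive (ferromagnetic) coupling, no field.
J = np.array([[0.0, 0.7],
              [0.7, 0.0]])
Z = ising_partition_function(J, np.zeros(2))
print(Z, 4 * np.cosh(0.7))
```

The loop visits $2^N$ configurations, so it is only usable as a ground truth for small instances against which approximations such as $Z^{(BP)}$ or $Z^{(TRW)}$ could be compared.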
Overview of Graph Theory and Alzheimer's Disease
The Roman physician Galen was among the first people to realize that the brain controlled motor responses, cognitive function, and memory. Ever since Galen, the question of how the brain does so has propelled the field of neuroscience. Beginning with Paul Broca's work in the 1800s, brain function has been described in terms of modular separation: each region in the brain controls a unique set of behaviors, actions, and capacities. This determination was made through observation of patients suffering neurological symptoms and connecting them to localized brain injuries. For example, Broca's area (a brain region in the inferior frontal gyrus) was found to be responsible for speech fluency (Acharya and Wroten 2022), and was discovered by studying two subjects, both of whom exhibited reduced speech capacity and suffered from lesions in the same area of the brain.
Geometric ergodicity of SGLD via reflection coupling
Li, Lei, Liu, Jian-Guo, Wang, Yuliang
The Stochastic Gradient Langevin Dynamics (SGLD), first introduced by Welling and Teh [25], has attracted a lot of attention in various areas [18, 26, 4]. The SGLD algorithm and its variants perform remarkably well on many practical sampling and optimization tasks. Recent years have witnessed great development of theoretical research on SGLD, where most researchers focus on its discretization error, namely, the "distance" between the SGLD algorithm and the corresponding Langevin diffusion in terms of the time step (or learning rate) η [12, 18, 26, 16]. The SGLD itself can be regarded as a stochastic process, and its ergodicity is also of great importance. So far, the justification of the geometric ergodicity of SGLD mostly relies on strong convexity conditions, namely, the strong log-concavity of the target distribution. In [4], under strong convexity settings, the authors considered synchronous coupling and established the geometric ergodicity of SGLD and some other numerical schemes in terms of the Wasserstein-2 distance. However, the strong convexity assumption seems to limit the applicability of the result, and the ergodicity of the SGLD algorithm in a general setting and the existence of an invariant measure are still unclear. In this paper, we study the geometric ergodicity of SGLD in a locally nonconvex setting. The main technique we apply is reflection coupling [8], which was originally designed to study the contraction property of many continuous SDEs.
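For readers unfamiliar with the iteration being analyzed, a commonly used form of the SGLD update is $x_{k+1}=x_k-\eta\,\nabla U^{\xi_k}(x_k)+\sqrt{2\eta}\,Z_k$, where $\nabla U^{\xi_k}$ is a minibatch gradient and $Z_k$ is standard Gaussian noise. The text above does not write the update out, so this form, the constant step size, the quadratic potential, and the names `sgld` and `grad_batch` are assumptions made for illustration; the reflection coupling argument itself is not shown.

```python
import numpy as np

def sgld(grad_batch, data, x0, eta, batch_size, n_steps, rng):
    """Run SGLD with a constant step size eta and return the whole path."""
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps + 1,) + x.shape)
    path[0] = x
    for k in range(n_steps):
        # Random minibatch, drawn without replacement.
        idx = rng.choice(len(data), size=batch_size, replace=False)
        noise = rng.standard_normal(x.shape)
        x = x - eta * grad_batch(x, data[idx]) + np.sqrt(2.0 * eta) * noise
        path[k + 1] = x
    return path

# Toy potential U(x) = (1/N) * sum_i (x - y_i)^2 / 2, so the minibatch
# gradient is x minus the minibatch mean (a strongly convex example).
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=200)

def grad_batch(x, batch):
    return x - batch.mean()

n_steps = 20000
path = sgld(grad_batch, data, x0=np.zeros(1), eta=0.05,
            batch_size=20, n_steps=n_steps, rng=rng)
print(path[n_steps // 2:].mean(), data.mean())
```

This toy potential is strongly convex, which is exactly the setting the paper aims to go beyond; replacing `grad_batch` with the gradient of a potential that is nonconvex in a bounded region gives an instance of the locally nonconvex case.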