vertex
Contradiction Graphs Determine VC Dimension
Campbell, Jesse, Ibaibarriaga, Daniel, Reyzin, Lev
The Vapnik-Chervonenkis dimension is the fundamental combinatorial parameter of distribution-free binary classification. Introduced by Vapnik and Chervonenkis in their work on uniform convergence [VC71], and closely connected to the Sauer-Shelah lemma [Sau72, She72], it characterizes classical PAC learnability [Val84, BEHW89, EHKV89]. In particular, finite VC dimension is equivalent to distribution-free learnability. This paper asks whether that finite-versus-infinite VC dichotomy is still visible after replacing a concept class by its contradiction graphs. For a binary class H {0,1}X, the order-m contradiction graph Gm(H) has as vertices the H-realizable labeled samples of length m, with an edge between two samples if they assign opposite labels to some common domain point. Throughout, samples are ordered sequences, and repetitions of domain points are allowed, subject to consistent labeling. We use the contradiction-graph framework introduced by Alon et al. in their graph-theoretic characterization of private learnability [AMSY24]. They ask whether two binary classes can have isomorphic contradiction graphs at every level while one has finite VC dimension and the other has infinite VC dimension.
Testing properties of trees in graphical models with covariance queries
Burova, Sofiya, Calvillo, Francisco, Lugosi, Gábor, Zwiernik, Piotr
We consider the problem of testing properties of graphs underlying high-dimensional graphical models. We adopt the model of covariance queries introduced by Lugosi, Truszkowski, Velona, and Zwiernik (2021). We study the case when the underlying graph is a tree. The main results of the paper show that, while reconstructing the entire tree may be costly, certain global structural properties can be tested efficiently. In particular, we design randomized tests for global structural properties that use a sub-quadratic number of queries. We develop testing procedures for several fundamental properties, including the number of leaves, the maximum degree, the typical distance, and the diameter of the tree. For each property, we obtain explicit query complexity bounds that depend on the target threshold and tolerance parameters.
A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables
Li, Zheng, Xie, Feng, Nie, Shenglan, Guo, Xichen, Wang, Ruxin, Zhang, Hao
Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.
Streaming Algorithms and Lower Bounds for Estimating Correlation Clustering Cost
Correlation clustering is a fundamental optimization problem at the intersection of machine learning and theoretical computer science. Motivated by applications to big data processing, recent years have witnessed a flurry of results on this problem in the streaming model. In this model, the algorithm needs to process the input n-vertex graph by making one or few passes over the stream of its edges and using a limited memory, much smaller than the input size. All previous work on streaming correlation clustering has focused on semistreaming algorithms with Ω(n) memory, whereas in this work, we study streaming algorithms with much smaller memory requirements of only polylog(n) bits. This stringent memory requirement is in the same spirit of classical streaming algorithms that instead of recovering a full solution to the problem--which can be prohibitively large with such small memory as is the case in our problem--, aimed to learn certain statistical properties of their inputs.
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least 0.1n steps on instances of size nbefore it encounters any of the 5nearest neighbors of the query.