bloom filter
Practical Near Neighbor Search via Group Testing: Supplementary Materials
In this section, we provide proofs for all of the theorems introduced in the main text. We begin with a simple extension of the results of [3] for the Bloom filter false positive and negative rates. Then, we prove our main claim, which is that the query time of our data structure is sublinear, given some relatively weak assumptions on the stability of the query. Theorem 1. Assuming the existence of an LSH family with collision probability $s(x,y) = \mathrm{sim}(x,y)$, the distance-sensitive Bloom filter solves the approximate membership query problem with $p \geq 1 - \exp\left(-2m\left(-t/m + S_H^L\right)^2\right)$. We begin with a brief explanation of the results from [3]. Recall that a distance-sensitive Bloom filter is a collection of $m$ bit arrays. Array $i$ is indexed using an independent LSH function $l_i(x)$. To insert a point $x$ into the $i$th array, we set the bit at location $l_i(x)$ to '1.' To query the filter, we calculate the $m$ hash values of the query and return "true" when at least $t$ of the corresponding bits are '1.' To bound $p$ (the true positive rate) and $q$ (the false positive rate), we bound the probability that a single array returns "true."
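The insert and query procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the parameter `bits_per_hash`, and the choice of signed random projections as the LSH family are all assumptions made for the example (signed random projections do not have collision probability exactly equal to an arbitrary sim(x, y), so the theorem's assumption is only approximated here).

```python
import random


class DistanceSensitiveBloomFilter:
    """m bit arrays, each indexed by its own LSH function (hypothetical sketch)."""

    def __init__(self, dim, m=64, t=48, bits_per_hash=8, seed=0):
        rng = random.Random(seed)
        self.m, self.t = m, t
        # Assumed LSH family: hash function i concatenates `bits_per_hash`
        # signed random projections, giving a location in [0, 2**bits_per_hash).
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_hash)]
            for _ in range(m)
        ]
        self.arrays = [[0] * (1 << bits_per_hash) for _ in range(m)]

    def _location(self, i, x):
        """Compute the location of x in array i from the signs of the projections."""
        loc = 0
        for plane in self.planes[i]:
            dot = sum(p * v for p, v in zip(plane, x))
            loc = (loc << 1) | (1 if dot >= 0 else 0)
        return loc

    def insert(self, x):
        """Set the bit at the hashed location of x in every array."""
        for i in range(self.m):
            self.arrays[i][self._location(i, x)] = 1

    def query(self, y):
        """Return True when at least t of the m probed bits are set."""
        hits = sum(self.arrays[i][self._location(i, y)] for i in range(self.m))
        return hits >= self.t
```

Querying an inserted point always returns true, because all m probed bits are set. A nearby point collides in most arrays and usually clears the threshold t, while a distant point collides in few arrays and usually does not; the threshold is what trades off the true positive rate against the false positive rate.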
A Model for Learned Bloom Filters and Optimizing by Sandwiching
Recent work has suggested enhancing Bloom filters by using a pre-filter, based on applying machine learning to determine a function that models the data set the Bloom filter is meant to represent. Here we model such learned Bloom filters, with the following outcomes: (1) we clarify what guarantees can and cannot be associated with such a structure; (2) we show how to estimate what size the learning function must obtain in order to obtain improved performance; (3) we provide a simple method, sandwiching, for optimizing learned Bloom filters; and (4) we propose a design and analysis approach for a learned Bloomier filter, based on our modeling approach.
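As a rough illustration of sandwiching, the sketch below places a scoring function between an initial Bloom filter and a backup Bloom filter. It is a toy under stated assumptions, not the paper's construction in detail: the class names, the `score` callable standing in for a trained model, the threshold, and the filter sizes are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over string keys, using salted SHA-256 as the hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SandwichedLearnedFilter:
    """Initial filter -> learned score -> backup filter (hypothetical sketch)."""

    def __init__(self, keys, score, threshold,
                 initial_bits=1024, backup_bits=1024, num_hashes=3):
        self.score, self.threshold = score, threshold
        self.initial = BloomFilter(initial_bits, num_hashes)
        self.backup = BloomFilter(backup_bits, num_hashes)
        for key in keys:
            self.initial.add(key)
            # Keys the model would reject go into the backup filter,
            # so the overall structure never produces a false negative.
            if score(key) < threshold:
                self.backup.add(key)

    def __contains__(self, key):
        if key not in self.initial:      # first filter discards most non-keys
            return False
        if self.score(key) >= self.threshold:
            return True                  # model accepts (may be a false positive)
        return key in self.backup        # catch the model's false negatives


# Toy stand-in for a learned model: scores keys starting with "a" highly.
keys = ["apple", "avocado", "banana", "cherry"]
f = SandwichedLearnedFilter(keys, lambda k: 1.0 if k.startswith("a") else 0.0, 0.5)
print(all(k in f for k in keys))  # every inserted key is reported present
```

The only guarantee this sketch gives is the absence of false negatives for the inserted keys. The false positive rate depends on how the scoring function behaves on the queries actually issued, which is why the guarantees of a learned filter differ from those of a standard Bloom filter.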
gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs
Sabir, Humera, Farooq, Fatima, Aboulnaga, Ashraf
Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node's global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
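To make the idea of a Bloom filter as a neighborhood feature concrete, the sketch below hashes each node's 1-hop neighbor IDs into a fixed-length 0/1 vector. The abstract does not specify how gHAWK builds its filters, so everything here is an assumption for illustration: the function name, the use of 1-hop neighbors only, ignoring relation types, and the filter size.

```python
import hashlib


def neighborhood_bloom_features(edges, num_bits=32, num_hashes=2):
    """Map each node to a num_bits 0/1 vector Bloom-encoding its 1-hop neighbor IDs.

    `edges` is an iterable of (head, relation, tail) triples. Relations are
    ignored in this simplified, hypothetical encoding.
    """
    features = {}
    for head, _relation, tail in edges:
        # Treat the edge as undirected: each endpoint records the other.
        for node, neighbor in ((head, tail), (tail, head)):
            vec = features.setdefault(node, [0] * num_bits)
            for i in range(num_hashes):
                digest = hashlib.sha256(f"{i}:{neighbor}".encode()).digest()
                vec[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return features


triples = [("a", "knows", "b"), ("a", "knows", "c"), ("b", "likes", "d")]
feats = neighborhood_bloom_features(triples)
print(len(feats["a"]))  # fixed vector length, independent of node degree
```

Because the vector length is fixed, it can be concatenated with other node features regardless of degree, and nodes with the same neighbor set receive identical vectors.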