Performance Analysis
Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics
We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hyper-cube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we found that a simple algorithm based on Newton's method is able to solve the projection problem to high precision with a complexity roughly about O(n), which has a much lower computational cost compared with the existing sorting-based methods proposed in the literature. We provide a theory for partial explanation and justification of the method. We demonstrate that the proposed algorithm can produce a solution of the projection problem with high precision on large scale datasets, and the algorithm is able to significantly outperform the state-of-the-art methods in terms of runtime (about 6-8 times faster than a commercial software with respect to CPU time for input vector with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on solving sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when using the proposed method to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problem and it is more efficient (about 3-6 times faster) than the current state-of-the-art methods.
Practical Near Neighbor Search via Group Testing: Supplementary Materials
In this section, we provide proofs for all of the theorems introduced in the main text. We begin with a simple extension of the results of [3] for the Bloom filter false positive and negative rates. Then, we prove our main claim, which is that the query time of our data structure is sublinear, given some relatively weak assumptions on the stability of the query. Theorem 1. Assuming the existence of an LSH family with collision probability s(x,y) = sim(x,y), the distance-sensitive Bloom filter solves the approximate membership query problem with p 1 exp 2m t/m+ SLH We begin with a brief explanation of the results from [3]. Recall that a distance-sensitive Bloom filter is a collection of mbit arrays. Array iis indexed using an independent LSH function li(x). To insert a point xinto the ith array, we set the bit at location li(x) to '1.' To query the filter, we calculate the mhash values of the query and return "true" when at least tof the corresponding bits are '1.' To bound p (the true positive rate) and q (the false positive rate), we bound the probability that a single array returns "true."
Practical Near Neighbor Search via Group Testing
We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as "positives," non-neighbors as "negatives," and approximate membership queries as group tests.
Federated Compositional Deep AUCMaximization
Federated learning has attracted increasing attention due to the promise of balancing privacy and large-scale learning; numerous approaches have been proposed. However, most existing approaches focus on problems with balanced data, and prediction performance is far from satisfactory for many real-world applications where the number of samples in different classes is highly imbalanced. To address this challenging problem, we developed a novel federated learning method for imbalanced data by directly optimizing the area under curve (AUC) score. In particular, we formulate the AUC maximization problem as a federated compositional minimax optimization problem, develop a local stochastic compositional gradient descent ascent with momentum algorithm, and provide bounds on the computational and communication complexities of our algorithm. To the best of our knowledge, this is the first work to achieve such favorable theoretical results. Finally, extensive experimental results confirm the efficacy of our method.
Fair Classification with Adversarial Perturbations
We study fair classification in the presence of an omniscient adversary that, given an ฮท, is allowed to choose an arbitrary ฮท-fraction of the training samples and arbitrarily perturb their protected attributes. The motivation comes from settings in which protected attributes can be incorrect due to strategic misreporting, malicious actors, or errors in imputation; and prior approaches that make stochastic or independence assumptions on errors may not satisfy their guarantees in this adversarial setting. Our main contribution is an optimization framework to learn fair classifiers in this adversarial setting that comes with provable guarantees on accuracy and fairness. Our framework works with multiple and non-binary protected attributes, is designed for the large class of linear-fractional fairness metrics, and can also handle perturbations besides protected attributes. We prove near-tightness of our framework's guarantees for natural hypothesis classes: no algorithm can have significantly better accuracy and any algorithm with better fairness must have lower accuracy. Empirically, we evaluate the classifiers produced by our framework for statistical rate on real-world and synthetic datasets for a family of adversaries.