Atlantic County
VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software
Huang, Chung-ju, Zhang, Ziqi, Wang, Yinggui, Wang, Binghui, Wei, Tao, Wang, Leye
Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants' data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party's inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party's computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA's random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.
Outsourced Privacy-Preserving Feature Selection Based on Fully Homomorphic Encryption
Wakiyama, Koki, I, Tomohiro, Sakamoto, Hiroshi
Feature selection is a technique that extracts a meaningful subset from a set of features in training data. When the training data is large-scale, appropriate feature selection enables the removal of redundant features, which can improve generalization performance, accelerate the training process, and enhance the interpretability of the model. This study proposes a privacy-preserving computation model for feature selection. Generally, when the data owner and analyst are the same, there is no need to conceal the private information. However, when they are different parties or when multiple owners exist, an appropriate privacy-preserving framework is required. Although various private feature selection algorithms, they all require two or more computing parties and do not guarantee security in environments where no external party can be fully trusted. To address this issue, we propose the first outsourcing algorithm for feature selection using fully homomorphic encryption. Compared to a prior two-party algorithm, our result improves the time and space complexity O(kn^2) to O(kn log^3 n) and O(kn), where k and n denote the number of features and data samples, respectively. We also implemented the proposed algorithm and conducted comparative experiments with the naive one. The experimental result shows the efficiency of our method even with small datasets.
Design and Analysis of an Extreme-Scale, High-Performance, and Modular Agent-Based Simulation Platform
Agent-based modeling is indispensable for studying complex systems across many domains. However, existing simulation platforms exhibit two major issues: performance and modularity. Low performance prevents simulations with a large number of agents, increases development time, limits parameter exploration, and raises computing costs. Inflexible software designs motivate modelers to create their own tools, diverting valuable resources. This dissertation introduces a novel simulation platform called BioDynaMo and its significant improvement, TeraAgent, to alleviate these challenges via three major works. First, we lay the platform's foundation by defining abstractions, establishing software infrastructure, and implementing a multitude of features for agent-based modeling. We demonstrate BioDynaMo's modularity through use cases in neuroscience, epidemiology, and oncology. We validate these models and show the simplicity of adding new functionality with few lines of code. Second, we perform a rigorous performance analysis and identify challenges for shared-memory parallelism. Provided solutions include an optimized grid for neighbor searching, mechanisms to reduce the memory access latency, and exploiting domain knowledge to omit unnecessary work. These improvements yield up to three orders of magnitude speedups, enabling simulations of 1.7 billion agents on a single server. Third, we present TeraAgent, a distributed simulation engine that allows scaling out the computation of one simulation to multiple servers. We identify and address server communication bottlenecks and implement solutions for serialization and delta encoding to accelerate and reduce data transfer. TeraAgent can simulate 500 billion agents and scales to 84096 CPU cores. BioDynaMo has been widely adopted, including a prize-winning radiotherapy simulation recognized as a top 10 breakthrough in physics in 2024.
Incremental IVF Index Maintenance for Streaming Vector Search
Mohoney, Jason, Pacaci, Anil, Chowdhury, Shihabur Rahman, Minhas, Umar Farooq, Pound, Jeffery, Renggli, Cedric, Reyhani, Nima, Ilyas, Ihab F., Rekatsinas, Theodoros, Venkataraman, Shivaram
The prevalence of vector similarity search in modern machine IVF indexes out-of-the-box do not have the notion of inserting learning applications and the continuously changing nature of data new vectors or deleting existing vectors once constructed. Indeed, processed by these applications necessitate efficient and effective the most common method used by practitioners today is to rebuild index maintenance techniques for vector search indexes. Designed the index from scratch to reflect any updates that have accumulated primarily for static workloads, existing vector search indexes degrade over time. However, depending on the scale of the vector in search quality and performance as the underlying data is dataset and the volume and frequency of updates, a full index rebuild updated unless costly index reconstruction is performed. To address can be prohibitively expensive. For example, it takes multiple this, we introduce Ada-IVF, an incremental indexing methodology days to rebuild an IVF index from scratch for billion-scale vector for Inverted File (IVF) indexes. Ada-IVF consists of 1) an adaptive datasets [21, 69], making it necessary to revisit how updates can maintenance policy that decides which index partitions are problematic be reflected. Devising such an update mechanism consists of readjusting for performance and should be repartitioned and 2) a local the partitioning of the high-dimensional space defined by re-clustering mechanism that determines how to repartition them.