deletion
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- Europe > Germany > North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.04)
- (2 more...)
Graph Edit Distance with General Costs Using Neural Set Divergence
Graph Edit Distance (GED) measures the (dis-)similarity between two given graphs in terms of the minimum-cost edit sequence, which transforms one graph to the other.GED is related to other notions of graph similarity, such as graph and subgraph isomorphism, maximum common subgraph, etc. However, the computation of exact GED is NP-Hard, which has recently motivated the design of neural models for GED estimation.However, they do not explicitly account for edit operations with different costs. In response, we propose $\texttt{GraphEdX}$, a neural GED estimator that can work with general costs specified for the four edit operations, viz., edge deletion, edge addition, node deletion, and node addition.We first present GED as a quadratic assignment problem (QAP) that incorporates these four costs.Then, we represent each graph as a set of node and edge embeddings and use them to design a family of neural set divergence surrogates. We replace the QAP terms corresponding to each operation with their surrogates. Computing such neural set divergence requires aligning nodes and edges of the two graphs.We learn these alignments using a Gumbel-Sinkhorn permutation generator, additionally ensuring that the node and edge alignments are consistent with each other. Moreover, these alignments are cognizant of both the presence and absence of edges between node pairs.Through extensive experiments on several datasets, along with a variety of edit cost settings, we show that $\texttt{GraphEdX}$ consistently outperforms state-of-the-art methods and heuristics in terms of prediction error.
RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion
Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
Edit Distance Robust Watermarks via Indexing Pseudorandom Codes
Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn, & Zamir (2023) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both (a) and (b) when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our codes have the additional benefit of relying on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.
Faster Query Times for Fully Dynamic k -Center Clustering with Outliers
Given a point set $P\subseteq M$ from a metric space $(M,d)$ and numbers $k, z \in N$, the *metric $k$-center problem with $z$ outliers* is to find a set $C^\ast\subseteq P$ of $k$ points such that the maximum distance of all but at most $z$ outlier points of $P$ to their nearest center in ${C}^\ast$ is minimized. We consider this problem in the fully dynamic model, i.e., under insertions and deletions of points, for the case that the metric space has a bounded doubling dimension $dim$. We utilize a hierarchical data structure to maintain the points and their neighborhoods, which enables us to efficiently find the clusters. In particular, our data structure can be queried at any time to generate a $(3+\varepsilon)$-approximate solution for input values of $k$ and $z$ in worst-case query time $\varepsilon^{-O(dim)}k \log{n} \log\log{\Delta}$, where $\Delta$ is the ratio between the maximum and minimum distance between two points in $P$. Moreover, it allows insertion/deletion of a point in worst-case update time $\varepsilon^{-O(dim)}\log{n}\log{\Delta}$. Our result achieves a significantly faster query time with respect to $k$ and $z$ than the current state-of-the-art by Pellizzoni, Pietracaprina, and Pucci, which uses $\varepsilon^{-O(dim)}(k+z)^2\log{\Delta}$ query time to obtain a $(3+\varepsilon)$-approximation.
How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?
Yamashita, Tomohiro, Amagata, Daichi, Matsui, Yusuke
Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation. Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest. However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established. This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases. Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically. The performance is assessed in terms of accuracy, query speed, and other relevant metrics. Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy.
Dynamic Algorithm for Explainable k-medians Clustering under lp Norm
Makarychev, Konstantin, Papanikolaou, Ilias, Shan, Liren
We study the problem of explainable k-medians clustering introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (2020). In this problem, the goal is to construct a threshold decision tree that partitions data into k clusters while minimizing the k-medians objective. These trees are interpretable because each internal node makes a simple decision by thresholding a single feature, allowing users to trace and understand how each point is assigned to a cluster. We present the first algorithm for explainable k-medians under lp norm for every finite p >= 1. Our algorithm achieves an O(p(log k)^{1 + 1/p - 1/p^2}) approximation to the optimal k-medians cost for any p >= 1. Previously, algorithms were known only for p = 1 and p = 2. For p = 2, our algorithm improves upon the existing bound of O(log^{3/2}k), and for p = 1, it matches the tight bound of log k + O(1) up to a multiplicative O(log log k) factor. We show how to implement our algorithm in a dynamic setting. The dynamic algorithm maintains an explainable clustering under a sequence of insertions and deletions, with amortized update time O(d log^3 k) and O(log k) recourse, making it suitable for large-scale and evolving datasets.
- Workflow (0.46)
- Research Report (0.40)
Forgetting by Pruning: Data Deletion in Join Cardinality Estimation
He, Chaowei, Liu, Yuanjun, Ma, Qingzhi, Ren, Shenyuan, Luo, Xizhao, Zhao, Lei, Liu, An
Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning, faces three critical challenges in learned CE models: attribute-level sensitivity, inter-table propagation and domain disappearance leading to severe overestimation in multi-way joins. We propose Cardinality Estimation Pruning (CEP), the first unlearning framework specifically designed for multi-table learned CE systems. CEP introduces Distribution Sensitivity Pruning, which constructs semi-join deletion results and computes sensitivity scores to guide parameter pruning, and Domain Pruning, which removes support for value domains entirely eliminated by deletion. We evaluate CEP on state-of-the-art architectures NeuroCard and FACE across IMDB and TPC-H datasets. Results demonstrate CEP consistently achieves the lowest Q-error in multi-table scenarios, particularly under high deletion ratios, often outperforming full retraining. Furthermore, CEP significantly reduces convergence iterations, incurring negligible computational overhead of 0.3%-2.5% of fine-tuning time.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > California (0.04)
- Africa > Guinea > Kankan Region > Kankan Prefecture > Kankan (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
- Information Technology > Databases (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Security & Privacy (0.93)
A Diffusion Model to Shrink Proteins While Maintaining Their Function
Baron, Ethan, Amin, Alan N., Weitzman, Ruben, Marks, Debora, Wilson, Andrew Gordon
Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
- Research Report (0.82)
- Instructional Material (0.54)