Goto

Collaborating Authors

 edit distance


Clustering Billions of Reads for DNA Data Storage

Neural Information Processing Systems

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters.






GREED: A Neural Framework for Learning Graph Distance Functions

Neural Information Processing Systems

Similarity search in graph databases is one of the most fundamental operations in graph analytics. Among various distance functions, graph and subgraph edit distances (GED and SED respectively) are two of the most popular and expressive measures. Unfortunately, exact computations for both are NP-hard. To overcome this computational bottleneck, neural approaches to learn and predict edit distance in polynomial time have received much interest. While considerable progress has been made, there exist limitations that need to be addressed.


Evolutionary Architecture Search through Grammar-Based Sequence Alignment

Martín, Adri Gómez, Möller, Felix, McDonagh, Steven, Abella, Monica, Desco, Manuel, Crowley, Elliot J., Klein, Aaron, Ericsson, Linus

arXiv.org Artificial Intelligence

Neural architecture search (NAS) in expressive search spaces is a computationally hard problem, but it also holds the potential to automatically discover completely novel and performant architectures. To achieve this we need effective search algorithms that can identify powerful components and reuse them in new candidate architectures. In this paper, we introduce two adapted variants of the Smith-Waterman algorithm for local sequence alignment and use them to compute the edit distance in a grammar-based evolutionary architecture search. These algorithms enable us to efficiently calculate a distance metric for neural architectures and to generate a set of hybrid offspring from two parent models. This facilitates the deployment of crossover-based search heuristics, allows us to perform a thorough analysis on the architectural loss landscape, and track population diversity during search. We highlight how our method vastly improves computational complexity over previous work and enables us to efficiently compute shortest paths between architectures. When instantiating the crossover in evolutionary searches, we achieve competitive results, outperforming competing methods. Future work can build upon this new tool, discovering novel components that can be used more broadly across neural architecture design, and broadening its applications beyond NAS.


Feature Ranking in Credit-Risk with Qudit-Based Networks

Maragkopoulos, Georgios, Chavatzoglou, Lazaros, Mandilara, Aikaterini, Syvridis, Dimitris

arXiv.org Artificial Intelligence

In finance, predictive models must balance accuracy and interpretability, particularly in credit risk assessment, where model decisions carry material consequences. We present a quantum neural network (QNN) based on a single qudit, in which both data features and trainable parameters are co-encoded within a unified unitary evolution generated by the full Lie algebra. This design explores the entire Hilbert space while enabling interpretability through the magnitudes of the learned coefficients. We benchmark our model on a real-world, imbalanced credit-risk dataset from Taiwan. The proposed QNN consistently outperforms LR and reaches the results of random forest models in macro-F1 score while preserving a transparent correspondence between learned parameters and input feature importance. To quantify the interpretability of the proposed model, we introduce two complementary metrics: (i) the edit distance between the model's feature ranking and that of LR, and (ii) a feature-poisoning test where selected features are replaced with noise. Results indicate that the proposed quantum model achieves competitive performance while offering a tractable path toward interpretable quantum learning.


Clustering Billions of Reads for DNA Data Storage

Neural Information Processing Systems

Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters.