Discovering Data Structures: Nearest Neighbor Search and Beyond

Salemohamed, Omar, Charlin, Laurent, Garg, Shivam, Sharan, Vatsal, Valiant, Gregory

arXiv.org Artificial Intelligence 

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, elements of locality-sensitive hashing emerge. Additionally, the model learns useful representations of high-dimensional data and exploits them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could be a powerful discovery tool for new problems. Can deep learning models be trained to discover data structures from scratch? There are several motivations for this question. Deep learning models are increasingly performing tasks once considered exclusive to humans, from image recognition and mastering the game of Go to engaging in natural language conversations. Designing data structures and algorithms, along with solving complex math problems, are particularly challenging tasks. They require searching through a vast combinatorial space with a difficult to define structure. It is therefore natural to ask what it would take for deep learning models to solve such problems. There are already promising signs: these models have discovered fast matrix-multiplication algorithms (Fawzi et al., 2022), solved SA T problems (Selsam et al., 2018), and learned optimization algorithms for various learning tasks (Garg et al., 2022; Aky urek et al., 2022; Fu et al., 2023; V on Oswald et al., 2023). In this work, we investigate the problem of data structure discovery, with a focus on nearest neighbor search. The second motivation is practical. Data structures are ubiquitous objects that enable efficient querying. Traditionally, they have been designed to be worst-case optimal and therefore agnostic to the underlying data and query distributions.