Goto

Collaborating Authors

 Information Retrieval



Bing is the anti-AI search engine you should be using

PCWorld

PCWorld argues that Bing serves as a superior alternative to AI-heavy search engines by prioritizing human-authored content over automated summaries. AI search engines like Google's AI Mode often hide original sources and provide misleading information, with traffic to publishers dropping significantly.



An algorithm for L1 nearest neighbor search via monotonic embedding

Neural Information Processing Systems

Fast algorithms for nearest neighbor (NN) search have in large part focused on 2 distance. Here we develop an approach for 1 distance that begins with an explicit and exactly distance-preserving embedding of the points into 22. We show how this can efficiently be combined with random-projection based methods for 2 NN search, such as locality-sensitive hashing (LSH) or random projection trees. We rigorously establish the correctness of the methodology and show by experimentation using LSH that it is competitive in practice with available alternatives.


Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits

Neural Information Processing Systems

There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of (approximately) navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to the given distance function. The complete graph is obviously navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low-dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree $O(\sqrt{n \log n })$ for any set of $n$ points, in any dimension, for any distance function. We compliment this result with a nearly matching lower bound: even under the Euclidean metric in $O(\log n)$ dimensions, a random point set has no navigable graph with average degree $O(n^{\alpha})$ for any $\alpha < 1/2$. Our lower bound relies on sharp anti-concentration bounds for binomial random variables, which we use to show that the {near-neighborhoods} of a set of random points do not overlap significantly, forcing any navigable graph to have many edges.


TSDS: Data Selection for Task-Specific Model Finetuning

Neural Information Processing Systems

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data.We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques.We evaluate our method on data selection for both continued pretraining and instruction tuning of language models.We show that instruction tuning using data selected by our method with a 1\% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.


An ensemble diversity approach to supervised binary hashing

Neural Information Processing Systems

Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval.


Query Complexity of Bayesian Private Learning

Neural Information Processing Systems

We study the query complexity of Bayesian Private Learning: a learner wishes to locate a random target within an interval by submitting queries, in the presence of an adversary who observes all of her queries but not the responses. How many queries are necessary and sufficient in order for the learner to accurately estimate the target, while simultaneously concealing the target from the adversary? Our main result is a query complexity lower bound that is tight up to the first order. We show that if the learner wants to estimate the target within an error of $\epsilon$, while ensuring that no adversary estimator can achieve a constant additive error with probability greater than $1/L$, then the query complexity is on the order of $L\log(1/\epsilon)$ as $\epsilon \to 0$. Our result demonstrates that increased privacy, as captured by $L$, comes at the expense of a \emph{multiplicative} increase in query complexity. The proof builds on Fano's inequality and properties of certain proportional-sampling estimators.


Meta-Learning MCMC Proposals

Neural Information Processing Systems

Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progresses in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.


Self-Retrieval: End-to-End InformationRetrieval withOneLargeLanguageModel

Neural Information Processing Systems

The rise of large language models (LLMs) has significantly transformed both the construction and application of information retrieval (IR) systems. However, current interactions between IR systems and LLMs remain limited, with LLMs merely serving as part of components within IR systems, and IR systems being constructed independently of LLMs. This separated architecture restricts knowledge sharing and deep collaboration between them. In this paper, we introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval architecture.