Query Processing
Few-Shot Preference Learning for Human-in-the-Loop RL
While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due their inability to capture human intent and policy exploitation. Preference based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries
Jaiswal, Shikhar, Krishnaswamy, Ravishankar, Garg, Ankit, Simhadri, Harsha Vardhan, Agrawal, Sheshansh
Since solving State-of-the-art algorithms for Approximate Nearest Neighbor Search the problem exactly requires an expensive exhaustive scan of the (ANNS) such as DiskANN, FAISS-IVF, and HNSW build data dependent database - which would be impractical for real-world indices that indices that offer substantially better accuracy and search span billions of objects - practical interactive search systems use efficiency over data-agnostic indices by overfitting to the index Approximate Nearest Neighbor Search (ANNS) algorithms with data distribution. When the query data is drawn from a different highly sub-linear query complexity [10, 18, 24, 30] to answer such distribution - e.g., when index represents image embeddings and queries. The quality of such ANN indices is often measured by query represents textual embeddings - such algorithms lose much k-recall@k which is the overlap between the top-results of the of this performance advantage. On a variety of datasets, for a fixed index search with the ground truth -nearest neighbors (-NNs) in recall target, latency is worse by an order of magnitude or more for the corpus for the query, averaged over a representative query set. Out-Of-Distribution (OOD) queries as compared to In-Distribution State-of-the-art algorithms for ANNS, such as graph-based indices (ID) queries. The question we address in this work is whether ANNS [16, 24, 30] which use data-dependent index construction, algorithms can be made efficient for OOD queries if the index construction achieve better query efficiency over prior data-agnostic methods is given access to a small sample set of these queries. We like LSH [6, 18] (see Section A.1 for more details). Such efficiency answer positively by presenting OOD-DiskANN, which uses a sparing enables these indices to serve queries with > 90% recall with a sample (1% of index set size) of OOD queries, and provides up to latency of a few milliseconds, required in interactive web scenarios.
DIAMETRICS
This paper introduces DIAMETRICS: a novel framework for end-to-end benchmarking and performance monitoring of query engines. DIAMETRICS consists of a number of components supporting tasks such as automated workload summarization, data anonymization, benchmark execution, monitoring, regression identification, and alerting. The architecture of DIAMETRICS is highly modular and supports multiple systems by abstracting their implementation details and relying on common canonical formats and pluggable software drivers. The end result is a powerful unified framework that is capable of supporting every aspect of benchmarking production systems and workloads. DIAMETRICS has been developed in Google and is being used to benchmark various internal query engines. In this paper, we give an overview of DIAMETRICS and discuss its design and implementation. Furthermore, we provide details about its deployment and example use cases. Given the variety of supported systems and use cases within Google, we argue that its core concepts can be used more widely to enable comparative end-to-end benchmarking in other industrial environments. The data management landscape has drastically changed over the last few years. The majority of database systems are no longer manually tuned and optimized for a specific application by well-versed administrators; instead, they are designed to support a variety of applications. To support all of these applications, a multitude of data models, storage formats, and query engines have transformed the data management landscape from standalone, specialized deployments to entire ecosystems.
On Efficient Approximate Queries over Machine Learning Models
Ding, Dujian, Amer-Yahia, Sihem, Lakshmanan, Laks VS
The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval
Fang, Xiang, Liu, Daizong, Zhou, Pan, Hu, Yuchong
As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain (``source domain''), but the domain of interest (``target domain'') only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations.
Understanding the Snowflake Query Optimizer
You are preeminent in your field, a singular talent. Using cleverness and craft you imagine factory designs that are elegant and streamlined. No inch of space wasted, not an inefficiency in sight. Imagine you want to design a megafactory - a factory that designs factories - encapsulating everything you know into one automated machine. You embark on an impossible journey.
Zebra: Deeply Integrating System-Level Provenance Search and Tracking for Efficient Attack Investigation
Yang, Xinyu, Liu, Haoyuan, Wang, Ziyu, Gao, Peng
However, a key limitation is that their DSLs can only search for events that are located within a close subgraph neighborhood. System auditing has emerged as a key approach for monitoring Thus, these approaches cannot efficiently reveal faraway system call events and investigating sophisticated attacks. Based on events on a long-range attack sequence, which is observed in many the collected audit logs, research has proposed to search for attack of the attacks these days due to their sophisticated, multi-stage patterns or track the causal dependencies of system events to reveal nature [55]. Tracking-based approaches assume causal dependencies the attack sequence. However, existing approaches either cannot between system entities that are involved in the same system reveal long-range attack sequences or suffer from the dependency event (e.g., a process reading a file) [45, 48, 52, 54]. Based on this explosion problem due to a lack of focus on attack-relevant parts, assumption, these approaches track the dependencies from a Point and thus are insufficient for investigating complex attacks. of Interest (POI) event (e.g., an alert event like the creation of a To bridge the gap, we propose Zebra, a system that synergistically suspicious file) and construct a system dependency graph, in which integrates attack pattern search and causal dependency tracking nodes represent system entities and edges represent system events.
DyREx: Dynamic Query Representation for Extractive Question Answering
Zaratiana, Urchade, Khbir, Niama El, Nรบรฑez, Dennis, Holat, Pierre, Tomeh, Nadi, Charnois, Thierry
Extractive question answering (ExQA) is an essential task for Natural Language Processing. The dominant approach to ExQA is one that represents the input sequence tokens (question and passage) with a pre-trained transformer, then uses two learned query vectors to compute distributions over the start and end answer span positions. These query vectors lack the context of the inputs, which can be a bottleneck for the model performance. To address this problem, we propose \textit{DyREx}, a generalization of the \textit{vanilla} approach where we dynamically compute query vectors given the input, using an attention mechanism through transformer layers. Empirical observations demonstrate that our approach consistently improves the performance over the standard one. The code and accompanying files for running the experiments are available at \url{https://github.com/urchade/DyReX}.
Query Expansion Using Contextual Clue Sampling with Language Models
Liu, Linqing, Li, Minghan, Lin, Jimmy, Riedel, Sebastian, Stenetorp, Pontus
Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because there is a well-known tendency of models to hallucinate incorrect or irrelevant contexts. To balance these two considerations, we propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context. Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR, while reducing the index size by more than 96%. For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
Zeroth-Order Negative Curvature Finding: Escaping Saddle Points without Gradients
Zhang, Hualin, Xiong, Huan, Gu, Bin
We consider escaping saddle points of nonconvex problems where only the function evaluations can be accessed. Although a variety of works have been proposed, the majority of them require either second or first-order information, and only a few of them have exploited zeroth-order methods, particularly the technique of negative curvature finding with zeroth-order methods which has been proven to be the most efficient method for escaping saddle points. To fill this gap, in this paper, we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, ZO-SPIDER and prove that these ZO algorithms can converge to $(\epsilon,\delta)$-approximate second-order stationary points with less query complexity compared with prior zeroth-order works for finding local minima.