Query Processing
Azure Synapse Analytics Serverless SQL Pool Guidelines
With the introduction of the serverless SQL pool as a part of Azure Synapse Analytics, Microsoft has provided a very cost-efficient and convenient way to drive value from data residing in lakes using simple T-SQL statements. It enables you to easily build logical analytical models by querying and joining data across heterogeneous sources making the development of complex data integration pipelines obsolete in many cases. To use it, you don't even need to explicitly provision it beforehand due to its serverless nature, it is per default part of an Azure Synapse Analytics workspace. All you have to do is query data in an on-demand fashion in which you get charged according to the amount of data your queries need to process. Yet, the flexibility provided in terms of how data can be stored and queried require you to stick to some conventions for properly applying all its features and functionalities. Otherwise, the once promising serverless query engine can end up causing lots of costs together with a poor performance.
GQE-PRF: Generative Query Expansion with Pseudo-Relevance Feedback
Huang, Minghui, Wang, Dong, Liu, Shuang, Ding, Meizhen
Query expansion with pseudo-relevance feedback (PRF) is a powerful approach to enhance the effectiveness in information retrieval. Recently, with the rapid advance of deep learning techniques, neural text generation has achieved promising success in many natural language tasks. To leverage the strength of text generation for information retrieval, in this article, we propose a novel approach which effectively integrates text generation models into PRF-based query expansion. In particular, our approach generates augmented query terms via neural text generation models conditioned on both the initial query and pseudo-relevance feedback. Moreover, in order to train the generative model, we adopt the conditional generative adversarial nets (CGANs) and propose the PRF-CGAN method in which both the generator and the discriminator are conditioned on the pseudo-relevance feedback. We evaluate the performance of our approach on information retrieval tasks using two benchmark datasets. The experimental results show that our approach achieves comparable performance or outperforms traditional query expansion methods on both the retrieval and reranking tasks.
A Survey on Deep Reinforcement Learning for Data Processing and Analytics
Cai, Qingpeng, Cui, Can, Xiong, Yiyuan, Wang, Wei, Xie, Zhongle, Zhang, Meihui
In the age of big data, data processing and analytics are fundamental, ubiquitous, and crucial to many organizations which undertake a digitalization journey to improve and transform their businesses and operations. Data analytics typically entails other key operations such as data acquisition, data cleansing, data integration, modeling, etc., before insights could be extracted. Big data can unleash significant value creation across many sectors such as health care and retail[56]. However, the complexity of data (e.g., high volume, high velocity, and high variety) presents many challenges in data analytics and hence renders the difficulty in drawing meaningful insights. To tackle the challenge and facilitate the data processing and analytics efficiently and effectively, a lot of algorithms and techniques have been designed and numerous learning systems have also been developed by researchers and practitioners such as Spark MLlib[63], and Rafiki[104]. To support fast data processing and accurate data analytics, a huge number of algorithms rely on rules that are developed based on human knowledge and experience. For example, Shortest-job-first is a scheduling algorithm that chooses the job with the smallest execution time for the next execution. However, without fully exploiting characteristics of the workload, it can achieve inferior performance compared to DRL-based scheduling algorithm [58].
Adaptive Multi-Resolution Attention with Linear Complexity
Zhang, Yao, Ma, Yunpu, Seidl, Thomas, Tresp, Volker
Transformers have improved the state-of-the-art across numerous tasks in sequence modeling. Besides the quadratic computational and memory complexity w.r.t the sequence length, the self-attention mechanism only processes information at the same scale, i.e., all attention heads are in the same resolution, resulting in the limited power of the Transformer. To remedy this, we propose a novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA for short), which scales linearly to sequence length in terms of time and space. Specifically, we leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion. Moreover, to capture the potential relations between query representation and clues of different attention granularities, we leave the decision of which resolution of attention to use to query, which further improves the model's capacity compared to vanilla Transformer. In an effort to reduce complexity, we adopt kernel attention without degrading the performance. Extensive experiments on several benchmarks demonstrate the effectiveness and efficiency of our model by achieving a state-of-the-art performance-efficiency-memory trade-off. To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
Using Query Expansion in Manifold Ranking for Query-Oriented Multi-Document Summarization
Jia, Quanye, Liu, Rui, Lin, Jianying
Manifold ranking has been successfully applied in query-oriented multi-document summarization. It not only makes use of the relationships among the sentences, but also the relationships between the given query and the sentences. However, the information of original query is often insufficient. So we present a query expansion method, which is combined in the manifold ranking to resolve this problem. Our method not only utilizes the information of the query term itself and the knowledge base WordNet to expand it by synonyms, but also uses the information of the document set itself to expand the query in various ways (mean expansion, variance expansion and TextRank expansion). Compared with the previous query expansion methods, our method combines multiple query expansion methods to better represent query information, and at the same time, it makes a useful attempt on manifold ranking. In addition, we use the degree of word overlap and the proximity between words to calculate the similarity between sentences. We performed experiments on the datasets of DUC 2006 and DUC2007, and the evaluation results show that the proposed query expansion method can significantly improve the system performance and make our system comparable to the state-of-the-art systems.
Dremio launches data lake service running on AWS cloud
All the sessions from Transform 2021 are available on-demand now. Dremio today launched a cloud service that creates a data lake based on an in-memory SQL engine that launches queries against data stored in an object-based storage system. The goal is to make it easier for organizations to take advantage of the data lake, dubbed Dremio Cloud, without having to employ an internal IT team to manage it, said Tomer Shiran, chief product officer for Dremio. An organization can now start accessing Dremio Cloud in as little as five minutes, he said. Based on Dremio's existing SQL Lakehouse platform, the Dremio Cloud service runs on the Amazon Web Services (AWS) public cloud.
Multiple Query Optimization using a Hybrid Approach of Classical and Quantum Computing
Fankhauser, Tobias, Solèr, Marc E., Füchslin, Rudolf M., Stockinger, Kurt
Quantum computing promises to solve difficult optimization problems in chemistry, physics and mathematics more efficiently than classical computers, but requires fault-tolerant quantum computers with millions of qubits. To overcome errors introduced by today's quantum computers, hybrid algorithms combining classical and quantum computers are used. In this paper we tackle the multiple query optimization problem (MQO) which is an important NP-hard problem in the area of data-intensive problems. We propose a novel hybrid classical-quantum algorithm to solve the MQO on a gate-based quantum computer. We perform a detailed experimental evaluation of our algorithm and compare its performance against a competing approach that employs a quantum annealer -- another type of quantum computer. Our experimental results demonstrate that our algorithm currently can only handle small problem sizes due to the limited number of qubits available on a gate-based quantum computer compared to a quantum computer based on quantum annealing. However, our algorithm shows a qubit efficiency of close to 99% which is almost a factor of 2 higher compared to the state of the art implementation. Finally, we analyze how our algorithm scales with larger problem sizes and conclude that our approach shows promising results for near-term quantum computers.
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Lei, Jie, Berg, Tamara L., Bansal, Mohit
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code is publicly available at https://github.com/jayleicn/moment_detr
A Survey of Knowledge Graph Embedding and Their Applications
Choudhary, Shivani, Luthra, Tarun, Mittal, Ashima, Singh, Rajat
Knowledge Graph embedding provides a versatile technique for representing knowledge. These techniques can be used in a variety of applications such as completion of knowledge graph to predict missing information, recommender systems, question answering, query expansion, etc. The information embedded in Knowledge graph though being structured is challenging to consume in a real-world application. Knowledge graph embedding enables the real-world application to consume information to improve performance. Knowledge graph embedding is an active research area. Most of the embedding methods focus on structure-based information. Recent research has extended the boundary to include text-based information and image-based information in entity embedding. Efforts have been made to enhance the representation with context information. This paper introduces growth in the field of KG embedding from simple translation-based models to enrichment-based models. This paper includes the utility of the Knowledge graph in real-world applications.
Query Embedding on Hyper-relational Knowledge Graphs
Alivanistos, Dimitrios, Berrendorf, Max, Cochez, Michael, Galkin, Mikhail
Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs allowing to tackle this new type of complex queries. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.