Goto

Collaborating Authors

 Information Retrieval


Prediction of Horizontal Data Partitioning Through Query Execution Cost Estimation

arXiv.org Machine Learning

The excessively increased volume of data in modern data management systems demands an improved system performance, frequently provided by data distribution, system scalability and performance optimization techniques. Optimized horizontal data partitioning has a significant influence of distributed data management systems. An optimally partitioned schema found in the early phase of logical database design without loading of real data in the system and its adaptation to changes of business environment are very important for a successful implementation, system scalability and performance improvement. In this paper we present a novel approach for finding an optimal horizontally partitioned schema that manifests a minimal total execution cost of a given database workload. Our approach is based on a formal model that enables abstraction of the predicates in the workload queries, and are subsequently used to define all relational fragments. This approach has predictive features acquired by simulation of horizontal partitioning, without loading any data into the partitions, but instead, altering the statistics in the database catalogs. We define an optimization problem and employ a genetic algorithm (GA) to find an approximately optimal horizontally partitioned schema. The solutions to the optimization problem are evaluated using PostgreSQL's query optimizer. The initial experimental evaluation of our approach confirms its efficiency and correctness, and the numbers imply that the approach is effective in reducing the workload execution cost.


Join Query Optimization with Deep Reinforcement Learning Algorithms

arXiv.org Artificial Intelligence

Join query optimization is a complex task and is central to the performance of query processing. In fact it belongs to the class of NP-hard problems. Traditional query optimizers use dynamic programming (DP) methods combined with a set of rules and restrictions to avoid exhaustive enumeration of all possible join orders. However, DP methods are very resource intensive. Moreover, given simplifying assumptions of attribute independence, traditional query optimizers rely on erroneous cost estimations, which can lead to suboptimal query plans. Recent success of deep reinforcement learning (DRL) creates new opportunities for the field of query optimization to tackle the above-mentioned problems. In this paper, we present our DRL-based Fully Observed Optimizer (FOOP) which is a generic query optimization framework that enables plugging in different machine learning algorithms. The main idea of FOOP is to use a data-adaptive learning query optimizer that avoids exhaustive enumerations of join orders and is thus significantly faster than traditional approaches based on dynamic programming. In particular, we evaluate various DRL-algorithms and show that Proximal Policy Optimization significantly outperforms Q-learning based algorithms. Finally we demonstrate how ensemble learning techniques combined with DRL can further improve the query optimizer.


Releasing a new benchmark and data set for evaluating neural code search models

#artificialintelligence

A new benchmark to evaluate code search techniques. The benchmark includes the largest evaluation data set currently available for Java, consisting of a natural language query and code snippet pairs. This data set comprises 287 Stack Overflow question-and-answer pairs from the Stack Exchange Data Dump. Also included is a search corpus that contains more than 24,000 of the most popular Android repositories on GitHub (ranked by the number of stars) and is indexed using the more than 4.7 million method bodies parsed from these repositories. A score sheet on the evaluation data set, using two models from our recent work, is also included.


Large expert-curated database for benchmarking document similarity detection in biomedical literature search

#artificialintelligence

Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations.


Google's supremacy from information retrieval

#artificialintelligence

TITOLO: Google's supremacy from information retrieval, passing through Natural Language Modeling, to quantum computing ABSTRACT: The current era of data is dominated by a few big players. Among them, Google is playing the role of a significant technological innovator. Its story started from Information Retrieval algorithms is evolving in many fields of research, including natural language processing, user modeling, and quantum computers. The talk will discuss some of the algorithms proposed by google as state of the art in the last years for performing innovative tasks in machine learning, NLP, and more from a research point of view applied in everyday tasks. RELATORE: Marco Polignano BIO RELATORE: Marco Polignano is a PostDoc Research Fellow at the Department of Computer Science, University of Bari Aldo Moro, Italy, in the SWAP (Semantic Web Access and Personalization) research group.


A deep dive into BERT: How BERT launched a rocket into natural language understanding - Search Engine Land

#artificialintelligence

Editor's Note: This deep dive companion to our high-level FAQ piece is a 30-minute read so get comfortable! You'll learn the backstory and nuances of BERT's evolution, how the algorithm works to improve human language understanding for machines and what it means for SEO and the work we do every day. If you have been keeping an eye on Twitter SEO over the past week you'll have likely noticed an uptick in the number of gifs and images featuring the character Bert (and sometimes Ernie) from Sesame Street. This is because, last week Google announced an imminent algorithmic update would be rolling out, impacting 10% of queries in search results, and also affect featured snippet results in countries where they were present; which is not trivial. The update is named Google BERT (Hence the Sesame Street connection – and the gifs). Google describes BERT as the largest change to its search system since the company introduced RankBrain, almost five years ago, and probably one of the largest changes in search ever. The news of BERT's arrival and its impending impact has caused a stir in the SEO community, along with some confusion as to what BERT does, and what it means for the industry overall. With this in mind, let's take a look at what BERT is, BERT's background, the need for BERT and the challenges it aims to resolve, the current situation (i.e. The BERT backstory How search engines learn language Problems with language learning methods How BERT improves search engine language understanding What does BERT mean for SEO? BERT is a technologically ground-breaking natural language processing model/framework which has taken the machine learning world by storm since its release as an academic research paper. The research paper is entitled BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al, 2018). Following paper publication Google AI Research team announced BERT as an open source contribution. A year later, Google announced a Google BERT algorithmic update rolling out in production search. Google linked the BERT algorithmic update to the BERT research paper, emphasizing BERT's importance for contextual language understanding in content and queries, and therefore intent, particularly for conversational search. BERT is described as a pre-trained deep learning natural language framework that has given state-of-the-art results on a wide variety of natural language processing tasks. Whilst in the research stages, and prior to being added to production search systems, BERT achieved state-of-the-art results on 11 different natural language processing tasks. These natural language processing tasks include, amongst others, sentiment analysis, named entity determination, textual entailment (aka next sentence prediction), semantic role labeling, text classification and coreference resolution. BERT also helps with the disambiguation of words with multiple meanings known as polysemous words, in context.


Analysis of Evolutionary Behavior in Self-Learning Media Search Engines

arXiv.org Artificial Intelligence

The diversity of intrinsic qualities of multimedia entities tends to impede their effective retrieval. In a SelfLearning Search Engine architecture, the subtle nuances of human perceptions and deep knowledge are taught and captured through unsupervised reinforcement learning, where the degree of reinforcement may be suitably calibrated. Such architectural paradigm enables indexes to evolve naturally while accommodating the dynamic changes of user interests. It operates by continuously constructing indexes over time, while injecting progressive improvement in search performance. For search operations to be effective, convergence of index learning is of crucial importance to ensure efficiency and robustness. In this paper, we develop a Self-Learning Search Engine architecture based on reinforcement learning using a Markov Decision Process framework. The balance between exploration and exploitation is achieved through evolutionary exploration Strategies. The evolutionary index learning behavior is then studied and formulated using stochastic analysis. Experimental results are presented which corroborate the steady convergence of the index evolution mechanism. Index Term


Corpus-Level End-to-End Exploration for Interactive Systems

arXiv.org Artificial Intelligence

A core interest in building Artificial Intelligence (AI) agents is to let them interact with and assist humans. One example is Dynamic Search (DS), which models the process that a human works with a search engine agent to accomplish a complex and goal-oriented task. Early DS agents using Reinforcement Learning (RL) have only achieved limited success for (1) their lack of direct control over which documents to return and (2) the difficulty to recover from wrong search trajectories. In this paper, we present a novel corpus-level end-to-end exploration (CE3) method to address these issues. In our method, an entire text corpus is compressed into a global low-dimensional representation, which enables the agent to gain access to the full state and action spaces, including the under-explored areas. We also propose a new form of retrieval function, whose linear approximation allows end-to-end manipulation of documents. Experiments on the Text REtrieval Conference (TREC) Dynamic Domain (DD) Track show that CE3 outperforms the state-of-the-art DS systems.


Google Provides New Information on Latest Video Structured Data Features - Search Engine Journal

#artificialintelligence

Google has updated its video structured data help document with new information on some of the latest enhancements to video search results. The updated document has details on how to mark timestamps on YouTube videos, how to monitor performance in Search Console, and more screenshots to show the new structured data types in action. The Google video structured data doc got a big refresh – More screenshots! Here's a rundown of the new information that was added. Google may display timestamps alongside YouTube videos in search results which help searchers jump directly to a specific part of the video.


Google Search Console to Report on Data Related to Product Rich Results - Search Engine Journal

#artificialintelligence

Google is adding new data to Search Console giving retailers insight into the performance of product rich results in Google Search. The data can be found in a new Search Appearance in the Search Console performance report, which captures stats such as total clicks, impressions, average click-through rate, and average position. Websites that are eligible to appear in Google's product search results will find a new Search Appearance type called "Product results." "Website owners need to understand the impact of these rich results. The Google Search Console Performance report provides key metrics like clicks and impressions to help webmasters understand and optimize the performance of their website results on Google Search. These metrics can further be segmented by device, geography and queries."