Query Processing
A General Characterization of the Statistical Query Complexity
Statistical query (SQ) algorithms are algorithms that have access to an {\em SQ oracle} for the input distribution $D$ instead of i.i.d.~samples from $D$. Given a query function $\phi:X \rightarrow [-1,1]$, the oracle returns an estimate of ${\bf E}_{x\sim D}[\phi(x)]$ within some tolerance $\tau_\phi$ that roughly corresponds to the number of samples. In this work we demonstrate that the complexity of solving general problems over distributions using SQ algorithms can be captured by a relatively simple notion of statistical dimension that we introduce. SQ algorithms capture a broad spectrum of algorithmic approaches used in theory and practice, most notably convex optimization techniques. Hence our statistical dimension allows us to investigate the power of a variety of algorithmic approaches by analyzing a single linear-algebraic parameter. Such characterizations have been investigated in learning theory over the past 20 years, but prior characterizations are restricted to the much simpler setting of classification problems relative to a fixed distribution on the domain (Blum et al., 1994; Bshouty and Feldman, 2002; Yang, 2001; Simon, 2007; Feldman, 2012; Szorenyi, 2009). Our characterization is also the first to precisely characterize the necessary tolerance of queries. We give applications of our techniques to two open problems in learning theory and to algorithms that are subject to memory and communication constraints.
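As a concrete illustration of the query model (our own sketch, not part of the paper's results), the following simulates an SQ oracle from held samples and shows an algorithm that interacts only through bounded queries; the distribution, query function, and tolerance are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sq_oracle(samples, tau):
    """Simulate an SQ oracle: answer E_{x~D}[phi(x)] from held samples,
    then perturb the answer by at most the tolerance tau."""
    def oracle(phi):
        true_value = np.mean([phi(x) for x in samples])
        # Any answer within +/- tau of the expectation is a valid response.
        return true_value + rng.uniform(-tau, tau)
    return oracle

# Hypothetical input distribution D: a Gaussian over the reals.
samples = rng.normal(loc=0.3, scale=1.0, size=10_000)
oracle = make_sq_oracle(samples, tau=0.01)

# An SQ algorithm never touches the samples directly; it only asks
# bounded queries phi: X -> [-1, 1], e.g. a clipped first moment.
estimate_of_mean = oracle(lambda x: max(-1.0, min(1.0, x)))
print(f"estimated (clipped) mean of D: {estimate_of_mean:.3f}")
```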
Introduction to Formal Concept Analysis and Its Applications in Information Retrieval and Related Fields
This paper is a tutorial on Formal Concept Analysis (FCA) and its applications. FCA is an applied branch of Lattice Theory, a mathematical discipline that enables the formalisation of concepts as basic units of human thinking and the analysis of data in object-attribute form. Having originated in the early 1980s, it has over the last three decades become a popular human-centred tool for knowledge representation and data analysis, with numerous applications. Since the tutorial was specially prepared for RuSSIR 2014, the FCA topics covered include Information Retrieval with a focus on visualisation aspects, Machine Learning, Data Mining and Knowledge Discovery, Text Mining, and several others.
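As a brief illustration of the object-attribute view that FCA formalises (the toy context below is invented for exposition), the following sketch enumerates all formal concepts, i.e. pairs of object sets and attribute sets that are closed under the derivation operators.

```python
from itertools import combinations

# Hypothetical formal context: objects (documents) x attributes (terms).
context = {
    "doc1": {"query", "index"},
    "doc2": {"query", "rank"},
    "doc3": {"query", "index", "rank"},
}
attributes = sorted(set().union(*context.values()))

def extent(attrs):
    """Objects having all the given attributes."""
    return frozenset(o for o, a in context.items() if set(attrs) <= a)

def intent(objs):
    """Attributes shared by all the given objects."""
    if not objs:
        return frozenset(attributes)
    return frozenset.intersection(*(frozenset(context[o]) for o in objs))

# Naive enumeration: a pair (A, B) is a formal concept iff A = extent(B)
# and B = intent(A).  Fine for toy contexts; real FCA tools use smarter algorithms.
concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(attributes, r):
        objs = extent(attrs)
        concepts.add((objs, intent(objs)))

for objs, shared in sorted(concepts, key=lambda c: -len(c[0])):
    print(set(objs) or "{}", "<->", set(shared) or "{}")
```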
A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution
Mazumdar, Arya (University of Massachusetts Amherst) | Saha, Barna (University of Massachusetts Amherst)
Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Due to the inherent ambiguity of data representation and poor data quality, ER is a challenging task for any automated process. As a remedy, human-powered ER via crowdsourcing has become popular in recent years. Using the crowd to answer queries is costly and time consuming, and crowd answers can often be faulty. Therefore, crowd-based ER methods aim to minimize human participation without sacrificing quality, and actively use a computer-generated similarity matrix. While some of these methods perform well in practice, no theoretical analysis exists for them, and their worst-case guarantees do not reflect the experimental findings. This creates a disparity in the understanding of the popular heuristics for this problem. In this paper, we make the first attempt to close this gap. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based ER, and we justify experimental observations with our analysis and information-theoretic lower bounds.
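To make the setting concrete (a toy sketch of our own, not the paper's analysis), one family of heuristics in this space asks the crowd about record pairs in decreasing order of machine-computed similarity and skips any pair whose answer is already implied by transitivity of earlier answers; the crowd is assumed perfectly reliable here.

```python
# Hypothetical records, hidden ground-truth entities, and a similarity matrix.
records = ["r1", "r2", "r3", "r4"]
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}   # hidden from the algorithm
similarity = {("r1", "r2"): 0.9, ("r3", "r4"): 0.8,
              ("r1", "r3"): 0.3, ("r1", "r4"): 0.2,
              ("r2", "r3"): 0.25, ("r2", "r4"): 0.15}

def ask_crowd(a, b):
    """Stand-in for a crowd query; assumed error-free in this sketch."""
    return truth[a] == truth[b]

# Union-find over records for positive ("same entity") answers.
parent = {r: r for r in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Negative answers recorded between current cluster representatives
# (a fuller implementation would remap these pairs after unions).
different = set()

queries = 0
for a, b in sorted(similarity, key=similarity.get, reverse=True):
    ra, rb = find(a), find(b)
    if ra == rb or (ra, rb) in different or (rb, ra) in different:
        continue            # answer already implied by transitivity
    queries += 1
    if ask_crowd(a, b):
        parent[ra] = rb
    else:
        different.add((ra, rb))

clusters = {}
for r in records:
    clusters.setdefault(find(r), []).append(r)
print(f"{queries} crowd queries ->", list(clusters.values()))
```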
Query Complexity of Tournament Solutions
Dey, Palash (Indian Institute of Science, Bangalore)
A directed graph with exactly one edge between every pair of vertices is called a tournament. Finding the “best” set of vertices of a tournament is a well-studied problem in social choice theory. A tournament solution takes a tournament as input and outputs a subset of its vertices. In many applications, however, for example choosing the best set of drugs from a given set of drugs, the edges of the tournament are given only implicitly and learning the orientation of an edge is costly. In such scenarios, we would like to find the best set of vertices (according to some tournament solution) by “querying” as few edges as possible. In this paper we study precisely this problem for commonly used tournament solutions: given oracle access to the edges of a tournament T, find f(T) by querying as few edges as possible, for a tournament solution f. We first show that the set of Condorcet non-losers in a tournament can be found by querying only 2n−⌊log n⌋−2 edges, and that this is tight: every algorithm for finding the set of Condorcet non-losers must query at least 2n−⌊log n⌋−2 edges in the worst case, where n is the number of vertices in the input tournament. We then move on to other popular tournament solutions and show that any algorithm for finding the Copeland set, the Slater set, the Markov set, the bipartisan set, the uncovered set, the Banks set, or the top cycle must query Ω(n²) edges in the worst case. On the positive side, we circumvent these strong lower bounds by proving that, if the size of the top cycle of the input tournament is at most k, then all of the above tournament solutions can be found by querying only O(nk + n log n / log(1−1/k)) edges.
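One way to see where a bound of roughly 2n−⌊log n⌋−2 can come from (our illustrative sketch, not necessarily the paper's algorithm): run a single-elimination bracket in which the loser of each queried edge advances, then verify that the sole survivor also loses to every vertex it has not yet met; the Condorcet non-losers are all vertices except a verified Condorcet loser. For n a power of two this uses (n−1) + (n−1−log₂ n) queries.

```python
import math

def condorcet_non_losers(vertices, beats, counter):
    """beats(a, b) -> True iff the edge is oriented from a to b (a defeats b).
    counter is a one-element list used to count oracle queries."""
    def query(a, b):
        counter[0] += 1
        return beats(a, b)

    # Knockout rounds: the defeated vertex advances, so the final survivor
    # lost every match it played.
    alive, faced = list(vertices), {v: set() for v in vertices}
    while len(alive) > 1:
        nxt = []
        for i in range(0, len(alive) - 1, 2):
            a, b = alive[i], alive[i + 1]
            faced[a].add(b); faced[b].add(a)
            nxt.append(b if query(a, b) else a)   # keep the defeated vertex
        if len(alive) % 2 == 1:
            nxt.append(alive[-1])                  # bye
        alive = nxt
    candidate = alive[0]

    # Verification: the candidate is a Condorcet loser iff it also loses
    # to every vertex it has not been compared with yet.
    is_loser = all(query(v, candidate)
                   for v in vertices if v != candidate and v not in faced[candidate])
    return [v for v in vertices if not (is_loser and v == candidate)]

# Hypothetical tournament on 8 vertices: i defeats j iff i < j, so vertex 7
# is the unique Condorcet loser.
n, counter = 8, [0]
result = condorcet_non_losers(range(n), lambda a, b: a < b, counter)
print(result, "queries:", counter[0], "bound:", 2 * n - math.floor(math.log2(n)) - 2)
```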
Demand-Driven Incremental Object Queries
Liu, Yanhong A., Brandvein, Jon, Stoller, Scott D., Lin, Bo
Object queries are essential in information seeking and decision making in vast areas of applications. However, a query may involve complex conditions on objects and sets, which can be arbitrarily nested and aliased. The objects and sets involved, as well as the demand---the given parameter values of interest---can change arbitrarily. How can object queries be implemented efficiently under all possible updates, and with complexity guarantees? This paper describes an automatic method. The method allows powerful queries to be written completely declaratively. It transforms the demand as well as all objects and sets into relations. Most importantly, it defines invariants not only for the query results but also for all auxiliary values about the objects and sets involved, including those for propagating demand, and it incrementally maintains all of them. Implementation and experiments with problems from a variety of application areas, including distributed algorithms and probabilistic queries, confirm the analyzed complexities, trade-offs, and significant improvements over prior work.
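For flavour (a toy illustration of demand-driven incremental maintenance, not the paper's transformation), the sketch below keeps the result of a simple object query---the set of orders belonging to a demanded customer---up to date under additions by maintaining an auxiliary map as an invariant instead of re-running the query.

```python
from collections import defaultdict

class OrderStore:
    """Maintains result(c) = {orders of customer c} for every demanded c,
    as an invariant that is updated incrementally on each change."""

    def __init__(self):
        self.orders_by_customer = defaultdict(set)   # auxiliary map (invariant)
        self.demand = set()                          # parameter values asked about
        self.results = {}                            # query results kept up to date

    def add_order(self, order_id, customer):
        # O(1) incremental maintenance instead of re-scanning all orders.
        self.orders_by_customer[customer].add(order_id)
        if customer in self.demand:
            self.results[customer].add(order_id)

    def query(self, customer):
        # The first query for a customer registers demand and builds the result
        # once; later queries are O(1) lookups.
        if customer not in self.demand:
            self.demand.add(customer)
            self.results[customer] = set(self.orders_by_customer[customer])
        return self.results[customer]

store = OrderStore()
store.add_order("o1", "alice")
store.add_order("o2", "bob")
print(store.query("alice"))       # {'o1'}
store.add_order("o3", "alice")    # result for the demanded parameter is maintained
print(store.query("alice"))       # {'o1', 'o3'}
```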
Hadoop vs Teradata - PHP Hadoop Articles
Hadoop, therefore, doesn't have what it takes to be considered a data warehouse. Hadoop vendors will no doubt be working harder to improve the security of data access, restrict permissions, and address a broader range of data protection issues. The two major goals of this effort are to increase performance and to provide a rich set of SQL features such as analytic functions, query optimization, and standard data types including timestamps. A growing community of Hadoop vendors offers a byzantine selection of solutions. The opportunity is to monetise huge volumes of data using tools that weren't previously available.
Microsoft announces general availability of Azure SQL Data Warehouse - MSPoweruser
Microsoft today announced the general availability of Azure SQL Data Warehouse, an elastic data warehouse as a service with enterprise-class features. It is fully managed, can be provisioned in minutes, and can scale up to 60 times larger in seconds. With Azure SQL Data Warehouse, storage and compute scale independently: you can dynamically deploy, grow, shrink, and even pause compute, taking advantage of best-in-class price/performance. SQL Data Warehouse also uses the power and familiarity of T-SQL to let you easily integrate query results across relational data in your data warehouse and non-relational data in Azure blob storage.
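As a hedged illustration of that last point (all server, table, and column names below are hypothetical, and a PolyBase external table over blob storage is assumed to already exist), a client can join warehouse tables with blob-backed data in a single T-SQL statement:

```python
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Hypothetical connection string; replace server, database, and credentials.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myserver.database.windows.net;Database=mydw;Uid=myuser;Pwd=mypassword"
)

# dbo.FactSales is a regular warehouse table; ext.Clickstream is assumed to be
# an external table defined over files in Azure blob storage.
sql = """
SELECT s.customer_id, SUM(c.clicks) AS total_clicks, SUM(s.amount) AS total_sales
FROM dbo.FactSales AS s
JOIN ext.Clickstream AS c ON c.customer_id = s.customer_id
GROUP BY s.customer_id;
"""

for row in conn.cursor().execute(sql):
    print(row.customer_id, row.total_clicks, row.total_sales)
```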
Dissociation and Propagation for Approximate Lifted Inference with Standard Relational Database Management Systems
Gatterbauer, Wolfgang, Suciu, Dan
Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
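To illustrate the flavour of this approach (a toy sketch under our own assumptions, not the paper's plan-enumeration algorithm), consider the Boolean query Q :- R(x), S(x,y), T(y) over a tuple-independent probabilistic database. Two extensional plans, each corresponding to a dissociation of one relation, both give upper bounds on the true probability, and their minimum is taken as the approximation:

```python
from collections import defaultdict

# Hypothetical tuple-independent database for the Boolean query Q :- R(x), S(x,y), T(y).
R = {"a": 0.8, "b": 0.5}                                   # P(R(x))
T = {"1": 0.6, "2": 0.9}                                   # P(T(y))
S = {("a", "1"): 1.0, ("a", "2"): 0.7, ("b", "1"): 0.4}    # P(S(x,y))

def independent_or(probs):
    """P(at least one event fires), treating the events as independent."""
    none = 1.0
    for p in probs:
        none *= (1.0 - p)
    return 1.0 - none

def plan_group_by_x():
    # Join S with T per y, project onto x, then join with R.  Extensional
    # evaluation implicitly treats the copies of each T tuple across different
    # x values as independent (a dissociation of T), which can only increase
    # the probability: an upper bound on the true P(Q).
    by_x = defaultdict(list)
    for (x, y), p in S.items():
        by_x[x].append(p * T[y])
    return independent_or(R[x] * independent_or(ps) for x, ps in by_x.items())

def plan_group_by_y():
    # The symmetric plan dissociates R instead: another upper bound.
    by_y = defaultdict(list)
    for (x, y), p in S.items():
        by_y[y].append(p * R[x])
    return independent_or(T[y] * independent_or(ps) for y, ps in by_y.items())

bounds = [plan_group_by_x(), plan_group_by_y()]
print("per-plan upper bounds:", bounds)
print("approximate P(Q) = min over plans:", min(bounds))
```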
Drill Data with Apache Drill
Apache Drill is a low-latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data. Inspired by Google's Dremel, Drill is designed to scale to several thousand nodes and query petabytes of data at the interactive speeds that BI/analytics environments require. Apache Drill includes a distributed execution environment, purpose-built for large-scale data processing. At the core of Apache Drill is the "Drillbit" service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client. When a Drillbit runs on each data node in a cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes.
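For instance (a minimal sketch assuming a local Drillbit listening on its default web port 8047; the file path is hypothetical), a client can submit SQL to Drill's REST API and query a raw JSON file in place:

```python
import requests

# Drill can query self-describing files directly; the path below is hypothetical.
sql = "SELECT t.user_id, t.event FROM dfs.`/data/events/clicks.json` AS t LIMIT 10"

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": sql},
    timeout=30,
)
resp.raise_for_status()

for row in resp.json().get("rows", []):
    print(row)
```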