 Information Retrieval


A Survey of Data Mining Techniques for Social Media Analysis

arXiv.org Artificial Intelligence

Social networks have gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook, LinkedIn and Google+ through the internet and Web 2.0 technologies has become more affordable. People are becoming more interested in, and reliant on, social networks for information, news and the opinions of other users on diverse subject matters. This heavy reliance causes social network sites to generate massive data characterised by three computational issues, namely size, noise and dynamism. These issues often make social network data too complex to analyse manually, which makes computational analysis necessary. Data mining provides a wide range of techniques for detecting useful knowledge, such as trends, patterns and rules, in massive datasets [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis and data interpretation processes in the course of data analysis. This survey discusses the data mining techniques used to mine diverse aspects of social networks over the decades, from historical techniques to up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in Table 1, together with the tools employed and the names of their authors.


Natural Language Access to Enterprise Data

AI Magazine

This paper describes USI Answers, a natural language question answering system for enterprise data. We report on the progress towards the goal of offering easy access to enterprise data to a large number of business users, most of whom are not familiar with the specific syntax or semantics of the underlying data sources. Additional complications come from the nature of the data, which is both structured and unstructured. The proposed solution allows users to express questions in natural language, makes apparent the system's interpretation of the query, and allows easy query adjustment and reformulation. The application is in use by more than 1500 users from Siemens Energy. We evaluate our approach on a data set consisting of fleet data.


Toward computational cumulative biology by combining models of biological datasets

arXiv.org Machine Learning

A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to both include biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer and the model-based search was more accurate than keyword search; it moreover recovered biologically meaningful relationships that are not straightforwardly visible from annotations, for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.


Counterfactual Estimation and Optimization of Click Metrics for Search Engines

arXiv.org Machine Learning

Optimizing an interactive system against a predefined online metric is particularly challenging when the metric is computed from user feedback such as clicks and payments. The key challenge is the counterfactual nature of the problem: in the case of Web search, any change to a component of the search engine may result in a different search result page for the same query, but we normally cannot infer reliably from the search log how users would react to the new result page. Consequently, it appears impossible to accurately estimate online metrics that depend on user feedback, unless the new engine is run to serve users and compared with a baseline in an A/B test. This approach, while valid and successful, is unfortunately expensive and time-consuming. In this paper, we propose to address this problem using causal inference techniques, under the contextual-bandit framework. This approach effectively allows one to run (potentially infinitely) many A/B tests offline from the search log, making it possible to estimate and optimize online metrics quickly and inexpensively. Focusing on an important component in a commercial search engine, we show how these ideas can be instantiated and applied, and obtain very promising results that suggest the wide applicability of these techniques.
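To make the offline A/B testing idea concrete, here is a minimal sketch of inverse propensity scoring (IPS), a standard off-policy estimator in the contextual-bandit setting. The abstract does not say which estimator the paper uses, so treat IPS as an assumption; the log layout `(context, action, reward, propensity)` and the function name `ips_estimate` are hypothetical as well.

```python
def ips_estimate(logs, policy):
    """Estimate the average reward a new policy would earn, using only logged data.

    logs   : list of (context, action, reward, propensity) tuples, where
             propensity is the probability with which the logging system
             chose `action` for `context` (assumed known and > 0).
    policy : function mapping a context to the action the new system would take.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Only records where the new policy agrees with the logged action
        # contribute; they are re-weighted by 1 / propensity.
        if policy(context) == action:
            total += reward / propensity
    return total / len(logs)


# Toy search log (made-up data): two queries, two candidate actions,
# each chosen uniformly at random by the logging system.
logs = [
    ("q1", "A", 1.0, 0.5),
    ("q1", "B", 0.0, 0.5),
    ("q2", "A", 0.0, 0.5),
    ("q2", "B", 1.0, 0.5),
]

always_a = lambda context: "A"
per_query = lambda context: "A" if context == "q1" else "B"

print(ips_estimate(logs, always_a))
print(ips_estimate(logs, per_query))
```

Any number of candidate policies can be scored against the same log this way, which is the sense in which many A/B tests can be run offline. The estimate is only unbiased when the logging system randomised its choices and recorded the propensities.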


Pareto-depth for Multiple-query Image Retrieval

arXiv.org Machine Learning

Most content-based image retrieval systems consider either one single query, or multiple queries that include the same object or represent the same semantic information. In this paper we consider the content-based image retrieval problem for multiple query images corresponding to different image semantics. We propose a novel multiple-query information retrieval algorithm that combines the Pareto front method (PFM) with efficient manifold ranking (EMR). We show that our proposed algorithm outperforms state-of-the-art multiple-query retrieval algorithms on real-world image databases. We attribute this performance improvement to concavity properties of the Pareto fronts, and prove a theoretical result that characterizes the asymptotic concavity of the fronts.
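The Pareto front idea can be illustrated with a small sketch. It assumes each database image is already described by a tuple of dissimilarities, one per query (in the paper these would come from EMR; here they are made-up numbers), and it peels off successive non-dominated fronts. The function names are hypothetical, and this is not the authors' implementation.

```python
def dominates(p, q):
    """True if p is at least as good as q for every query and strictly better for one.

    Smaller dissimilarity is better.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_fronts(points):
    """Peel successive Pareto fronts; returns lists of indices, shallowest front first."""
    remaining = dict(enumerate(points))
    fronts = []
    while remaining:
        # A point is on the current front if no other remaining point dominates it.
        front = [
            i for i, p in remaining.items()
            if not any(dominates(q, p) for j, q in remaining.items() if j != i)
        ]
        fronts.append(front)
        for i in front:
            del remaining[i]
    return fronts


# Dissimilarity of five images to (query 1, query 2); illustrative values only.
points = [(0.1, 0.9), (0.9, 0.1), (0.5, 0.5), (0.6, 0.6), (0.95, 0.95)]
print(pareto_fronts(points))
```

Images on the first front are those for which no other image is closer to both queries at once. Note that the image at `(0.5, 0.5)`, which is moderately close to both queries, sits on the same front as the two images that match only one query well.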


Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Neural Information Processing Systems

We go beyond the notion of pairwise similarity and look into search problems with $k$-way similarity functions. In this paper, we focus on problems related to \emph{3-way Jaccard} similarity: $\mathcal{R}^{3way}= \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in \mathcal{C}$, where $\mathcal{C}$ is a size $n$ collection of sets (or binary vectors). We show that approximate $\mathcal{R}^{3way}$ similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to $k$-way resemblance. In the process, we extend the traditional framework of \emph{locality sensitive hashing (LSH)} to handle higher order similarities, which could be of independent theoretical interest. The applicability of $\mathcal{R}^{3way}$ search is shown on the "Google sets" application. In addition, we demonstrate the advantage of $\mathcal{R}^{3way}$ resemblance over the pairwise case in improving retrieval quality.
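As a rough illustration of the quantity involved, the sketch below computes $\mathcal{R}^{3way}$ exactly and also estimates it with min-hashing, using the fact that under a random permutation the three minimum hashes coincide with probability $\mathcal{R}^{3way}$. The function names, the linear hash family and the prime modulus are assumptions made for this example (linear hashes only approximate a random permutation); it is not the search algorithm from the paper.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1; elements must be ints in [0, PRIME)


def jaccard_3way(s1, s2, s3):
    """Exact 3-way resemblance |S1 n S2 n S3| / |S1 u S2 u S3|."""
    union = s1 | s2 | s3
    return len(s1 & s2 & s3) / len(union) if union else 0.0


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(s, hashes):
    """Minimum hash value of a non-empty set under each hash function."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]


def estimate_3way(s1, s2, s3, hashes):
    """Fraction of hash functions on which all three min-hashes agree."""
    sigs = [minhash_signature(s, hashes) for s in (s1, s2, s3)]
    agree = sum(1 for h1, h2, h3 in zip(*sigs) if h1 == h2 == h3)
    return agree / len(hashes)


s1, s2, s3 = {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}
hashes = make_hashes(500)
print(jaccard_3way(s1, s2, s3))           # 2 shared elements out of 6 in the union
print(estimate_3way(s1, s2, s3, hashes))  # approximate; depends on the hash draw
```

The signatures are short and can be compared position by position, which is what makes hashing-based indexing possible; the exact computation, by contrast, needs the full sets.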


Which Space Partitioning Tree to Use for Search?

Neural Information Processing Systems

We consider the task of nearest-neighbor search with the class of binary-space-partitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question "which tree to use for nearest-neighbor search?" To this end, we present theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance: the margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve the search performance of a space-partitioning tree.
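For readers unfamiliar with space-partitioning trees, here is a minimal kd-tree sketch with a search that descends to a single leaf without backtracking. The dictionary-based node layout, the median split rule and the `leaf_size` parameter are assumptions made for this example, and the no-backtracking search is chosen for brevity rather than taken from the abstract.

```python
def build_kd_tree(points, depth=0, leaf_size=2):
    """Recursively split points at the median, cycling through the coordinate axes."""
    if len(points) <= leaf_size:
        return {"leaf": list(points)}
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kd_tree(pts[:mid], depth + 1, leaf_size),
        "right": build_kd_tree(pts[mid:], depth + 1, leaf_size),
    }


def search_without_backtracking(tree, query):
    """Follow the query down to one leaf and return the closest point in that leaf.

    Because no other leaf is examined, the result may not be the true
    nearest neighbor when the query lies close to a partition boundary.
    """
    node = tree
    while "leaf" not in node:
        side = "left" if query[node["axis"]] < node["split"] else "right"
        node = node[side]
    return min(node["leaf"], key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))


points = [(0, 0), (1, 1), (2, 2), (8, 8), (9, 9), (10, 10)]
tree = build_kd_tree(points)
print(search_without_backtracking(tree, (9.2, 9.1)))
print(search_without_backtracking(tree, (0.2, 0.1)))
```

A query that falls near a splitting boundary is the case where this kind of search is most likely to miss the true neighbor, which gives some intuition for why the margins of the partitions matter.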


ARTigo: Building an Artwork Search Engine With Games and Higher-Order Latent Semantic Analysis

AAAI Conferences

This article describes how a semantic search engine has been built from, and still is continuously improved by, a semantic analysis of the “footprints” left by players on the gaming Web platform ARTigo. The Web platform offers several Games With a Purpose (GWAPs), some of which have been specifically designed to collect the data needed for building the artwork search engine. ARTigo is a “tagging ecosystem” of games that cooperate so as to gather a wide range of information on artworks. The ARTigo ecosystem generates a folksonomy saved as a 3rd-order tensor, that is, a generalization of a matrix, the three orders or dimensions of which represent (1) who (2) tagged (3) an artwork. The semantic search engine is built using a non-trivial generalization of the well-known, matrix-based Latent Semantic Analysis (LSA) methods and algorithms. ARTigo has been in service for five years and is the subject of active research that constantly results in new developments, some of which are reported for the first time in this article.


A Framework for Adaptive Crowd Query Processing

AAAI Conferences

Search engines can yield poor results for information retrieval tasks when they cannot interpret query predicates. Such predicates are better left for humans to evaluate. We propose an adaptive processing framework for deciding (a) which parts of a query should be processed by machines and (b) the order in which the crowd should process the remaining parts, optimizing for result quality and processing cost. We describe an algorithm and experimental results for the first framework component.


DataSift: An Expressive and Accurate Crowd-Powered Search Toolkit

AAAI Conferences

Traditional information retrieval systems have limited functionality. For instance, they are not able to adequately support queries containing non-textual fragments such as images or videos, queries that are very long or ambiguous, or semantically-rich queries over non-textual corpora. In this paper, we present DataSift, an expressive and accurate crowd-powered search toolkit that can connect to any corpus. We provide a number of alternative configurations for DataSift using crowdsourced and automated components, and demonstrate gains of 2–3x on precision over traditional retrieval schemes using experiments on real corpora. We also present our results on determining suitable values for parameters in those configurations, along with a number of interesting insights learned along the way.