Information Retrieval
Scalable Probabilistic Databases with Factor Graphs and MCMC
Wick, Michael, McCallum, Andrew, Miklau, Gerome
Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power or scalability, or to restrict the class of relational algebra formulae under which they are closed. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a distribution over possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary queries over probabilistic databases with arbitrary dependencies expressed by graphical models with structure that changes during inference. MCMC sampling provides efficiency by hypothesizing modifications to possible worlds rather than generating entire worlds from scratch. Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world. A significant innovation of this work is the connection between MCMC sampling and materialized view maintenance techniques: we find empirically that using view maintenance techniques is several orders of magnitude faster than naively querying each sampled world. We also demonstrate our system's ability to answer relational queries with aggregation, and demonstrate additional scalability through the use of parallelization.
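The idea of sampling modifications to a single stored world while keeping a query answer up to date can be illustrated with a toy example. The sketch below is not the authors' system: it assumes a hypothetical model with one independent log-linear factor per tuple, a binary label per tuple, and a single COUNT query, and the names `run_mcmc`, `weights`, and `view_count` are invented for illustration. Each Metropolis-Hastings step proposes flipping one label, and the materialized count is adjusted by the change instead of being recomputed over the whole world.

```python
import math
import random
from collections import Counter


def run_mcmc(weights, num_samples, seed=0):
    """Metropolis-Hastings over one stored world with an incrementally
    maintained COUNT view (toy model: one factor per tuple, assumed)."""
    rng = random.Random(seed)
    world = [0] * len(weights)   # the single possible world held in the "database"
    view_count = 0               # materialized view: COUNT(*) WHERE label = 1
    answer_hist = Counter()      # query answer -> number of samples with that answer

    for _ in range(num_samples):
        i = rng.randrange(len(world))      # propose modifying a single tuple
        new_label = 1 - world[i]
        # Only the factor touching tuple i changes, so the log-score
        # difference is local to that tuple.
        delta_score = weights[i] * (new_label - world[i])
        # Symmetric proposal: accept with probability min(1, exp(delta_score)).
        if delta_score >= 0 or rng.random() < math.exp(delta_score):
            view_count += new_label - world[i]   # view maintenance: apply the delta only
            world[i] = new_label
        answer_hist[view_count] += 1             # record the query answer for this sample

    return world, view_count, answer_hist


if __name__ == "__main__":
    world, count, hist = run_mcmc([0.5, -1.0, 2.0, 0.0], num_samples=2000)
    total = sum(hist.values())
    for answer in sorted(hist):
        print(answer, hist[answer] / total)   # estimated P(COUNT = answer)
```

The histogram approximates the distribution over query answers; the point of the sketch is only that each sample costs a constant-time update to the view rather than a full scan.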
Learning Better Context Characterizations: An Intelligent Information Retrieval Approach
Lorenzetti, Carlos M., Maguitman, Ana G.
This paper proposes an incremental method that can be used by an intelligent system to learn better descriptions of a thematic context. The method starts with a small number of terms selected from a simple description of the topic under analysis and uses this description as the initial search context. Using these terms, a set of queries is built and submitted to a search engine. New documents and terms are used to refine the learned vocabulary. Evaluations performed on a large number of topics indicate that the learned vocabulary is much more effective than the original one when constructing queries to retrieve relevant material.
Enriching a News Portal with Semantic Information: An Entity-Based Approach
Bocconi, Stefano (Elsevier Labs) | Fogarolli, Angela (University of Trento)
In this paper we describe the production and consumption of linked data in the scenario of the portal of the Italian news agency ANSA. The goal of the use case is to provide viewers of a news item with background information and links to related news articles contained on the portal. This information enrichment process is entity-based: the ANSA news archive is analyzed using Named Entity Recognition, and each detected entity is annotated with a unique identifier. These identifiers are obtained using the Entity Name Server developed within the scope of the OKKAM European project. Subsequently, the news items are published on the portal using RDFa and linked to a semantic search engine that provides background information harvested from sources such as DBpedia and links to additional news sources. The presented project has the potential to contribute to Linked Data by creating and publishing a large quantity of entities and assertions about them coming from the ANSA news archive.
Linked Data Integration for Semantic Dialogue and Backend Access
Sonntag, Daniel (German Research Center for AI (DFKI)) | Kiesel, Malte (German Research Center for AI (DFKI))
Over the last several years, the market for speech technology has seen significant developments (Pieraccini and Huerta 2005) and powerful commercial off-the-shelf solutions for speech recognition (ASR) or speech synthesis (TTS). Further application scenarios, more diverse and dynamic information sources, and more complex prototype systems need to be addressed in the context of QA. Dialogue-based QA allows a user to pose questions in natural speech, followed by answers presented in a concise form (Sonntag et al. 2007). We learned some lessons which we use as guidelines in the development of multimodal dialogue systems where users can combine speech and gestures when using multiple interaction devices. In earlier projects (Wahlster 2003; Reithinger et al. 2005) we integrated different sub-components to multimodal interaction systems. Other lessons served as guidelines in the development of semantic dialogue systems
Improving Relevancy Accessing Linked Opinion Data
Galitsky, Boris (University of Girona) | Rosa, Josep Lluis de la (University of Girona) | Dobrocsi, Gábor (University of Miskolc)
We introduce a search engine and information retrieval system that provides access to linked opinion data. Generalization of syntactic parse trees is introduced as a similarity measure between the subjects of textual opinions, so that they can be linked on the fly. An information extraction algorithm for automatically summarizing web pages in the format of Google sponsored links is presented. We outline the usability of the implemented system, the integrated opinion delivery environment (IODE).
The Web as a Privacy Lab
Chow, Richard (PARC) | Fang, Ji (PARC) | Golle, Philippe (PARC) | Staddon, Jessica (PARC)
The privacy dangers of data proliferation on the Web are well-known. Information on the Web has facilitated the deanonymization of anonymous bloggers, the de-sanitization of government records and the identification of individuals based on search engine queries. What has received less attention is Web-mining in support of privacy. In this position paper we argue that the very ability of Web data to breach privacy demonstrates its value as a laboratory for the detection of privacy breaches before they happen. In addition, we argue that privacy-invasive services may become privacy-respecting by mining publicly available Web data, with little decrease in performance and efficiency.
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
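As a small illustration of the first of the three matrix types mentioned above, the following sketch builds a term-document matrix from raw counts and compares documents by the cosine of their column vectors. It is a minimal example under stated assumptions, not code from the survey: whitespace tokenization, unweighted counts, and the helper names `term_document_matrix`, `column`, and `cosine` are all chosen here for illustration.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the count of term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer (assumed)
    vocab = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    docs = ["cat sat mat", "cat sat hat", "stocks fell sharply"]
    vocab, m = term_document_matrix(docs)
    print(cosine(column(m, 0), column(m, 1)))   # documents sharing two of three terms
    print(cosine(column(m, 0), column(m, 2)))   # documents sharing no terms
```

Reading the same matrix by rows instead of columns gives term vectors, which is the usual starting point for comparing words rather than documents.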
Random Indexing K-tree
De Vries, Christopher M., De Vine, Lance, Geva, Shlomo
The purpose of this paper is to present and analyse the combination of Random Indexing (RI) with the K-tree algorithm. Both RI and K-tree adapt to changing data and decrease the cost of computationally intensive vector based applications. This combination is particularly suitable to the representation and clustering of very large document collections. Documents are typically represented in vector space as very sparse high dimensional vectors. RI can reduce the dimensionality and sparsity of this representation. In turn, the condensed representation is highly effective when working with K-tree. The paper is focused on determining the effectiveness of using RI with K-tree through experiments and comparative analysis of results. Sections 2 to 6 discuss K-tree, Random Indexing, Document Representation, Experimental Setup and Experimental results respectively. The paper ends with a conclusion in Section 7.
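The dimensionality reduction that Random Indexing provides can be sketched in a few lines. This is a generic illustration of the technique, not the implementation evaluated in the paper, and it leaves out K-tree entirely: each term is assigned a sparse ternary index vector, and a document vector is the sum of the index vectors of its terms. The dimension, the number of nonzero entries, and the function names are assumptions made for the example.

```python
import random


def index_vector(term, dim=64, nonzeros=4, seed=0):
    """Sparse ternary index vector for a term, as a {position: +1 or -1} mapping.

    The generator is seeded from the term so the same term always maps to
    the same vector without storing a lookup table.
    """
    rng = random.Random(f"{seed}:{term}")
    positions = rng.sample(range(dim), nonzeros)
    return {p: rng.choice((-1, 1)) for p in positions}


def document_vector(tokens, dim=64, nonzeros=4, seed=0):
    """Dense reduced-dimension document vector: sum of the terms' index vectors."""
    vec = [0] * dim
    for term in tokens:
        for position, sign in index_vector(term, dim, nonzeros, seed).items():
            vec[position] += sign
    return vec


if __name__ == "__main__":
    doc = "random indexing reduces the dimensionality of sparse document vectors".split()
    reduced = document_vector(doc)
    print(len(reduced), reduced)
```

Because a new term only needs a new index vector, documents and vocabulary can be added without recomputing existing vectors, which is the sense in which the representation adapts to changing data.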
Estimating Robust Query Models with Convex Optimization
Query expansion is a long-studied approach for improving retrieval effectiveness by enhancing the user's original query with additional related words. Current algorithms for automatic query expansion can often improve retrieval accuracy on average, but are not robust: that is, they are highly unstable and have poor worst-case performance for individual queries. To address this problem, we introduce a novel formulation of query expansion as a convex optimization problem over a word graph. The model combines initial weights from a baseline feedback algorithm with edge weights based on word similarity, and integrates simple constraints to enforce set-based criteria such as aspect balance, aspect coverage, and term centrality. Results across multiple standard test collections show consistent and significant reductions in the number and magnitude of expansion failures, while retaining the strong positive gains of the baseline algorithm. Our approach does not assume a particular retrieval model, making it applicable to a broad class of existing expansion algorithms.
Unsupervised Learning of Visual Sense Models for Polysemous Words
Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very limited nature of dictionary definitions. The definitions are used to learn a distribution in the latent space that best represents a sense. The algorithm then uses the text surrounding image links to retrieve images with high probability of a particular dictionary sense. An object classifier is trained on the resulting sense-specific images. We evaluate our method on a dataset obtained by searching the web for polysemous words. Category classification experiments show that our dictionary-based approach outperforms baseline methods.