 Information Retrieval


Have You Tried Using a 'Nearest Neighbor Search'?

#artificialintelligence

Roughly a year and a half ago, I had the privilege of taking a graduate "Introduction to Machine Learning" course under the tutelage of the fantastic Professor Leslie Kaelbling. While I learned a great deal over the course of the semester, there was one minor point that she made to the class which stuck with me more than I expected it to at the time: before using a really fancy or sophisticated or "in-vogue" machine learning algorithm to solve your problem, try a simple Nearest Neighbor Search first. Let's say I gave you a bunch of data points, each with a location in space and a value, and then asked you to predict the value of a new point in space. Perhaps the values of your data are binary (just +s and -s) and you've heard of Support Vector Machines. Should you give that a shot?
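To make the setup in this excerpt concrete, here is a minimal sketch of a nearest neighbor prediction in Python. It is an illustration, not the article's own code: the function names `nearest_neighbors` and `predict`, the Euclidean distance, the majority vote, and the toy data are all assumptions chosen for the example.

```python
import math
from collections import Counter


def nearest_neighbors(points, query, k=1):
    """Return the k (location, value) pairs closest to `query`.

    Distance is Euclidean (an assumption; any metric could be swapped in).
    """
    return sorted(points, key=lambda p: math.dist(p[0], query))[:k]


def predict(points, query, k=1):
    """Predict a value for `query` by majority vote among its k nearest points."""
    votes = Counter(value for _, value in nearest_neighbors(points, query, k))
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 2-D locations with binary values (+1 or -1).
data = [
    ((0.0, 0.0), -1),
    ((0.2, 0.1), -1),
    ((1.0, 1.0), +1),
    ((0.9, 1.2), +1),
    ((1.1, 0.8), +1),
]

print(predict(data, (0.1, 0.2)))       # closest point is (0.2, 0.1), so -1
print(predict(data, (1.0, 0.9), k=3))  # three closest points are all +1, so 1
```

This brute-force version sorts every stored point for each query, which is fine as a first baseline on small data; larger data sets would typically call for a spatial index.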


Automatic Summary Generation for Scientific Data Charts

AAAI Conferences

Scientific charts on the web, whether as images or embedded in digital documents, contain valuable information that is not fully available to information retrieval tools. The information used to describe these charts is typically extracted from the image metadata rather than the information the graphic was initially designed to express. The problem of understanding digital charts found in scholarly documents, and inferring useful textual information from their graphical components, is the focus of this study. We present an approach to automatically read the chart data, specifically bar charts, and provide the user with a textual summary of the chart. The proposed method follows a knowledge discovery approach that relies on a versatile graph representation of the chart. This representation is derived from analyzing a chart's original data values, from which useful features are extracted. The data features are in turn used to construct a semantic graph. To generate a summary, the semantic graph of the chart is mapped to appropriately crafted protoforms, which are constructs based on fuzzy logic. We verify the effectiveness of our framework by conducting experiments on bar charts extracted from over 1,000 PDF documents. Our preliminary results show that, under certain assumptions, 83% of the produced summaries provide plausible descriptions of the bar charts.


Encoding Lineage in Scholarly Articles

AAAI Conferences

The development of new scientific concepts today is an outcome of the accumulated knowledge built over time. Every scientific domain requires an understanding of the trends in the dependencies between its subdomains. Analyzing trends to capture such dependencies using conventional document modeling techniques is a challenging task for two reasons: (1) the conventional vector-space representation of documents does not capture the history of the content, and (2) neither feature-level nor document-level causality is provided with any digital library metadata or citation network. In this paper, we propose an intuitive temporal representation of a scientific article that encodes inherent historic characteristics of the content. This intuitive representation of each document is then leveraged to discover causal relationships between scientific articles. In addition, we provide a mechanism to explore the lineage of each document in terms of other previously published documents, which illustrates how the theme of the document under analysis evolved over time. Empirical studies reported in the paper show that the proposed technique identifies meaningful causal relationships and discovers meaningful lineage in the scientific literature that could not be discovered through the citation network of the articles.


Automatic Construction of Evaluation Sets and Evaluation of Document Similarity Models in Large Scholarly Retrieval Systems

AAAI Conferences

Retrieval systems for scholarly literature let the scientific community search, explore and download scholarly articles across various scientific disciplines. Used mostly by experts in a particular field, these systems keep user community logs that include information on the articles each user downloaded. In this paper we present a novel approach for automatically evaluating document similarity models in large collections of scholarly publications. Unlike typical evaluation settings that use test collections consisting of query documents and human-annotated relevance judgments, we use download logs to automatically generate a pseudo-relevant set of similar document pairs. More specifically, we show that consecutively downloaded document pairs, extracted from a scholarly information retrieval (IR) system, can be used as a test collection for evaluating document similarity models. Another novel aspect of our approach lies in the method that we employ for evaluating the performance of the model: comparing the distributions of consecutively downloaded document pairs and random document pairs in log space. Across two families of similarity models, which represent documents in the term vector and topic spaces, we show that our evaluation approach achieves very high correlation with traditional performance metrics such as Mean Average Precision (MAP), while being more efficient to compute.


Factorized Databases: A Knowledge Compilation Perspective

AAAI Conferences

This paper overviews recent work on compilation of relational queries into lossless factorized representations. The primary motivation for this compilation is to avoid redundancy in the representation of query results and speed up their computation and subsequent analytics.
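As a rough illustration of the redundancy the abstract refers to, the sketch below stores a join result as one product of value lists per join key instead of as flat tuples. It is a toy under stated assumptions, not the paper's compilation method: the relations `orders` and `items` and the helpers `factorize_join` and `enumerate_tuples` are hypothetical names invented for this example.

```python
from collections import defaultdict
from itertools import product


def factorize_join(r, s):
    """Join R(key, a) with S(key, b), keeping one (a-values, b-values) pair per key.

    Each pair stands for the Cartesian product of its two lists, so the
    flat tuples are never materialized.
    """
    a_by_key, b_by_key = defaultdict(list), defaultdict(list)
    for key, a in r:
        a_by_key[key].append(a)
    for key, b in s:
        b_by_key[key].append(b)
    return {k: (a_by_key[k], b_by_key[k]) for k in a_by_key if k in b_by_key}


def enumerate_tuples(factorized):
    """Expand the factorized form back into flat (key, a, b) tuples."""
    for key, (a_vals, b_vals) in factorized.items():
        for a, b in product(a_vals, b_vals):
            yield (key, a, b)


# Hypothetical toy relations sharing a join key.
orders = [("pizza", "mon"), ("pizza", "tue"), ("pizza", "fri"), ("salad", "mon")]
items = [("pizza", "base"), ("pizza", "cheese"), ("pizza", "ham"), ("salad", "lettuce")]

fact = factorize_join(orders, items)
flat = list(enumerate_tuples(fact))

flat_values = 3 * len(flat)  # every flat tuple stores three values
fact_values = sum(1 + len(a) + len(b) for a, b in fact.values())
print(flat_values, fact_values)  # 30 values flat versus 10 factorized
```

The representation is lossless in the sense that `enumerate_tuples` recovers exactly the flat join result, while the stored size grows with the sum rather than the product of the per-key list lengths.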


Creating Content for Google's RankBrain

#artificialintelligence

Google revealed in October that it uses artificial intelligence to help with 15% of search queries. Named RankBrain, the system analyzes vague, ambiguous queries and matches them with the most relevant results. In fact, Google's Greg Corrado told Bloomberg that RankBrain is now the third-highest signal contributing to a search-query result. Google, like similar search-engine services, is getting smarter. As marketers, we can no longer rely solely on traditional digital strategies such as link-building or social-media signaling.


Hollywood removes Netflix from its legal streaming site search engine

The Independent - Tech

Netflix has mysteriously been removed from the American film industry's search engine for legal streaming sites. WhereToWatch, which was set up by the Motion Picture Association of America (MPAA) in 2014, lets users search a range of legal streaming services for their favourite TV shows and films. The idea was to provide internet users with a handy resource which would steer them away from illegal streams and downloads, protecting them from potential legal trouble and helping the studios at the same time. However, as TorrentFreak reports, Netflix has been removed from the WhereToWatch search results, despite being one of the most-used legal streaming services in the world. Netflix results have also been removed from the UK equivalent of the site, FindAnyFilm, although some Netflix results are still available on GoWatchIt, the search engine which powers WhereToWatch.


700 SQL Queries per Second in Apache Spark with FiloDB

#artificialintelligence

Apache Spark is increasingly thought of as the new jack-of-all-trades distributed platform for big data crunching, with everything from traditional MapReduce-like workloads to streaming, graph computation, statistics, and machine learning in one package. Except for Spark Streaming, with its micro-batches, Spark is focused for the most part on higher-latency, rich/complex analytics workloads. What about using Spark as an embedded, web-speed / low-latency query engine? This post will dive into using Apache Spark for low-latency, higher-concurrency reporting / dashboard / SQL-like applications, up to hundreds of queries a second! Launching Spark applications on a cluster, or even on localhost, has a pretty high overhead.


Creating an Intelligent Search Engine with Big Data - White Paper

#artificialintelligence

As data grows, organizations are increasingly seeking an intelligent information discovery and analytics platform that goes beyond keyword searches and better understands users' intent. With Google Now and Cortana, advanced question answering systems are starting to become ubiquitous. Recently, Gartner has also started discussing 'insight engines,' a new technology that can provide natural, total, and proactive search, analytics, and discovery.


Bing just became the best search engine for developers

#artificialintelligence

At your day job as a professional code Googler (I mean developer), you probably search for quick snippets multiple times a day to find the best way to perform a particular task. As developers we almost always end up on Stack Overflow or Mozilla Developer Network, but now Microsoft's Bing has given us something even better: executable code directly in search results. Thanks to a collaboration with HackerRank, if you search for something like "string concat C#", you'll get an interactive code editor with a result that can be run directly from that page to see how it works. It's a seriously fantastic feature that I hope Google adds soon. I'm not sure I'd switch search engines for this, but I'm incredibly jealous.