The World Wide Web provides a wealth of data that can be harnessed to help improve information retrieval and increase understanding of the relationships between different entities. In many cases, we are often interested in determining how similar two entities may be to each other, where the entities may be pieces of text, descriptions of some object, or even the preferences of a group of people. In this work, we examine several instances of this problem, and show how they can be addressed by harnessing data mining techniques applied to large web-based data sets. Specifically, we examine the problems of: (1) determining the similarity of short texts-even those that may not share any terms in common, (2) learning similarity functions for semi-structured data to address tasks such as record linkage between objects, and (3) measuring the similarity between online communities of users as part of a recommendation system. While we present rather different techniques for each problem, we show how measuring similarity between entities in all these domains has a direct application to the overarching goal of improving information access for users of web-based systems.
In the previous post Word Embeddings and Document Vectors: Part 1. Similarity we laid the groundwork for using bag-of-words based document vectors in conjunction with word embeddings (pre-trained or custom-trained) for computing document similarity, as a precursor to classification. It seemed that document word vectors were better at picking up on similarities (or the lack) in toy documents we looked at. We want to carry through with it and apply the approach against actual document repositories to see how the document word vectors do for classification. This post focuses on the approach, the mechanics, and the code snippets to get there. The results will be covered in the next post in this series.
Engineers work best when they can easily find code examples to guide them on particular coding tasks. For some questions -- for example, "How to programmatically close or hide the Android soft keyboard?" But questions specific to proprietary code or APIs (or code written in less common programming languages) need a different solution, since they are not typically discussed in those forums. To address this need, we've developed a code search tool that applies natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. This tool, called Neural Code Search (NCS), accepts natural language queries and returns relevant code fragments retrieved directly from the code corpus.
The power of a Case-Based Reasoning (CBR) System is greatly determined by its capability to retrieve the relevant cases for prediction of the new outcome. The retrieval process involves indexing cases. The combination of nearest neighbour and knowledge-guided techniques for case indexing led to the development of hybrid systems [Cain et al., 1991] joining CBR [Hammond 1986; Kolodner 82] and Explanation-Based Learning (EBL) techniques [Mitchell et al., 1986]. We propose a CBR EBL similarity metric for cases imperfectly described and explained. A case is represented by a past situation, an outcome and a set of explanations of why the situation had such an outcome.