Information Retrieval
Information retrieval document search using vector space model in R
Note, there are many variations in the way we calculate the term-frequency(tf) and inverse document frequency (idf), in this post we have seen one variation. Below images show as the other recommended variations of tf and idf, taken from wiki. Mathematically, closeness between two vectors is calculated by calculating the cosine angle between two vectors. In similar lines, we can calculate cosine angle between each document vector and the query vector to find its closeness. To find relevant document to the query term, we may calculate the similarity score between each document vector and the query term vector by applying cosine similarity .
Search Engine Optimisation: Few Things to Know
SEO or search engine optimisation is an internet marketing process to increase the placement of your website in search results found on search engines like Google and Bing. In order to make your website search engine friendly, SEO companies use some white-hat on-page techniques. In other words, SEO or search engine optimisation includes a set of rules, which are followed by blogs or website owners in order to optimise their websites for search engines. As a business owner one should know what the benefits of SEO services are. SEO is the best marketing strategy to secure your position in the Google algorithm.
Building an end-end search engine
In analytics, we retrieve information from various data sources; it can be structured or unstructured. The biggest challenge here is to retrieve information from unstructured data mainly texts. Here machine learning comes into the picture to overcome this challenge. Different algorithms have been designed in different platforms but here we will discuss one technique that can be applied in python. The process can be explained better by an example.
An introduction to machine-learned ranking in Apache Solr
This tutorial describes how to implement a modern learning to rank (LTR, also called machine-learned ranking) system in Apache Solr. It's intended for people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials I found online. This is my attempt at writing the tutorial I wish I had when I was getting started. Firing up a vanilla Solr instance on Linux (Fedora, in my case) is actually pretty straightforward.
Processing a Trillion Rows Per Second on a Single Machine: How Can Nested Loop Joins be this Fast?
This blog post describes our experience debugging a failing test case caused by a cross join query running "too fast." Because the root cause of fail test case spans across multiple layers--from Apache Spark to the JVM JIT compiler-- we wanted to share our analysis in this post. The vast majority of big data SQL or MPP engines follow the Volcano iterator architecture that is inefficient for analytical workloads. Since Spark 2.0 release, the new Tungsten execution engine in Apache Spark implements whole-stage code generation, a technique inspired by modern compilers to collapse the entire query into a single function. This JIT compiler approach is a far superior architecture than the row-at-a-time processing or code generation model employed by other engines, making Spark one of the most efficient in the market.
Data Matching โ Entity Identification, Resolution & Linkage
Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration most commonly refer to people, places, publications or citations, consumer products, or businesses. Besides data matching, the names most prominently used are record or data linkage, entity resolution, object identification, or field matching. A major challenge in data matching is the lack of common entity identifiers across different source systems to be matched. As a result of this, the matching needs to be conducted using attributes that contain partially identifying information, such as names, addresses, or dates of birth.
A Metasearch Engine That Learns Which Search Engines to Query
Search engines are among the most successful applications on the web today. So many search engines have been created that it is difficult for users to know where they are, how to use them, and what topics they best address. Metasearch engines reduce the user burden by dispatching queries to multiple search engines in parallel. Not too surprisingly then, the most successful applications on the web to date are search engines: tools that assist users in finding information on specific topics. The first decision requires reasoning about the available resources and the second about ranking the search engines.
How to build a search engine: Part 3
Assuming the dataset is named "people_wiki.csv", Executing this script will result in steaming logs which is ultimately leading to the data getting indexed in elasticsearch. That's how easy it is! Let's spend the next few lines on what actually happened. We declare our elasticsearch object configured on our local machine. Once that object is initialized we will use it to index all of our data.
A General Framework for Robust Interactive Learning
Emamjomeh-Zadeh, Ehsan, Kempe, David
We propose a general framework for interactively learning models, such as (binary or non-binary) classifiers, orderings/rankings of items, or clusterings of data points. Our framework is based on a generalization of Angluin's equivalence query model and Littlestone's online learning model: in each iteration, the algorithm proposes a model, and the user either accepts it or reveals a specific mistake in the proposal. The feedback is correct only with probability p > 1/2 (and adversarially incorrect with probability 1 - p), i.e., the algorithm must be able to learn in the presence of arbitrary noise. The algorithm's goal is to learn the ground truth model using few iterations. Our general framework is based on a graph representation of the models and user feedback. To be able to learn efficiently, it is sufficient that there be a graph G whose nodes are the models, and (weighted) edges capture the user feedback, with the property that if s, s* are the proposed and target models, respectively, then any (correct) user feedback s' must lie on a shortest s-s* path in G. Under this one assumption, there is a natural algorithm, reminiscent of the Multiplicative Weights Update algorithm, which will efficiently learn s* even in the presence of noise in the user's feedback. From this general result, we rederive with barely any extra effort classic results on learning of classifiers and a recent result on interactive clustering; in addition, we easily obtain new interactive learning algorithms for ordering/ranking.
Query Complexity of Clustering with Side Information
Suppose, we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, ``do two elements $u$ and $v$ belong to the same cluster?''. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information aka similarity matrix on reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\cH^2(f_+\|f_-)})$ where $\cH^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(\log{n})$ factor. Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of $k, f_+$ and $f_-$, and only depend logarithmically with $n$.