AITopics | Information Retrieval

Information Retrieval

Our accustomed systems for retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Information retrieval document search using vector space model in R

@machinelearnbot, Jan-21-2018, 02:33:09 GMT

Note that there are many variations in the way term frequency (tf) and inverse document frequency (idf) are calculated; this post has shown one of them. The original post includes images of other recommended variations of tf and idf, taken from Wikipedia. Mathematically, the closeness of two vectors is measured by the cosine of the angle between them. Similarly, we can calculate the cosine of the angle between each document vector and the query vector to find how close they are. To find the documents relevant to a query, we calculate a similarity score between each document vector and the query vector by applying cosine similarity.
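The ranking idea described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code of the original post, and it assumes one particular weighting (raw term counts for tf and log(N/df) for idf), which may differ from the variation the post uses. The function names and the whitespace tokenizer are hypothetical choices made for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_idf(docs_tokens):
    # idf(t) = log(N / df(t)), where df(t) counts documents containing t.
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as a dict: raw term count times idf.
    # Terms unseen in the collection are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    # Cosine of the angle between two sparse vectors; 0.0 if either is all zeros.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Returns (score, document index) pairs, best match first.
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    q_vec = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase cats",
        "stock markets fell sharply",
    ]
    for score, idx in rank("cat mat", docs):
        print(f"{score:.3f}  {docs[idx]}")
```

Storing vectors as dictionaries keeps the sketch dependency-free; for a real collection a sparse matrix library would be the more usual choice.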

artificial intelligence, machine learning, natural language, (15 more...)

Country:

North America > United States > Illinois > Cook County > Chicago (0.08)
North America > United States > Hawaii > Honolulu County > Honolulu (0.06)

Industry:

Law (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.44)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Search Engine Optimisation: Few Things to Know

@machinelearnbot, Jan-20-2018, 22:46:22 GMT

SEO, or search engine optimisation, is an internet marketing process for improving the placement of your website in the results returned by search engines like Google and Bing. To make your website search engine friendly, SEO companies use white-hat on-page techniques. In other words, SEO is a set of rules that blog or website owners follow in order to optimise their sites for search engines. As a business owner, you should know what the benefits of SEO services are. SEO is a marketing strategy for securing your position in Google's search rankings.

artificial intelligence, information retrieval, natural language, (12 more...)

Industry:

Marketing (0.53)
Information Technology > Services (0.33)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Building an end-end search engine

@machinelearnbot, Jan-20-2018, 04:49:26 GMT

In analytics, we retrieve information from various data sources, which can be structured or unstructured. The biggest challenge is retrieving information from unstructured data, mainly text. This is where machine learning comes into the picture. Different algorithms have been designed on different platforms, but here we will discuss one technique that can be applied in Python. The process is best explained by an example.

artificial intelligence, information retrieval, natural language, (11 more...)

Country: North America > United States (0.16)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.56)

An introduction to machine-learned ranking in Apache Solr

#artificialintelligence, Jan-16-2018, 05:21:38 GMT

This tutorial describes how to implement a modern learning to rank (LTR, also called machine-learned ranking) system in Apache Solr. It's intended for people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials I found online. This is my attempt at writing the tutorial I wish I had when I was getting started. Firing up a vanilla Solr instance on Linux (Fedora, in my case) is actually pretty straightforward.

information retrieval, machine learning, natural language, (17 more...)

Country: North America > United States > Massachusetts > Hampshire County > Amherst (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Processing a Trillion Rows Per Second on a Single Machine: How Can Nested Loop Joins be this Fast?

@machinelearnbot, Jan-11-2018, 21:12:24 GMT

This blog post describes our experience debugging a failing test case caused by a cross join query running "too fast." Because the root cause of the failing test case spans multiple layers, from Apache Spark to the JVM JIT compiler, we wanted to share our analysis in this post. The vast majority of big data SQL or MPP engines follow the Volcano iterator architecture, which is inefficient for analytical workloads. Since the Spark 2.0 release, the new Tungsten execution engine in Apache Spark implements whole-stage code generation, a technique inspired by modern compilers that collapses the entire query into a single function. This JIT compiler approach is a far superior architecture to the row-at-a-time processing or code generation models employed by other engines, making Spark one of the most efficient engines on the market.

artificial intelligence, natural language, optimization, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.91)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.76)
Information Technology > Communications > Social Media (0.72)

Data Matching – Entity Identification, Resolution & Linkage

@machinelearnbot, Jan-4-2018, 21:15:49 GMT

Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration most commonly refer to people, places, publications or citations, consumer products, or businesses. Besides data matching, the names most prominently used are record or data linkage, entity resolution, object identification, or field matching. A major challenge in data matching is the lack of common entity identifiers across different source systems to be matched. As a result of this, the matching needs to be conducted using attributes that contain partially identifying information, such as names, addresses, or dates of birth.
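As a rough illustration of matching on partially identifying attributes, the sketch below compares records from two sources by averaging string similarity over a few fields. It is only one simple approach, not a method taken from the article: the field names (`name`, `address`, `dob`), the equal field weighting, and the 0.85 threshold are all assumptions chosen for the example, and the similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalize(value):
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(value.lower().split())


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_score(r1, r2, fields=("name", "address", "dob")):
    # Assumption: every field contributes equally to the overall score.
    return sum(similarity(r1[f], r2[f]) for f in fields) / len(fields)


def match_records(source_a, source_b, threshold=0.85):
    # Compare every pair of records; keep pairs scoring at or above the threshold.
    # This is quadratic in the number of records, which is fine for a sketch.
    matches = []
    for i, rec_a in enumerate(source_a):
        for j, rec_b in enumerate(source_b):
            score = record_score(rec_a, rec_b)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


if __name__ == "__main__":
    crm = [{"name": "John Smith", "address": "12 Main St", "dob": "1980-01-02"}]
    billing = [
        {"name": "Jon Smith", "address": "12 Main Street", "dob": "1980-01-02"},
        {"name": "Maria Garcia", "address": "99 Ocean Ave", "dob": "1975-11-30"},
    ]
    for i, j, score in match_records(crm, billing):
        print(f"crm[{i}] matches billing[{j}] (score {score:.2f})")
```

Comparing every pair does not scale to large sources; practical systems usually restrict comparisons to candidate pairs that share some key, a step commonly called blocking.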

artificial intelligence, information retrieval, natural language, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.42)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.40)

A Metasearch Engine That Learns Which Search Engines to Query

AI Magazine, Jan-4-2018, 15:03:58 GMT

Search engines are among the most successful applications on the web today. So many search engines have been created that it is difficult for users to know where they are, how to use them, and what topics they best address. Metasearch engines reduce the user burden by dispatching queries to multiple search engines in parallel. Not too surprisingly then, the most successful applications on the web to date are search engines: tools that assist users in finding information on specific topics. The first decision requires reasoning about the available resources and the second about ranking the search engines.

artificial intelligence, engine, information management, (18 more...)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

How to build a search engine: Part 3

@machinelearnbot, Jan-1-2018, 12:40:34 GMT

Assuming the dataset is named "people_wiki.csv", executing this script produces streaming logs as the data is indexed in Elasticsearch. That's how easy it is! Let's spend the next few lines on what actually happened. We declare an Elasticsearch object configured for our local machine. Once that object is initialized, we use it to index all of our data.

elasticsearch, information retrieval, natural language, (7 more...)

Technology:

Information Technology > Information Management > Search (0.48)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)

A General Framework for Robust Interactive Learning

Emamjomeh-Zadeh, Ehsan; Kempe, David

Neural Information Processing Systems, Dec-31-2017

We propose a general framework for interactively learning models, such as (binary or non-binary) classifiers, orderings/rankings of items, or clusterings of data points. Our framework is based on a generalization of Angluin's equivalence query model and Littlestone's online learning model: in each iteration, the algorithm proposes a model, and the user either accepts it or reveals a specific mistake in the proposal. The feedback is correct only with probability p > 1/2 (and adversarially incorrect with probability 1 - p), i.e., the algorithm must be able to learn in the presence of arbitrary noise. The algorithm's goal is to learn the ground truth model using few iterations. Our general framework is based on a graph representation of the models and user feedback. To be able to learn efficiently, it is sufficient that there be a graph G whose nodes are the models, and (weighted) edges capture the user feedback, with the property that if s, s* are the proposed and target models, respectively, then any (correct) user feedback s' must lie on a shortest s-s* path in G. Under this one assumption, there is a natural algorithm, reminiscent of the Multiplicative Weights Update algorithm, which will efficiently learn s* even in the presence of noise in the user's feedback. From this general result, we rederive with barely any extra effort classic results on learning of classifiers and a recent result on interactive clustering; in addition, we easily obtain new interactive learning algorithms for ordering/ranking.

information retrieval, machine learning, natural language, (17 more...)

Country: North America > United States > California (0.28)

Industry: Education > Educational Setting > Online (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)

Query Complexity of Clustering with Side Information

Mazumdar, Arya; Saha, Barna

Neural Information Processing Systems, Dec-31-2017

Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, ``do two elements $u$ and $v$ belong to the same cluster?''. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. To improve accuracy of clustering, a fruitful approach in recent years has been to ask a domain expert or crowd to obtain labeled data interactively. Many heuristics have been proposed, and all of these use a similarity function to come up with a querying strategy. Even so, there is a lack of systematic theoretical study. Our main contribution in this paper is to show the dramatic power of side information, aka a similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships, such as ones computed by some function on attributes of the elements. A natural noisy model is one where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O(\frac{k^2\log{n}}{\mathcal{H}^2(f_+\|f_-)})$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(\log{n})$ factor. Our algorithms are all efficient and parameter free, i.e., they work without any knowledge of $k$, $f_+$ and $f_-$, and depend only logarithmically on $n$.
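To make the query model concrete, here is a minimal sketch of the simple baseline for the setting without a similarity matrix: each new element is compared against one representative of every cluster found so far, so the number of oracle calls never exceeds $n \cdot k$. This is an illustration of the pair-wise query setting only, not the algorithm from the paper, and it ignores side information entirely. The function name and the `same_cluster` oracle callback are hypothetical.

```python
def cluster_with_oracle(elements, same_cluster):
    """Cluster elements using only pair-wise same-cluster queries.

    same_cluster(u, v) is assumed to be a noiseless oracle returning True
    exactly when u and v belong to the same ground-truth cluster.
    Returns the clusters found and the number of queries asked.
    """
    clusters = []  # each cluster is a list; its first item is the representative
    queries = 0
    for x in elements:
        for cluster in clusters:
            queries += 1
            if same_cluster(cluster[0], x):
                cluster.append(x)
                break
        else:
            # No existing representative matched, so x starts a new cluster.
            clusters.append([x])
    return clusters, queries


if __name__ == "__main__":
    # Hypothetical ground truth used only to simulate the expert labeler.
    labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
    clusters, queries = cluster_with_oracle(
        list(labels), lambda u, v: labels[u] == labels[v]
    )
    print(clusters, queries)
```

Because the oracle is assumed to be noiseless, one representative per cluster is enough; the number of clusters $k$ does not need to be known in advance.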

artificial intelligence, machine learning, natural language, (17 more...)

Country: North America > United States > Massachusetts (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
