 Information Retrieval


Binary Embedding with Additive Homogeneous Kernels

AAAI Conferences

Binary embedding maps vectors in Euclidean space to the vertices of Hamming space such that the Hamming distance between binary codes reflects a particular distance metric. In machine learning, the similarity metrics induced by Mercer kernels are frequently used, leading to the development of binary embedding with Mercer kernels (BE-MK), where approximate nearest neighbor search is performed in a reproducing kernel Hilbert space (RKHS). Kernelized locality-sensitive hashing (KLSH), a representative BE-MK method, uses kernel PCA to embed data points into a Euclidean space, followed by random hyperplane binary embedding. In general, it works well when the query and the data points in the database follow the same probability distribution. A streaming data environment, however, requires KLSH to continuously update the leading eigenvectors of the Gram matrix, which can be costly or hard to carry out in practice. In this paper we present a completely randomized binary embedding that works with a family of additive homogeneous kernels, referred to as BE-AHK. The proposed algorithm is easy to implement and builds on Vedaldi and Zisserman's work on explicit feature maps for additive homogeneous kernels. We show that BE-AHK preserves kernel values by deriving upper and lower bounds on its Hamming distance, which guarantee that approximate nearest neighbor search can be solved efficiently. Numerical experiments demonstrate that BE-AHK yields similarity-preserving binary codes in terms of additive homogeneous kernels and is superior to existing methods when training data and queries are generated from different distributions. Moreover, when a large code size is allowed, the performance of BE-AHK is comparable to that of KLSH in general cases.
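The random hyperplane step mentioned in the abstract can be illustrated with a small sketch. This is not the BE-AHK algorithm itself: it omits the explicit feature map for additive homogeneous kernels and only shows, for plain Euclidean vectors, how sign bits from random Gaussian hyperplanes give codes whose normalized Hamming distance approximates the angle between two vectors divided by pi. The function names, the toy vectors, and the code size of 4096 bits are all assumptions chosen for illustration.

```python
import math
import random

def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits Gaussian normal vectors, one hyperplane per code bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def binary_code(x, planes):
    """Bit i is 1 when x lies on the non-negative side of hyperplane i."""
    return [1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) >= 0 else 0
            for w in planes]

def hamming(a, b):
    """Number of positions where the two codes differ."""
    return sum(u != v for u, v in zip(a, b))

def angle(x, y):
    """Angle between x and y in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (norm_x * norm_y))))

# Toy data (hypothetical): one query, one nearby vector, one distant vector.
N_BITS = 4096
planes = random_hyperplanes(dim=3, n_bits=N_BITS)
query = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]
far = [-1.0, 0.2, 0.0]

for name, vec in (("near", near), ("far", far)):
    frac = hamming(binary_code(query, planes), binary_code(vec, planes)) / N_BITS
    print(name, "normalized Hamming:", round(frac, 3),
          "angle/pi:", round(angle(query, vec) / math.pi, 3))
```

Longer codes tighten the agreement between the two printed quantities, which is consistent with the abstract's remark that performance improves when a large code size is allowed.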


A Theoretical Analysis of First Heuristics of Crowdsourced Entity Resolution

AAAI Conferences

Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Due to the inherent ambiguity of data representation and poor data quality, ER is a challenging task for any automated process. As a remedy, human-powered ER via crowdsourcing has become popular in recent years. Using a crowd to answer queries is costly and time-consuming. Furthermore, crowd answers can often be faulty. Therefore, crowd-based ER methods aim to minimize human participation without sacrificing quality, and they make active use of a computer-generated similarity matrix. While some of these methods perform well in practice, no theoretical analysis exists for them, and their worst-case performance does not reflect the experimental findings. This creates a disparity in the understanding of the popular heuristics for this problem. In this paper, we make the first attempt to close this gap. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based ER. We justify experimental observations with our analysis and information-theoretic lower bounds.


elasticsearchr – a Lightweight Elasticsearch Client for R

#artificialintelligence

Elasticsearch is a distributed NoSQL document store, search engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an 'analytics database' for R&D, production use or both. Installation is simple, it ships with sensible default settings that allow it to work effectively out of the box, and all interaction is made via a set of intuitive and extremely well-documented RESTful APIs. I've been using it for two years now and I am evangelical. The elasticsearchr package implements a simple Domain-Specific Language (DSL) for indexing, deleting, querying, sorting and aggregating data in Elasticsearch from within R. The main purpose of this package is to remove the labour involved in assembling HTTP requests to Elasticsearch's REST APIs and processing the responses. Instead, users of this package need only send data frames to, and receive data frames from, Elasticsearch resources.


How to build a search engine - Part 2: Configuring elasticsearch

@machinelearnbot

In this post we will focus on configuring the elasticsearch bit. I have chosen the Wikipedia people dump for the dataset. It contains the wiki pages of a subset of people on Wikipedia. This dataset consists of three columns – URI, name, text. As the column names suggest, URI is the actual wiki link to that person's page, name is the person's name.
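The three columns described above could be declared as an index mapping along the following lines. This is a minimal sketch, assuming Elasticsearch 7 or later (typeless mappings); the index name `wiki_people` and the choice of the `keyword` type for URI are assumptions, not details taken from the post. The block only builds and prints the JSON body that would accompany an index-creation request; it does not contact a cluster.

```python
import json

# Hypothetical index name; the original post does not give one.
INDEX_NAME = "wiki_people"

def build_index_body():
    """Return a mapping body for the URI, name and text columns."""
    return {
        "mappings": {
            "properties": {
                # Assumption: URIs are matched exactly, so they are not analyzed.
                "URI": {"type": "keyword"},
                # Full-text fields are analyzed so they can be searched by term.
                "name": {"type": "text"},
                "text": {"type": "text"},
            }
        }
    }

body = build_index_body()
print(f"PUT /{INDEX_NAME}")
print(json.dumps(body, indent=2))
```

Treating URI as `keyword` keeps it out of full-text analysis, which is a common choice for identifiers; whether it suits this dataset depends on how the links will be queried.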


How Search Engines Use Machine Learning for Pattern Detection

AITopics Original Links

Search engines use machine learning for pattern detection. While it's impossible to explain in one short article how machine learning influences our lives, understanding the basics of machine learning can give you some insight into search algorithm updates, such as Google's Panda update. To predict the outcome of future tests, scripts can use supervised learning on past outcomes to define a hypothetical prediction line. The three images below show how plotted examples define averages. These averages are more likely to represent some truth as the training set grows.
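One simple reading of a "prediction line" learned from past outcomes is an ordinary least-squares fit. The sketch below is an illustration under that assumption, not the method the article describes; the function names and the four data points are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the outcome for a new input using the fitted line."""
    return slope * x + intercept

# Hypothetical past outcomes that happen to lie on y = 2x + 1.
past_inputs = [1.0, 2.0, 3.0, 4.0]
past_outcomes = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(past_inputs, past_outcomes)
print("slope:", slope, "intercept:", intercept)
print("prediction for x=5:", predict(5.0, slope, intercept))
```

With noisy real data the fitted line averages over the examples, which matches the article's point that such averages become more reliable as the training set grows.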


Creative Commons' New Search Engine Makes It Easy To Find Free-To-Use Images

Forbes - Tech

Credit: "Busted" by Jason Scragz is licensed under CC BY 2.0. You copied an image on your blog that you saw on the internet. You didn't think you were doing anything wrong but it turns out you were. How can you avoid all this by finding images that are free to use? Creative Commons is here to help you out. How can you find these images? Google's Advanced Image Search has a drop-down box that allows you to restrict a search by different types of Creative Commons license.


Facebook Search Now Recognizes Objects in Photos - Search Engine Journal

#artificialintelligence

Facebook's artificial intelligence (AI) team has built a visual search system that can recognize content that appears in photos and return relevant search results. Called Lumos, Facebook originally created the platform so that its visually impaired users could understand the content of photos. But Facebook recognized that everyone could benefit from this type of visual search system. Facebook's image search system can detect and segment objects, scenes, animals, places, and clothes that appear in images or videos – and understand them. For instance, let's say you search for "black shirt photo."


ATOL: A Framework for Automated Analysis and Categorization of the Darkweb Ecosystem

AAAI Conferences

We present a framework for automated analysis and categorization of .onion websites in the darkweb to facilitate analyst situational awareness of new content that emerges from this dynamic landscape. Over the last two years, our team has developed a large-scale darkweb crawling infrastructure called OnionCrawler that acquires new onion domains on a daily basis, and crawls and indexes millions of pages from these new and previously known .onion sites. It stores this data in a research repository designed to help better understand Tor’s hidden service ecosystem. The analysis component of our framework is called Automated Tool for Onion Labeling (ATOL), which introduces a two-stage thematic labeling strategy: (1) it learns descriptive and discriminative keywords for different categories, and (2) uses these terms to map onion site content to a set of thematic labels. We also present empirical results of ATOL and our ongoing experimentation with it, as we have gained experience applying it to the entirety of our darkweb repository, now over 70 million indexed pages. We find that ATOL can perform site-level thematic label assignment more accurately than keyword-based schemes developed by domain experts: we expand the analyst-provided keywords using an automatic keyword discovery algorithm, and obtain a 12% gain in accuracy by using a machine learning classification model. We also show how ATOL can discover categories on previously unlabeled onions and discuss applications of ATOL in supporting various analyses and investigations of the darkweb.
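The two-stage strategy in the abstract (learn discriminative keywords per category, then map content to labels using those keywords) can be pictured with a toy sketch. This is not ATOL's actual algorithm: the scoring rule (in-category count minus out-of-category count), the function names, and the four training snippets are assumptions made purely for illustration.

```python
from collections import Counter

def learn_keywords(labeled_docs, top_k=3):
    """Stage 1: keep the words most frequent in a category relative to the rest."""
    counts = {}
    for label, text in labeled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    keywords = {}
    for label, own in counts.items():
        other = Counter()
        for other_label, other_counts in counts.items():
            if other_label != label:
                other.update(other_counts)
        # Toy discriminative score: in-category count minus count elsewhere.
        scores = {word: n - other[word] for word, n in own.items()}
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keywords[label] = set(ranked[:top_k])
    return keywords

def assign_label(text, keywords):
    """Stage 2: pick the category whose keywords overlap the text the most."""
    tokens = set(text.lower().split())
    hits = {label: len(tokens & words) for label, words in keywords.items()}
    best = max(sorted(hits), key=lambda label: hits[label])
    return best if hits[best] > 0 else None  # None when no keyword matches

# Hypothetical labeled snippets standing in for crawled site content.
training = [
    ("market", "buy goods with escrow and vendor ratings"),
    ("market", "vendor listing escrow payment goods"),
    ("forum", "discussion thread reply from forum members"),
    ("forum", "forum thread about moderation and reply rules"),
]

keywords = learn_keywords(training)
print(keywords)
print(assign_label("new vendor offers escrow", keywords))
print(assign_label("reply in this thread", keywords))
```

A word such as "and", which appears in both categories, scores zero and is excluded, which is the sense in which the learned keywords are discriminative rather than merely frequent.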


Building an end-end search engine

@machinelearnbot

In analytics, we retrieve information from various data sources; it can be structured or unstructured. The biggest challenge here is to retrieve information from unstructured data, mainly text. Here machine learning comes into the picture to overcome this challenge. Different algorithms have been designed on different platforms, but here we will discuss one technique that can be applied in Python. The process can be explained better by an example.


Zuckerberg charity buys AI search engine to battle disease

Daily Mail - Science & tech

A charitable foundation backed by Mark Zuckerberg and his wife said Monday it has bought a Canadian artificial intelligence startup as part of a mission to eradicate disease. The Chan Zuckerberg Initiative did not disclose financial terms of the deal to acquire Toronto-based Meta, which uses AI to quickly read and comprehend scientific papers and then provide insights to researchers. Meta capabilities will be unified in a tool made available for free to scientists. Meta artificial intelligence can analyze insights across millions of papers, finding connections and patterns at scales and speeds impossible for humans to match unassisted. In the field of biomedicine alone, thousands of research papers are published daily.