Information Retrieval
Non-metric Similarity Graphs for Maximum Inner Product Search
Morozov, Stanislav, Babenko, Artem
In this paper we address the problem of Maximum Inner Product Search (MIPS), which is currently the computational bottleneck in a large number of machine learning applications. While similar to nearest neighbor search (NNS), the MIPS problem was shown to be more challenging, as the inner product is not a proper metric function. We propose to solve the MIPS problem using similarity graphs, i.e., graphs where each vertex is connected to the vertices that are the most similar in terms of some similarity function. Originally, the framework of similarity graphs was proposed for metric spaces, and in this paper we naturally extend it to the non-metric MIPS scenario. We demonstrate that, unlike existing approaches, similarity graphs do not require any data transformation to reduce MIPS to the NNS problem and should be applied to the original data. Moreover, we explain why such a reduction is detrimental for similarity graphs. By an extensive comparison to the existing approaches, we show that the proposed method is a game-changer in terms of the runtime/accuracy trade-off for the MIPS problem.
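The idea of searching a similarity graph directly under the inner product can be illustrated with a minimal sketch. It is not the authors' construction: the brute-force graph builder, the single-path greedy walk, and all names (`build_ip_graph`, `greedy_mips`, `degree`) are assumptions chosen for illustration, and the data is random.

```python
import random

def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def build_ip_graph(data, degree):
    """Brute-force graph (illustrative only): link every vertex to the
    `degree` other items with the largest inner product to it."""
    graph = []
    for i, v in enumerate(data):
        others = [j for j in range(len(data)) if j != i]
        others.sort(key=lambda j: inner(v, data[j]), reverse=True)
        graph.append(others[:degree])
    return graph

def greedy_mips(data, graph, query, start):
    """Move to the neighbor with the largest inner product to the query
    until no neighbor improves on the current vertex."""
    current = start
    best = inner(query, data[current])
    while True:
        candidate = max(graph[current], key=lambda j: inner(query, data[j]))
        score = inner(query, data[candidate])
        if score <= best:
            return current, best
        current, best = candidate, score

random.seed(0)
data = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
graph = build_ip_graph(data, degree=8)
query = [random.gauss(0, 1) for _ in range(8)]

found, score = greedy_mips(data, graph, query, start=0)
exact = max(range(len(data)), key=lambda j: inner(query, data[j]))
print(found, exact, score)
```

The walk works on the original vectors, with no transformation to an NNS problem. It may stop at a local maximum, so `found` is not guaranteed to equal `exact`; practical systems usually keep a candidate list rather than a single path.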
Norm-Ranging LSH for Maximum Inner Product Search
Yan, Xiao, Li, Jinfeng, Dai, Xinyan, Chen, Hongzhi, Cheng, James
Neyshabur and Srebro proposed SIMPLE-LSH, which is the state-of-the-art hashing-based algorithm for maximum inner product search (MIPS). We found that the performance of SIMPLE-LSH, in both theory and practice, suffers from long tails in the 2-norm distribution of real datasets. We propose NORM-RANGING LSH, which addresses the excessive normalization problem caused by long tails by partitioning a dataset into sub-datasets and building a hash index for each sub-dataset independently. We prove that NORM-RANGING LSH achieves lower query time complexity than SIMPLE-LSH under mild conditions. We also show that the idea of dataset partitioning can improve another hashing-based MIPS algorithm. Experiments show that NORM-RANGING LSH probes far fewer items than SIMPLE-LSH at the same recall, thus significantly benefiting MIPS-based applications.
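The excessive normalization problem can be made concrete with a small sketch. It assumes the usual SIMPLE-LSH item transform (scale by the maximum norm, then append sqrt(1 - ||x||^2)) and assumes the sub-datasets are contiguous ranges of items sorted by 2-norm, each scaled by its own local maximum. The function names and the toy data are hypothetical, and the hashing step itself is omitted.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def partition_by_norm(items, num_ranges):
    """Sort items by 2-norm and cut them into contiguous sub-datasets."""
    ordered = sorted(items, key=norm)
    size = math.ceil(len(ordered) / num_ranges)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def simple_lsh_item(v, max_norm):
    """Scale by max_norm, then append sqrt(1 - ||x||^2) to get a unit vector."""
    x = [c / max_norm for c in v]
    tail = math.sqrt(max(0.0, 1.0 - sum(c * c for c in x)))
    return x + [tail]

def simple_lsh_query(q):
    """Normalize the query and append a zero coordinate."""
    n = norm(q)
    return [c / n for c in q] + [0.0]

# Toy data: three small-norm items and one long-tail item.
items = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [3.0, 4.0]]

# Single index: every item is scaled by the global maximum norm.
global_max = max(norm(v) for v in items)
global_index = [simple_lsh_item(v, global_max) for v in items]

# Norm ranging: each sub-dataset is scaled by its own maximum norm.
ranges = partition_by_norm(items, num_ranges=2)
local_index = [
    [simple_lsh_item(v, max(norm(u) for u in sub)) for v in sub]
    for sub in ranges
]

# The appended coordinate carries no query information (the query has 0 there).
print(global_index[0][-1], local_index[0][0][-1])
```

With a single global maximum, the small-norm item `[0.1, 0.2]` is mapped to a vector dominated by the appended coordinate. Scaling by the local maximum of its own range leaves more of the unit norm in the coordinates that interact with the query.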
Query Complexity of Bayesian Private Learning
We study the query complexity of Bayesian Private Learning: a learner wishes to locate a random target within an interval by submitting queries, in the presence of an adversary who observes all of her queries but not the responses. How many queries are necessary and sufficient in order for the learner to accurately estimate the target, while simultaneously concealing the target from the adversary? Our main result is a query complexity lower bound that is tight up to the first order. We show that if the learner wants to estimate the target within an error of $\epsilon$, while ensuring that no adversary estimator can achieve a constant additive error with probability greater than $1/L$, then the query complexity is on the order of $L\log(1/\epsilon)$ as $\epsilon \to 0$. Our result demonstrates that increased privacy, as captured by $L$, comes at the expense of a \emph{multiplicative} increase in query complexity. The proof builds on Fano's inequality and properties of certain proportional-sampling estimators.
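For intuition about the $\log(1/\epsilon)$ factor, the following sketch shows the non-private baseline only: plain bisection on $[0, 1]$. It is not the strategy analyzed in the paper. The response model (each query `x` is answered with whether the target lies above `x`) and the function name `bisect_target` are assumptions made for illustration.

```python
import math

def bisect_target(target, eps):
    """Locate `target` in [0, 1] to within `eps` using threshold queries.
    Returns the estimate and the list of queried points, which is what an
    adversary observing the queries (but not the responses) would see."""
    lo, hi = 0.0, 1.0
    queries = []
    while hi - lo > 2 * eps:
        mid = (lo + hi) / 2
        queries.append(mid)
        if target > mid:   # assumed response: is the target above the query?
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, queries

estimate, queries = bisect_target(0.3141, 1e-3)
print(len(queries), estimate, queries[-1])
```

Bisection needs only about $\log_2(1/\epsilon)$ queries, but the queried points converge to the target, so an observer of the queries alone can locate it. The abstract's result quantifies what hiding the target costs: a factor of $L$ on top of this baseline.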
SMART goals for SEO - Search Engine Land
As 2017 comes to a close, many SEOs will be looking forward and setting some goals for their campaigns in 2018. In this post, I am going to take a look at the SMART goals methodology that can help you set and achieve aggressive, yet realistic goals. SMART goals set out a series of criteria that can be used for setting marketing objectives. This is all wrapped up in the clever mnemonic acronym -- Specific, Measurable, Achievable, Realistic and Timelined -- which makes SMART goals so easy to remember. Specific objectives are crucial to success in any marketing campaign.
How Google took on China--and lost
Google's first foray into Chinese markets was a short-lived experiment. Google China's search engine was launched in 2006 and abruptly pulled from mainland China in 2010 amid a major hack of the company and disputes over censorship of search results. But in August 2018, the investigative journalism website The Intercept reported that the company was working on a secret prototype of a new, censored Chinese search engine, called Project Dragonfly. Amid a furor from human rights activists and some Google employees, US Vice President Mike Pence called on the company to kill Dragonfly, saying it would "strengthen Communist Party censorship and compromise the privacy of Chinese customers." In mid-December, The Intercept reported that Google had suspended its development efforts in response to complaints from the company's own privacy team, who learned about the project from the investigative website's reporting. Observers talk as if the decision about whether to reenter the world's largest market is up to Google: will it compromise its principles and censor search the way China wants?
Sequence to Sequence Learning for Query Expansion
As far as we are aware, sequence-to-sequence algorithms for query expansion have not yet been explored in the Information Retrieval literature or in the Question Answering literature. We tried to fill this gap with a custom Query Expansion system trained and tested on open datasets. One distinctive feature of our engine compared to classic ones is that it does not need the documents to expand the input query. We test our expansions on three different tasks: Information Retrieval, Answer preselection and Text classification. Our method yielded a slight improvement in performance on all three tasks.
Quicker ADC : Unlocking the hidden potential of Product Quantization with SIMD
André, Fabien, Kermarrec, Anne-Marie, Scouarnec, Nicolas Le
Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. A common approach is to rely on Product Quantization, which allows storing large vector databases in memory and also allows efficient distance computations. Yet, implementations of nearest neighbor search with Product Quantization have their performance limited by the many memory accesses they perform. Following this observation, André et al. proposed more efficient implementations of $m\times{}4$ product quantizers (PQ) leveraging specific SIMD instructions. Quicker ADC contributes additional implementations not limited to $m\times{}4$ codes and relying on AVX-512, the latest revision of the SIMD instruction set. In doing so, Quicker ADC faces the challenge of efficiently using 5-, 6- and 7-bit shuffles that do not align to computer bytes or words. To this end, we introduce (i) irregular product quantizers combining sub-quantizers of different granularity and (ii) split tables allowing lookup tables larger than registers. We evaluate Quicker ADC with multiple indexes including Inverted Multi-Indexes and IVF HNSW and show that it outperforms the FAISS PQ implementation and its optimization (i.e., Polysemous codes) for numerous configurations. Finally, we open-source at http://github.com/technicolor-research/faiss-quickeradc a fork of FAISS that includes Quicker ADC.
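The table lookups that dominate Product Quantization search can be sketched in scalar Python. This shows plain asymmetric distance computation (ADC) only, not the SIMD implementation described above. The hand-written codebooks (two sub-quantizers with four centroids each, i.e. 2 bits per sub-quantizer instead of the 4 bits of an $m\times{}4$ code) and all function names are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(v, m):
    """Cut a vector into m equal-length sub-vectors."""
    d = len(v) // m
    return [v[i * d:(i + 1) * d] for i in range(m)]

def encode(v, codebooks):
    """PQ code: index of the nearest centroid for each sub-vector."""
    return [
        min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
        for sub, cb in zip(split(v, len(codebooks)), codebooks)
    ]

def distance_tables(q, codebooks):
    """One lookup table per sub-quantizer: distance from the query
    sub-vector to every centroid of that sub-quantizer."""
    return [
        [sq_dist(sub, c) for c in cb]
        for sub, cb in zip(split(q, len(codebooks)), codebooks)
    ]

def adc(code, tables):
    """Asymmetric distance: sum of one table lookup per sub-quantizer."""
    return sum(table[k] for table, k in zip(tables, code))

# Hypothetical toy codebooks: m = 2 sub-quantizers, 4 centroids each.
codebooks = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]],
]
database_vector = [0.9, 0.1, 1.8, 0.3]
query = [0.2, 0.2, 0.1, 1.9]

code = encode(database_vector, codebooks)
tables = distance_tables(query, codebooks)
approx = adc(code, tables)

# The ADC result equals the distance to the vector rebuilt from its code.
reconstructed = [x for cb, k in zip(codebooks, code) for x in cb[k]]
print(code, approx, sq_dist(query, reconstructed))
```

The tables are computed once per query and reused for every database code, so scanning costs `m` lookups and additions per vector. Those lookups are the memory accesses that the SIMD approaches keep in registers.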
Google to 'shut down plans' for censored Chinese search engine
Google has been forced to abandon its plan to launch a specialist Chinese search engine that censors results in line with the strict government, reports have claimed. The firm is believed to have shut down an internal data analysis system which was being used to develop the search engine, known as Dragonfly. According to a report from The Intercept, this has 'effectively ended' the entire project. Members of Google's privacy team raised concerns about the project back in August and it is now extremely unlikely the search engine can be built without the system, according to sources close to the project.
Efficient Autotuning of Hyperparameters in Approximate Nearest Neighbor Search
Jääsaari, Elias, Hyvönen, Ville, Roos, Teemu
Approximate nearest neighbor algorithms are used to speed up nearest neighbor search in a wide array of applications. However, current indexing methods feature several hyperparameters that need to be tuned to reach an acceptable accuracy--speed trade-off. A grid search in the parameter space is often impractically slow due to a time-consuming index-building procedure. Therefore, we propose an algorithm for automatically tuning the hyperparameters of indexing methods based on randomized space-partitioning trees. In particular, we present results using randomized k-d trees, random projection trees and randomized PCA trees. The tuning algorithm adds minimal overhead to the index-building process but is able to find the optimal hyperparameters accurately. We demonstrate that the algorithm is significantly faster than existing approaches, and that the indexing methods used are competitive with the state-of-the-art methods in query time while being faster to build.
Google's China search engine project 'effectively ended': report
Google has been forced to shut down and "effectively end" its controversial China search engine project, code-named Project Dragonfly, after members of the company's privacy team raised complaints, according to a new report. The tech giant led by CEO Sundar Pichai was forced to close a data analysis system it was using for the controversial project, according to The Intercept, citing two sources familiar with the matter. The news outlet originally broke the news that Google had been considering launching the app-based search engine. Google has not yet responded to a request for comment from Fox News.