 Information Retrieval


Privacy-focused, rewarded ads browser Brave tops 10M monthly active users - Search Engine Land

#artificialintelligence

Brave said it has seen a surge in user adoption since releasing version 1.0 of the privacy-centric browser on November 13, 2019. Monthly active users (MAU) have doubled in a year to 10.4 million as of the end of last month. Daily active users of the browser, created by Mozilla co-founder Brendan Eich, have tripled in the last year to 3.3 million, the company said Friday. Brave Ads are served only to users who opt in to the Brave Rewards program and agree to see ads. Users can then accumulate Brave's Basic Attention Token (BAT), which is blockchain-based.


Rethinking Search Engines and Recommendation Systems

Communications of the ACM

In her popular book, Weapons of Math Destruction, data scientist Cathy O'Neil elegantly describes to the general population the danger of the data science revolution in decision making. She describes how the US News ranking of universities, which orders universities based on 15 measured properties, created new dynamics in university behavior, as they adapted to these measures, ultimately resulting in decreased social welfare. Unfortunately, the idea that data science-related algorithms, such as ranking, cause changes in behavior, and that this dynamic may lead to socially inferior outcomes, is dominant in our new online economy. Ranking also plays a crucial role in search engines and recommendation systems--two prominent data science applications that we focus on in this article. Recommendation systems endorse items by ranking them using information induced from some context--for example, the Web page a user is currently browsing, a specific application the user is running on her mobile phone, or the time of day.


Keyword Aware Influential Community Search in Large Attributed Graphs

arXiv.org Artificial Intelligence

We introduce a novel keyword-aware influential community query (KICQ) that finds the most influential communities in an attributed graph, where an influential community is defined as a closely connected group of vertices that has some dominance over other groups of vertices and whose expertise (a set of keywords) matches the query terms (words or phrases). We first design the KICQ so that users can issue an influential community search (CS) query intuitively by using a set of query terms and predicates (AND or OR). In this context, we propose a novel word-embedding based similarity model that enables semantic community search, which substantially alleviates the limitations of exact keyword based community search. Next, we propose a new influence measure for a community that considers both the cohesiveness and influence of the community and eliminates the need for specifying values of internal parameters of a network. Finally, we propose two efficient algorithms for searching influential communities in large attributed graphs. We present detailed experiments and a case study to demonstrate the effectiveness and efficiency of the proposed approaches.


Forward and Backward Feature Selection for Query Performance Prediction

arXiv.org Machine Learning

The goal of query performance prediction (QPP) is to automatically estimate the effectiveness of a search result for any given query, without relevance judgements. Post-retrieval features have been shown to be more effective for this task, while being more expensive to compute than pre-retrieval features. Combining multiple post-retrieval features is even more effective, but state-of-the-art QPP methods are impossible to interpret because of the black-box nature of the employed machine learning models. However, interpretation is useful for understanding the predictive model and providing more answers about its behavior. Moreover, combining many post-retrieval features is not applicable to real-world cases, since the query running time is of utmost importance. In this paper, we investigate a new framework for feature selection in which the trained model explains the prediction well. We introduce a step-wise (forward and backward) model selection approach where different subsets of query features are used to fit different models, from which the system selects the best one. We evaluate our approach on four TREC collections using standard QPP features. We also develop two QPP features to address the issue of query drift in the query feedback setting. We found that: (1) our model based on a limited number of selected features is as good as more complex models for QPP and better than non-selective models; (2) our model is more efficient than complex models during inference time since it requires fewer features; (3) the predictive model is readable and understandable; and (4) one of our new QPP features is consistently selected across different collections, proving its usefulness.
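
To make the step-wise idea in the abstract above concrete, here is a minimal sketch of greedy forward selection only (the backward pass is not shown). It is not the authors' implementation: the function names, the feature labels, the usefulness values and the per-feature cost are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(
    candidates: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedy forward selection: keep adding the feature that most improves `score`."""
    selected: List[str] = []
    remaining = list(candidates)
    best = score(frozenset())
    while remaining:
        # Score every remaining candidate when added to the current subset.
        trials = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(trials)
        if top_score - best < min_gain:
            break  # no candidate improves the score enough, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Hypothetical usefulness values and per-feature cost (assumptions, not from the paper).
USEFULNESS = {"wig": 0.30, "nqc": 0.25, "clarity": 0.03, "idf_max": 0.02}
COST_PER_FEATURE = 0.04


def toy_score(subset: FrozenSet[str]) -> float:
    """Toy objective: summed usefulness minus a fixed cost for each feature used."""
    return sum(USEFULNESS[f] for f in subset) - COST_PER_FEATURE * len(subset)


chosen = forward_select(USEFULNESS, toy_score)
print(chosen)
```

In this toy setting the two strongest features are kept and the weak ones are dropped because their cost outweighs their contribution. In practice `score` would be replaced by a validation metric of a model fitted on the candidate subset.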


Information Retrieval and Its Sister Disciplines

arXiv.org Artificial Intelligence

This article presents a summary graph to show the relationships between Information Retrieval (IR) and other related disciplines. The figure shows the key differences between them and the conditions under which one would transition into another. When I studied Machine Learning (ML), my favorite figure among all was "The Table of Common Distributions" in Casella and Berger's 2002 book "Statistical Inference". It appeared in the book's appendix. Every time I saw this figure, I was in awe.


A Contextual-Bandit Approach to Online Learning to Rank for Relevance and Diversity

arXiv.org Machine Learning

Online learning to rank (LTR) focuses on learning a policy from user interactions that builds a list of items sorted in decreasing order of the item utility. It is a core area in modern interactive systems, such as search engines, recommender systems, or conversational assistants. Previous online LTR approaches either assume the relevance of an item in the list to be independent of other items in the list or the relevance of an item to be a submodular function of the utility of the list. The former type of approach may result in a list of low diversity that has relevant items covering the same aspects, while the latter approaches may lead to a highly diversified list but with some non-relevant items. In this paper, we study an online LTR problem that considers both item relevance and topical diversity. We assume cascading user behavior, where a user browses the displayed list of items from top to bottom and clicks the first attractive item and stops browsing the rest. We propose a hybrid contextual bandit approach, called CascadeHybrid, for solving this problem. CascadeHybrid models item relevance and topical diversity using two independent functions and simultaneously learns those functions from user click feedback. We derive a gap-free bound on the n-step regret of CascadeHybrid. We conduct experiments to evaluate CascadeHybrid on the MovieLens and Yahoo music datasets. Our experimental results show that CascadeHybrid outperforms the baselines on both datasets.
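
The cascading user behavior described in the abstract above can be sketched in a few lines. This only illustrates the click model and the feedback it yields, not the CascadeHybrid algorithm itself. The function names are hypothetical, and the attractiveness flags are given as fixed inputs here, whereas a real simulation would sample them.

```python
from typing import List, Optional, Tuple


def cascade_click(attractive: List[bool]) -> Optional[int]:
    """Return the position of the first attractive item, or None if nothing is clicked."""
    for position, is_attractive in enumerate(attractive):
        if is_attractive:
            return position  # the user clicks here and stops browsing
    return None


def observed_feedback(num_items: int, click: Optional[int]) -> List[Tuple[int, int]]:
    """List (position, reward) pairs for the items the user actually examined."""
    # With no click the whole list was examined; otherwise only items up to the click.
    last_examined = num_items - 1 if click is None else click
    return [(pos, int(pos == click)) for pos in range(last_examined + 1)]


click = cascade_click([False, False, True, True])
feedback = observed_feedback(4, click)
print(click, feedback)
```

Note that the fourth item is attractive but produces no feedback, because the user stopped at the third. A learner under this model can only update its estimates for the examined positions.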


80% of Major US E-commerce Stores Use JavaScript for Crucial Content - Search Engine Journal

#artificialintelligence

According to new data, up to 80% of popular US-based e-commerce stores use JavaScript for crucial content such as product descriptions. That's an alarming number, considering that any important content generated with JavaScript runs the risk of not being indexed in search results. Google is getting better at crawling and rendering JavaScript, but it's not perfect. The company still recommends using static HTML as much as possible. The study, from software company Onely, found that 25% of web pages analyzed contained crucial JavaScript content that was unindexed by Google.


scikit-hubness: Hubness Reduction and Approximate Neighbor Search

arXiv.org Machine Learning

This paper introduces scikit-hubness, a Python package for efficient nearest neighbor search in high-dimensional spaces. Hubness is an aspect of the curse of dimensionality, and is known to impair various learning tasks, including classification, clustering, and visualization. scikit-hubness provides algorithms for hubness analysis ("Is my data affected by hubness?"), hubness reduction ("How can we improve neighbor retrieval in high dimensions?"), and approximate neighbor search ("Does it work for large data sets?"). It is integrated into the scikit-learn environment, enabling rapid adoption by Python-based machine learning researchers and practitioners. Users will find all functionality of the scikit-learn neighbors package, plus additional support for transparent hubness reduction and approximate nearest neighbor search. scikit-hubness is developed using several quality assessment tools and principles, such as PEP8 compliance, unit tests with high code coverage, continuous integration on all major platforms (Linux, MacOS, Windows), and additional checks by LGTM. The source code is available at https://github.com/VarIr/scikit-hubness under the BSD 3-clause license. Install from the Python package index with $ pip install scikit-hubness.
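
As a rough illustration of the "Is my data affected by hubness?" question, the sketch below uses only the Python standard library and deliberately does not call the scikit-hubness API. It assumes one commonly used indicator, the skewness of the k-occurrence distribution (how often each point appears in other points' k-nearest-neighbor lists); the function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def k_occurrence(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """Count how often each point appears among the k nearest neighbors of the others."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))  # Euclidean distance
        for j in others[:k]:
            counts[j] += 1
    return counts


def skewness(values: Sequence[float]) -> float:
    """Third standardized moment; returns 0.0 when all values are equal."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


# Toy data (an assumption): one central point surrounded by four outer points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.1), (-1.2, 0.0), (0.0, -1.3)]
counts = k_occurrence(points, k=1)
print(counts, skewness(counts))
```

Here the central point is the nearest neighbor of every other point, so it acts as a hub and the k-occurrence distribution is right-skewed. This brute-force version is quadratic in the number of points and is meant only to show the idea on small inputs.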


Latent Semantic Search and Information Extraction Architecture

arXiv.org Artificial Intelligence

The motivation, concept, design and implementation of latent semantic search for search engines have limited semantic search, entity extraction and property attribution features, have insufficient accuracy and response time of latent search, may impose privacy concerns and the search results are unavailable in offline mode for robotic search operations. The alternative suggestion involves an autonomous search engine with adaptive storage consumption, configurable search scope and latent search response time, with built-in options for entity extraction and property attribution, available as an open source platform for mobile, desktop and server solutions. The suggested architecture attempts to implement artificial general intelligence (AGI) principles as far as autonomous behaviour constrained by limited resources is concerned, and it is applied to the specific task of enabling Web search for artificial agents implementing the AGI.


Semantic Search Engine & Search Analytics Platform for Business

#artificialintelligence

Don't be limited by a search engine that doesn't understand the user intent or the context. Enjoy the power of highly targeted, intuitive and conceptual search and exploration. To help you easily skim through the results, we offer a wide variety of options like Clustering, Semantic Cloud and Intuitive Facets. You could also get exploratory with our Concept Search. We promise a quicker and better search every time!