DSC Data Science Search Engine


Embracing Responsible AI from Pilot to Production - May 27 On average, 80% of AI projects fail to make it to production. But it IS possible to successfully launch AI, at scale, that is built responsibly and works for everyone. How you scale from pilot to production is critical to ensuring AI success, while continuing to be a good corporate citizen through responsible productization.

Eliminating Search Intent Bias in Learning to Rank Machine Learning

Click-through data has proven to be a valuable resource for improving search-ranking quality. Search engines can easily collect click data, but biases introduced in the data can make it difficult to use the data effectively. In order to measure the effects of biases, many click models have been proposed in the literature. However, none of the models can explain the observation that users with different search intent (e.g., informational, navigational, etc.) have different click behaviors. In this paper, we study how differences in user search intent can influence click activities and determine that there exists a bias between user search intent and document relevance. Based on this observation, we propose a search intent bias hypothesis that can be applied to most existing click models to improve their ability to learn unbiased relevance. Experimental results demonstrate that after adopting the search intent hypothesis, click models can better interpret user clicks and substantially improve retrieval performance.

Rethinking Search Engines and Recommendation Systems

Communications of the ACM

In her popular book, Weapons of Math Destruction, data scientist Cathy O'Neil elegantly describes to the general population the danger of the data science revolution in decision making. She describes how the US News ranking of universities, which orders universities based on 15 measured properties, created new dynamics in university behavior, as they adapted to these measures, ultimately resulting in decreased social welfare. Unfortunately, the idea that data science-related algorithms, such as ranking, cause changes in behavior, and that this dynamic may lead to socially inferior outcomes, is dominant in our new online economy. Ranking also plays a crucial role in search engines and recommendation systems--two prominent data science applications that we focus on in this article. Recommendation systems endorse items by ranking them using information induced from some context--for example, the Web page a user is currently browsing, a specific application the user is running on her mobile phone, or the time of day.

Thatware is redefining digital marketing with artificial intelligence Forbes India


West Bengal, India, Oct 11, 2019 Innovative technologies have always been the norm for empowering new and old businesses. In this competitive era, a company's revenue generation has been seen to be directly proportional to its development in the field of technology. Studies have shown that the digital marketing industry is changing and evolving at a rapid pace. According to one study, 76% of digital marketing success can be obtained from search engine optimization alone. That said, people all around the planet are working immensely hard to get the right amount of search optimization done for their online inventory or websites.

Content-Based Features to Rank Influential Hidden Services of the Tor Darknet Machine Learning

The uneven importance of criminal activities in the onion domains of the Tor Darknet and the different levels of their appeal to the end user make it difficult to measure their influence. To this end, this paper presents a novel content-based ranking framework to detect the most influential onion domains. Our approach comprises a modeling unit that represents an onion domain using forty features extracted from five different resources: user-visible text, HTML markup, Named Entities, network topology, and visual content. It also comprises a ranking unit that, using the Learning-to-Rank (LtR) approach, automatically learns a ranking function by integrating the previously obtained features. Using a case study based on drug-related onion domains, we obtained the following results. (1) Among the explored LtR schemes, the listwise approach outperforms the benchmarked methods with an NDCG of 0.95 for the top-10 ranked domains. (2) We proved quantitatively that our framework surpasses the link-based ranking techniques. (3) With the selected features, we observed that the textual content, composed of text, NER, and HTML features, is the most balanced approach in terms of efficiency and score obtained. The proposed framework might support Law Enforcement Agencies in detecting the most influential domains related to possible suspicious activities.
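For readers unfamiliar with the NDCG figure quoted in this abstract, the following is a minimal sketch of how NDCG at a cutoff k is commonly computed. It assumes linear gains and a log2 position discount; the abstract does not say which NDCG variant the authors used, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering.

    `relevances` holds graded relevance labels in the order the ranker returned
    the items. Returns 0.0 when no item has positive relevance.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for a ranked list of five domains.
    print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```

A score of 1.0 means the ranker returned the items in the ideal order for the top k positions; lower values indicate that highly relevant items were placed too far down the list.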

Search Engine Founder says Artificial Intelligence cannot be created


Every day, there is a new report, news item, or scientific publication in which some company, research team, or startup claims to have launched a product built with Artificial Intelligence, or to have achieved a breakthrough in this field, or promises a new product which will change the entire field. Unfortunately, the term Artificial Intelligence, or AI for short, has to be the most overused term among scientists, computer programmers, startup entrepreneurs, and the tech media alike. It is still in close competition with the term Big Data, though. My name is Sukhbir Benipal and I am the founder and creator of the e-commerce search engine, I have been working in this field for over 5 years and tried endlessly, at various points even believing I had a breakthrough, until one day when Hurricane Sandy hit Manhattan, and with no power, heat or running hot water, I finally realized I was so wrong, on all counts.

A review on ranking problems in statistical learning Machine Learning

Search engines like Google provide a list of websites that are suitable for the user's query in the sense that the first websites displayed are expected to be the most relevant ones. Mathematically speaking, the search engine has to solve a ranking problem, which for Google is done by the PageRank algorithm (Page et al. [1999]). In their seminal paper (Clémençon et al. [2008]), Clémençon and coauthors proposed a statistical framework for ranking problems and proved that the common approach of empirical risk minimization (ERM) is indeed suitable for ranking problems. Although ranking techniques already existed, most of them indeed follow the ERM principle and can be directly embedded into the framework of Clémençon et al. [2008].
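To make the empirical risk minimization idea concrete, here is a minimal sketch of a pairwise empirical ranking risk: the fraction of ordered pairs that a scoring function puts in the wrong order relative to their labels. It follows the spirit of the pairwise formulation mentioned above but is an illustrative assumption, not code from the review; the function name `empirical_ranking_risk` is hypothetical, and pairs with tied scores or tied labels are simply counted as not misordered.

```python
def empirical_ranking_risk(scores, labels):
    """Fraction of ordered pairs (i, j), i != j, ranked in the wrong order.

    A pair is misordered when the score difference and the label difference
    have opposite signs. Ties in either are counted as correct (a simplification).
    """
    n = len(scores)
    if n != len(labels):
        raise ValueError("scores and labels must have the same length")
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    misordered = 0
    for i in range(n):
        for j in range(n):
            if i != j and (labels[i] - labels[j]) * (scores[i] - scores[j]) < 0:
                misordered += 1
    return misordered / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical scores from a ranker and the true relevance labels.
    print(empirical_ranking_risk([0.9, 0.1, 0.5], [3, 2, 1]))
```

An ERM-based ranker would choose, from some class of scoring functions, the one that minimizes this quantity (or a smooth surrogate of it) on the training data.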

SMX Overtime: Here's how to make SEO gains through data science - Search Engine Land


I am a senior data scientist at LinkedIn working on SEO and guest experience. I presented at SMX London last month about how to apply data science in SEO. The session covered topics including metrics, A/B testing, SEO vs. SEM cannibalization testing and machine learning for content quality. Here are a few questions from session attendees with my responses. For A/B testing, do you use any specific tools/processes?

Opening Up the Black Box: Auditing Google's Top Stories Algorithm

AAAI Conferences

Auditing algorithms has emerged as a methodology for holding algorithms accountable by testing whether they are fair. This process often relies on the repeated use of a platform to record inputs and their corresponding outputs. For example, to audit Google search, one repeatedly inputs queries and captures the received search pages. The goal is then to discover, in the collected data, patterns that will reveal the "secrets" of algorithmic decision making. This knowledge discovery process makes some algorithm auditing tasks great applications for data mining techniques. In this paper, we introduce one particular algorithm audit, that of Google's Top stories. We describe the process of data collection, exploration, and analysis for this application and share some of the gleaned insights. Concretely, our analysis suggests that Google might be trying to burst the famous "filter bubble" by choosing less known publishers for the 3rd position in the Top stories.

Fake News: A Survey of Research, Detection Methods, and Opportunities Artificial Intelligence

The explosive growth in fake news and its erosion of democracy, justice, and public trust has increased the demand for fake news analysis, detection and intervention. This survey comprehensively and systematically reviews fake news research. The survey identifies and specifies fundamental theories across various disciplines, e.g., psychology and social science, to facilitate and enhance the interdisciplinary research of fake news. Current fake news research is reviewed, summarized and evaluated. These studies focus on fake news from four perspectives: (1) the false knowledge it carries, (2) its writing style, (3) its propagation patterns, and (4) the credibility of its creators and spreaders. We characterize each perspective with various analyzable and utilizable information provided by news and its spreaders, various strategies and frameworks that are adaptable, and techniques that are applicable. By reviewing the characteristics of fake news and open issues in fake news studies, we highlight some potential research tasks at the end of this survey.