 Information Retrieval


Think Search Is Solved? Think Again

#artificialintelligence

Search is one of the oldest technologies around. Ever since the dawn of the World Wide Web, a search engine has been the portal through which we obtain information. The search for a better search engine index kick-started the Hadoop craze, and it continues to drive Google to push the limits of technology. But don't for a second think that search has been solved. "Search is far from being solved. It's the hardest thing we do. It's the hardest thing everybody does."


Can Artificial Intelligence be friends of Humans - OnPassive

#artificialintelligence

The question arises: how would AI become self-aware and realize that humans stand in its way? Artificial Intelligence is the capability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. Robots and AI make it possible to produce things faster, better, and cheaper, with higher consistency. AI is very disruptive for low-cost countries that provide low-cost manufacturing for international companies, since robots do this work cheaply. It is also disruptive to countries with higher salary levels, but not to the same degree as in low-cost countries. Our forefathers had the same concern during the industrial revolutions.


Code Search Intent Classification Using Weak Supervision

arXiv.org Artificial Intelligence

Developers use search for various tasks such as finding code, documentation, and debugging information. In particular, web search is heavily used by developers for finding code examples and snippets during the coding process. Recently, natural language based code search has been an active area of research. However, the lack of real-world large-scale datasets is a significant bottleneck. In this work, we propose a weak supervision based approach for detecting code search intent in search queries for the C# and Java programming languages. We evaluate the approach against several baselines on a real-world dataset comprised of over 1 million queries mined from the Bing web search engine and show that the CNN based model can achieve an accuracy of 77% and 76% for C# and Java respectively. Furthermore, we are also releasing the first large-scale real-world dataset of code search queries mined from the Bing web search engine. We hope that the dataset will aid future research on code search.
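
The abstract does not spell out how the weak labels are produced. As a minimal sketch of the general idea only, weak supervision is often set up as several noisy heuristics ("labeling functions") whose votes are combined into a training label. The three heuristics, their regular expressions, and the majority-vote rule below are illustrative assumptions, not the rules used in the paper.

```python
import re

# Label values; ABSTAIN means a heuristic has no opinion on the query.
ABSTAIN, NOT_CODE, CODE = -1, 0, 1


def lf_has_code_symbols(query):
    # Hypothetical heuristic: API-like tokens such as "list.add" or "foo()".
    return CODE if re.search(r"\w+\.\w+|\w+\(\)", query) else ABSTAIN


def lf_asks_for_example(query):
    # Hypothetical heuristic: phrases that usually precede a request for code.
    pattern = r"\b(how to|example|snippet)\b"
    return CODE if re.search(pattern, query.lower()) else ABSTAIN


def lf_non_code_terms(query):
    # Hypothetical heuristic: terms suggesting a non-code intent.
    pattern = r"\b(download|install|salary)\b"
    return NOT_CODE if re.search(pattern, query.lower()) else ABSTAIN


LABELING_FUNCTIONS = [lf_has_code_symbols, lf_asks_for_example, lf_non_code_terms]


def weak_label(query):
    """Combine heuristic votes by simple majority; ties and no votes abstain."""
    votes = [lf(query) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    code_votes = sum(v == CODE for v in votes)
    not_code_votes = len(votes) - code_votes
    if code_votes == not_code_votes:
        return ABSTAIN
    return CODE if code_votes > not_code_votes else NOT_CODE


if __name__ == "__main__":
    for q in ["c# list.add example", "download visual studio", "c# string"]:
        print(q, "->", weak_label(q))
```

Queries that receive a non-abstaining weak label could then serve as noisy training data for a downstream classifier such as the CNN mentioned in the abstract.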


Ghostery's New Search Engine Will Be Entirely Ad-Free

WIRED

The internet runs on advertising, and that includes search engines. Google brought in $26 billion of search revenue in the most recent quarter alone. As that business has grown, it's reshaped what search looks like. Year after year, ads have gobbled up more space on its results pages, pushing organic results further out of view. Which is why using Ghostery's new ad-free search engine and desktop browser, even in their pre-beta form, feels at once like a throwback to a simpler internet and a glimpse of a future where browsing that puts results ahead of revenue is once again possible.


Non-Linear Multiple Field Interactions Neural Document Ranking

arXiv.org Artificial Intelligence

Ranking tasks are usually based on the text of the main body of the page and the actions (clicks) of users on the page. There are other elements that could be leveraged to better contextualise the ranking experience (e.g., text in other fields, the query made by the user, images). We present one of the first in-depth analyses of field interaction for multiple field ranking in two separate datasets. While some works have taken advantage of full document structure, some aspects remain unexplored. In this work we build on previous analyses to show how query-field interactions, non-linear field interactions, and the architecture of the underlying neural model affect performance.


Instagram finally lets you search for posts by keyword

Engadget

Enhancements to Guides aren't the only thing Instagram users can look forward to checking out today. The app has started rolling out a new, more robust search tool to users in the US, UK, Canada, Ireland and two other English-speaking countries that allows you to look for posts using keywords. If you're someone who wants to grow their followers, this change should ideally help with discoverability since you won't need to be so exacting with the hashtags you add to a post. As for how the new tool goes about surfacing the content it does, an Instagram spokesperson told The Verge the new algorithm considers several factors, including when someone shared the post, the accompanying caption and the photo or video that's on display. Instagram also says it's using machine learning to put forward "the highest quality content that's relevant to you."


FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation

arXiv.org Artificial Intelligence

Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either oversimplify the models by using only independent factorization, which leads to inaccurate estimates and suboptimal query plans, or overcomplicate them with lossless conditional factorization without any independence assumption, which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model, called FSPN. It utilizes both independent and conditional factorization to adaptively model different levels of attribute correlations, and thus subsumes all existing CardEst models and dovetails their advantages. FLAT supports efficient online probability computation in near-linear time on the underlying FSPN model, and provides effective offline model construction. It can estimate cardinality for both single-table queries and multi-table join queries. An extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods on well-known benchmarks: FLAT achieves 1 to 5 orders of magnitude better accuracy, 1 to 3 orders of magnitude faster probability computation speed (around 0.2ms) and 1 to 2 orders of magnitude lower storage cost (only tens of KB).
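
To make the cost of the independence assumption concrete, the following sketch compares a purely independent-factorization estimate with the true cardinality on a tiny table whose two columns are strongly correlated. It illustrates only the oversimplified baseline the abstract criticizes, not FLAT or FSPN; the table contents and function names are made up for illustration.

```python
from collections import Counter

# Toy table of (city, country) rows; the two columns are strongly correlated.
ROWS = [("Paris", "FR")] * 4 + [("Lyon", "FR")] * 2 + [("Berlin", "DE")] * 4


def true_cardinality(rows, city, country):
    """Exact number of rows matching city = ? AND country = ?."""
    return sum(1 for c, k in rows if c == city and k == country)


def independent_estimate(rows, city, country):
    """Estimate N * P(city) * P(country), treating the columns as independent."""
    n = len(rows)
    p_city = Counter(c for c, _ in rows)[city] / n
    p_country = Counter(k for _, k in rows)[country] / n
    return n * p_city * p_country


if __name__ == "__main__":
    for city, country in [("Paris", "FR"), ("Paris", "DE")]:
        est = independent_estimate(ROWS, city, country)
        exact = true_cardinality(ROWS, city, country)
        print(f"{city}/{country}: estimated {est:.1f}, actual {exact}")
```

On this table the independent estimate undercounts the correlated pair (about 2.4 against 4 actual rows) and assigns about 1.6 rows to a combination that never occurs, which is the kind of error that can mislead a query optimizer.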


Practical Guide to Entity Resolution -- part 4

#artificialintelligence

This is part 4 of a mini-series on entity resolution. Candidate pair generation is a fairly straightforward part of ER, as it is essentially a self join on the blocking keys. The next step after candidate pair generation is to score the match likelihood of each candidate pair. This is crucial to removing non-matches and creating the final resolved entities. This step is again fairly open ended, and one can be very creative about the specific scoring functions and features to implement.
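
The two steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the series' own implementation: the sample records, the choice of postal code as blocking key, and the use of `difflib.SequenceMatcher` on names as the scoring function are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records keyed by id.
RECORDS = {
    1: {"name": "Acme Corp", "zip": "10001"},
    2: {"name": "ACME Corporation", "zip": "10001"},
    3: {"name": "Zenith Labs", "zip": "10001"},
    4: {"name": "Acme Corp", "zip": "94105"},
}


def blocking_key(record):
    # Assumed blocking key: records are only compared within the same zip.
    return record["zip"]


def candidate_pairs(records):
    """Self join on the blocking key: all id pairs that share a block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[blocking_key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


def match_score(a, b):
    """Assumed scoring function: case-insensitive name similarity in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


SCORED = {
    (i, j): match_score(RECORDS[i], RECORDS[j])
    for i, j in candidate_pairs(RECORDS)
}

if __name__ == "__main__":
    for pair, score in sorted(SCORED.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```

Record 4 never appears in a pair because no other record shares its block, which shows both the efficiency gain of blocking and its risk: a poor blocking key can hide true matches before scoring ever sees them.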


Artificial Intelligence Decision Support for Medical Triage

arXiv.org Artificial Intelligence

Applying state-of-the-art machine learning and natural language processing to approximately one million teleconsultation records, we developed a triage system, now certified and in use at the largest European telemedicine provider. The system evaluates care alternatives through interactions with patients via a mobile application. Reasoning on an initial set of provided symptoms, the triage application generates AI-powered, personalized questions to better characterize the problem and recommends the most appropriate point of care and time frame for a consultation. The underlying technology was developed to meet the needs for performance, transparency, user acceptance and ease of use, central aspects to the adoption of AI-based decision support systems. Providing such remote guidance at the beginning of the chain of care has significant potential for improving cost efficiency, patient experience and outcomes. Being remote, always available and highly scalable, this service is fundamental in high demand situations, such as the current COVID-19 outbreak. Introduction: Shortage of physicians and increasing healthcare costs have created a need for digital solutions to better optimize medical resources. In addition, patient expectations for mobile, fast and easy 24/7 access to doctors and health services drive the development of patient-centered solutions.


Sampling-Decomposable Generative Adversarial Recommender

arXiv.org Artificial Intelligence

Recommendation techniques are important approaches for alleviating information overload. Being often trained on implicit user feedback, many recommenders suffer from the sparsity challenge due to the lack of explicitly negative samples. GAN-style recommenders (e.g., IRGAN) address the challenge by learning a generator and a discriminator adversarially, such that the generator produces increasingly difficult samples for the discriminator to accelerate optimizing the discrimination objective. However, producing samples from the generator is very time-consuming, and our empirical study shows that the discriminator performs poorly in top-k item recommendation. To this end, a theoretical analysis is made for the GAN-style algorithms, showing that a generator of limited capacity diverges from the optimal generator. This may explain the limitation of the discriminator's performance. Based on these findings, we propose a Sampling-Decomposable Generative Adversarial Recommender (SD-GAR). In the framework, the divergence between some generator and the optimum is compensated by self-normalized importance sampling; the efficiency of sample generation is improved with a sampling-decomposable generator, such that each sample can be generated in O(1) with the Vose-Alias method. Interestingly, due to the decomposability of sampling, the generator can be optimized with closed-form solutions in an alternating manner, which differs from the policy gradient used in the GAN-style algorithms. We extensively evaluate the proposed algorithm with five real-world recommendation datasets. The results show that SD-GAR outperforms IRGAN by 12.4% and the SOTA recommender by 10% on average. Moreover, discriminator training can be 20x faster on the dataset with more than 120K items.
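
The O(1) sampling step mentioned above relies on the alias method. The sketch below is a generic, minimal implementation of Vose's alias method for a fixed discrete distribution, with an O(n) table build followed by constant-time draws. It is not the SD-GAR generator itself; the function names and the example probabilities are assumptions for illustration.

```python
import random


def build_alias_table(probs):
    """Build the probability and alias tables in O(n) time.

    `probs` must be non-negative and sum to 1.
    """
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        g = large.pop()
        prob[s] = scaled[s]  # chance of keeping column s
        alias[s] = g         # otherwise fall through to g
        scaled[g] = scaled[g] + scaled[s] - 1.0  # g donated mass to s
        (small if scaled[g] < 1.0 else large).append(g)
    # Leftovers hold exactly one unit of mass (up to rounding error).
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one index in O(1): pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


if __name__ == "__main__":
    prob, alias = build_alias_table([0.5, 0.3, 0.2])
    rng = random.Random(0)
    draws = [sample(prob, alias, rng) for _ in range(10000)]
    print([draws.count(k) / len(draws) for k in range(3)])
```

Because the table is built once per distribution, this approach pays off when many samples are drawn from the same distribution, which is the situation the sampling-decomposable generator is designed to create.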