 Information Retrieval


Why You Should Develop AI-Powered Visual Search Solution?

#artificialintelligence

AI-based visual search has the potential to change how we interact with the world around us. Quytech develops AI-powered visual search solutions to enhance the customer experience.


Boosting Search Engines with Interactive Agents

arXiv.org Artificial Intelligence

Can machines learn to use a search engine as an interactive tool for finding information? That would have far-reaching consequences for making the world's knowledge more accessible. This paper presents first steps in designing agents that learn meta-strategies for contextual query refinements. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based generative language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that can learn interactive search strategies completely from scratch. In both cases, we obtain significant improvements over one-shot search with a strong information retrieval baseline. Finally, we provide an in-depth analysis of the learned search policies.


Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback

arXiv.org Artificial Intelligence

Retrieval with dense, fully-learned representations has the potential to address some fundamental challenges in sparse retrieval. For example, vocabulary mismatch can be solved if the embeddings accurately capture the information need behind a query and map it to relevant documents. However, decades of IR research demonstrate that inferring a user's search intent from a concise and often ambiguous search query is challenging [7]. Even with powerful pre-trained language models, it is unrealistic to expect an encoder to perfectly embed the underlying information need from a few query terms.

Dense retrieval systems conduct first-stage retrieval using embedded representations and simple similarity metrics to match a query to documents. Its effectiveness depends on encoded embeddings to capture the semantics of queries and documents, a challenging task due to the shortness and ambiguity of search queries. This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes…
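To make the idea of pseudo relevance feedback concrete, here is a minimal sketch of a classic Rocchio-style update applied to dense vectors. It is not ANCE-PRF, which learns a new query encoder; it only illustrates the general PRF loop of retrieving with the original query, treating the top results as relevant, and moving the query toward them. The function names `retrieve` and `prf_refine`, the dot-product scoring, and the weights `alpha` and `beta` are assumptions made for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, doc_vecs, k):
    """Return indices of the k documents with the highest dot-product score."""
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    # Sort by descending score; break ties by document index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


def prf_refine(query_vec, doc_vecs, k=2, alpha=1.0, beta=0.5):
    """Rocchio-style refinement (illustrative, not the paper's method):
    new_query = alpha * query + beta * centroid(top-k documents)."""
    top = retrieve(query_vec, doc_vecs, k)
    if not top:
        # No feedback documents: leave the query unchanged.
        return list(query_vec)
    dim = len(query_vec)
    centroid = [sum(doc_vecs[i][j] for i in top) / len(top) for j in range(dim)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]


# Toy embeddings (hypothetical values, for illustration only).
docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.0]
refined = prf_refine(query, docs, k=2)
second_pass = retrieve(refined, docs, k=2)
```

A learned PRF encoder replaces the fixed weighted average with a model that reads the query together with the feedback documents, but the two-pass structure is the same.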



Keyword Extraction API - BytesView

#artificialintelligence

Keyword extraction, also known as keyword detection, is a machine learning technique that can help you automate the identification and extraction of relevant information from unstructured text data. BytesView's keyword extraction tool can analyze unstructured text, including customer feedback, emails, surveys, and social media posts. Pre-define tags to identify topical content, business intelligence, customer opinions, and recurring tickets.


Google Continues To Pay Apple Billions To Remain Safari's Default Search Engine

#artificialintelligence

According to a report from Ped30, which obtained an investor note from Bernstein analysts, Google may be paying Apple as much as $15 billion in 2021 to remain Safari's default search engine. That is up from the $10 billion Google paid Apple in 2020, and the figure is expected to keep growing. According to the analysts, "We now estimate that Google's payments to AAPL to be the default search engine on iOS were $10B in FY 20, higher than our prior published model estimate of $8B. Recent disclosures in Apple's public filings as well as a bottom-up analysis of Google's TAC (traffic acquisition costs) payments each point us to this figure…We now forecast that Google's payments to Apple might be nearly $15B in FY 21, contribute an amazing 850 bps to Services growth YoY, and amount to 9% of company gross profits." They go on to estimate that the figure will jump to $18-$20 billion in 2022, and that the increase is driven by Google wanting to ensure that Microsoft (and other competitors) don't outbid it.


sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification

arXiv.org Machine Learning

Multiclass multilabel classification refers to the task of attributing multiple labels to examples via predictions. Current models formulate a reduction of that multilabel setting into either multiple binary classifications or multiclass classification, allowing for the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). Empirically, these methods have been reported to achieve good performance on different metrics (F1 score, Recall, Precision, etc.). Theoretically though, the multilabel classification reductions do not accommodate the prediction of varying numbers of labels per example, and the underlying losses are distant estimates of the performance metrics. We propose a loss function, sigmoidF1. It is an approximation of the F1 score that (I) is smooth and tractable for stochastic gradient descent, (II) naturally approximates a multilabel metric, and (III) estimates label propensities and label counts. More generally, we show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on different text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. In our experiments, we embed the sigmoidF1 loss in a classification head that is attached to the state-of-the-art efficient pretrained neural networks MobileNetV2 and DistilBERT. Our experiments show that sigmoidF1 outperforms other loss functions on four datasets and several metrics. These results show the effectiveness of using inference-time metrics as loss functions at training time in general, and their potential on non-trivial classification problems like multilabel classification.
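As a rough illustration of how an F1 score can be made smooth, the sketch below replaces hard 0/1 predictions with sigmoid outputs when counting true positives, false positives, and false negatives, then returns one minus the resulting soft F1. It is a sketch under assumptions rather than the authors' exact formulation: the slope `beta`, offset `eta`, the batch-level aggregation, and the small `eps` stabilizer are illustrative choices, and plain Python is used instead of a deep learning framework.

```python
import math


def sigmoid(u, beta=1.0, eta=0.0):
    """Numerically stable sigmoid with slope beta and offset eta (assumed parameters)."""
    x = beta * (u + eta)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_f1_loss(logits, targets, beta=1.0, eta=0.0, eps=1e-16):
    """Soft F1 loss over a batch.

    logits:  list of rows, one raw score per label
    targets: list of rows, 1 if the label applies, else 0
    """
    tp = fp = fn = 0.0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            s = sigmoid(z, beta, eta)
            tp += s * y              # soft true positives
            fp += s * (1 - y)        # soft false positives
            fn += (1 - s) * y        # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1


# Toy batch: two examples, two labels each (hypothetical values).
good = sigmoid_f1_loss([[10.0, -10.0], [-10.0, 10.0]], [[1, 0], [0, 1]])
bad = sigmoid_f1_loss([[-10.0, 10.0], [10.0, -10.0]], [[1, 0], [0, 1]])
```

Because every term is a differentiable function of the logits, the same expression written in an autodiff framework can be minimized directly with stochastic gradient descent.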


QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction

arXiv.org Artificial Intelligence

We study the problem of query attribute value extraction, which aims to identify named entities from user queries as diverse surface form attribute values and afterward transform them into formally canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase but neglect the equally important AVN. To bridge this gap, this paper proposes a unified query attribute value extraction system in e-commerce search named QUEACO, which involves both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance with less supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network that is trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervisions from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale e-commerce dataset demonstrate the effectiveness of QUEACO.
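The normalization step can be pictured with a deliberately simple stand-in: count how often each surface form seen in queries co-occurs with each canonical product attribute value in behavior logs, and map the surface form to the most frequent one. This majority-vote table is not QUEACO's learned AVN model; the function names and the toy `(surface_form, canonical_value)` pairs are hypothetical and only illustrate how weak behavior signals can drive normalization.

```python
from collections import Counter, defaultdict


def build_normalizer(behavior_pairs):
    """Build a surface-form -> canonical-value table by majority vote.

    behavior_pairs: iterable of (surface_form, canonical_value) tuples, e.g.
    derived from queries and the attributes of products users engaged with.
    """
    votes = defaultdict(Counter)
    for surface, canonical in behavior_pairs:
        votes[surface.lower()][canonical] += 1
    # Keep the most frequently co-occurring canonical value per surface form.
    return {s: counts.most_common(1)[0][0] for s, counts in votes.items()}


def normalize(surface, table):
    """Look up a surface form; fall back to the input when it is unseen."""
    return table.get(surface.lower(), surface)


# Hypothetical weakly-labeled pairs, for illustration only.
pairs = [
    ("nike", "Nike"),
    ("nike", "Nike"),
    ("nike", "NIKE Inc"),
    ("Addidas", "adidas"),
]
table = build_normalizer(pairs)
```

Noisy pairs such as the single `"NIKE Inc"` vote are outvoted here; a learned model would additionally generalize to surface forms that never appear in the logs.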


Azure Synapse Analytics Serverless SQL Pool Guidelines

#artificialintelligence

With the introduction of the serverless SQL pool as a part of Azure Synapse Analytics, Microsoft has provided a very cost-efficient and convenient way to drive value from data residing in lakes using simple T-SQL statements. It enables you to easily build logical analytical models by querying and joining data across heterogeneous sources, making the development of complex data integration pipelines obsolete in many cases. Because of its serverless nature, you don't even need to provision it explicitly beforehand: it is part of every Azure Synapse Analytics workspace by default. All you have to do is query data on demand, and you are charged according to the amount of data your queries need to process. Yet the flexibility in how data can be stored and queried requires you to stick to some conventions to properly apply all its features and functionality. Otherwise, the once-promising serverless query engine can end up causing high costs together with poor performance.


Towards Personalized and Human-in-the-Loop Document Summarization

arXiv.org Artificial Intelligence

The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic is far beyond humans' capacity to process properly, causing what is known as information overload. To efficiently cope with large amounts of information and generate content with significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. This thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results prove the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.