Information Retrieval
An Embedding-Based Grocery Search Model at Instacart
Xie, Yuqing, Na, Taesik, Xiao, Xiao, Manchanda, Saurav, Rao, Young, Xu, Zhihong, Shu, Guanghua, Vasiete, Esther, Tenneti, Tejaswi, Wang, Haixun
The key to e-commerce search is how to best utilize the large yet noisy log data. In this paper, we present our embedding-based model for grocery search at Instacart. The system learns query and product representations with a two-tower transformer-based encoder architecture. To tackle the cold-start problem, we focus on content-based features. To train the model efficiently on noisy data, we propose a self-adversarial learning method and a cascade training method. On an offline human evaluation dataset, we achieve 10% relative improvement in RECALL@20, and for online A/B testing, we achieve 4.1% cart-adds per search (CAPS) and 1.5% gross merchandise value (GMV) improvement. We describe how we train and deploy the embedding-based search model and give a detailed analysis of the effectiveness of our method.
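As a rough illustration of embedding-based retrieval and the RECALL@20 metric mentioned in the abstract above, the sketch below ranks products by the dot product between a query vector and product vectors, then computes recall at k. The vectors, product ids, and function names are invented for illustration; the paper's transformer towers, features, and training methods are not reproduced here.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as the query-product similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Sequence[float],
             product_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """Return the ids of the k products most similar to the query."""
    ranked = sorted(product_vecs,
                    key=lambda pid: dot(query_vec, product_vecs[pid]),
                    reverse=True)
    return ranked[:k]


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant products that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical 2-d embeddings standing in for the outputs of the two towers.
products = {
    "organic_banana": [0.9, 0.1],
    "banana_bread": [0.7, 0.4],
    "dish_soap": [0.0, 1.0],
}
query = [1.0, 0.0]

top2 = retrieve(query, products, k=2)
recall = recall_at_k(top2, {"organic_banana", "dish_soap"}, k=2)
```

In this toy setup one of the two relevant products is retrieved in the top 2, so the recall is one half. In a two-tower design the product vectors can be computed ahead of time, and only the query vector has to be computed at search time.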
Representing Social Networks as Dynamic Heterogeneous Graphs
Maleki, Negar, Padmanabhan, Balaji, Dutta, Kaushik
Graph representations for real-world social networks have so far missed two important elements: the multiplexity of connections and the representation of time. To this end, in this paper, we present a new dynamic heterogeneous graph representation for social networks which includes time in every single component of the graph, i.e., nodes and edges, each of which can be of different types to capture heterogeneity. We illustrate the power of this representation by presenting four time-dependent queries and deep learning problems that cannot easily be handled in the commonly used conventional homogeneous graph representations. As a proof of concept, we present a detailed representation of a new social media platform (Steemit), which we use to illustrate both the dynamic querying capability and prediction tasks using graph neural networks (GNNs). The results illustrate the power of the dynamic heterogeneous graph representation to model social networks. Given that this is a relatively understudied area, we also illustrate opportunities for future work in query optimization as well as new dynamic prediction tasks on heterogeneous graph structures.
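A minimal sketch of what a graph with typed, time-stamped nodes and edges might look like, together with one time-dependent query. The class names, field names, edge types, and integer timestamps are assumptions made for illustration and are not taken from the paper's Steemit schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str   # e.g. "user" or "post" (hypothetical types)
    created_at: int  # time attached to the node itself


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str   # e.g. "votes" or "follows" (hypothetical types)
    timestamp: int   # time attached to the edge


def edges_in_window(edges: List[Edge], edge_type: str,
                    start: int, end: int) -> List[Edge]:
    """Edges of one type whose timestamp falls in the half-open window [start, end)."""
    return [e for e in edges
            if e.edge_type == edge_type and start <= e.timestamp < end]


nodes = [
    Node("alice", "user", created_at=1),
    Node("bob", "user", created_at=2),
    Node("post1", "post", created_at=4),
]
edges = [
    Edge("alice", "post1", "votes", timestamp=5),
    Edge("alice", "bob", "follows", timestamp=7),
    Edge("bob", "post1", "votes", timestamp=12),
]

# Time-dependent query: which votes were cast before time 10?
early_votes = edges_in_window(edges, "votes", start=0, end=10)
```

A homogeneous, static graph would store only the three connections, so it could answer neither the "which type" nor the "when" part of this query.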
Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation
Recently introduced transformer-based article encoders (TAEs) designed to produce similar vector representations for mutually related scientific articles have demonstrated strong performance on benchmark datasets for scientific article recommendation. However, the existing benchmark datasets are predominantly focused on single domains and, in some cases, contain easy negatives in small candidate pools. Evaluating representations on such benchmarks might obscure the realistic performance of TAEs in setups with thousands of articles in candidate pools. In this work, we evaluate TAEs on large benchmarks with more challenging candidate pools. We compare the performance of TAEs with a lexical retrieval baseline model, BM25, on the task of citation recommendation, where the model produces a list of articles recommended for citation in a given input article. We find that BM25 is still very competitive with the state-of-the-art neural retrievers, a finding which is surprising given the strong performance of TAEs on small benchmarks. As a remedy for the limitations of the existing benchmarks, we propose a new benchmark dataset for evaluating scientific article representations: the Multi-Domain Citation Recommendation dataset (MDCR), which covers different scientific fields and contains challenging candidate pools.
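For readers unfamiliar with the lexical baseline named above, here is a small sketch of BM25 scoring over tokenized documents. It uses one common variant of the formula (a smoothed, non-negative IDF with k1 = 1.2 and b = 0.75); the abstract does not say which variant or parameters the authors used, and the toy documents are invented. The function assumes a non-empty collection with at least one non-empty document.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["graph", "neural", "networks"],
    ["citation", "recommendation", "with", "transformers"],
    ["lexical", "retrieval", "for", "citation", "recommendation"],
]
query = ["citation", "recommendation"]

scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Both matching documents contain each query term once, so the shorter one ranks first because of length normalization, and the document with no query terms scores zero.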
How AI Writing Tools are Revolutionizing Content Creation (2022)
Search engines are constantly evolving, and as a result, the way we create and consume content is also changing. In particular, the rise of artificial intelligence (AI) writing tools is revolutionizing the content creation process. AI writing software is now being used by bloggers and businesses to create high-quality content quickly and easily. This software can analyze data and find trends to help you write about what's popular right now. It can also help you come up with catchy headlines and create drafts that are ready for publishing. AI writing tools have improved a great deal over the past few years, and they can now help with writing articles, digital ad copy, blog post ideas, YouTube video descriptions, and Google ads, all quickly and in multiple languages. In this article, we'll discuss how AI writing software is changing the way bloggers and businesses create content and answer some frequently asked questions about this technology. AI writing tools are computer programs that can generate written content. AI tools can be used to create blog articles, website content, or even sales letters. Most AI tools use natural language processing (NLP) to understand the topic and then generate relevant content. AI writing tools can save you a lot of time by quickly generating high-quality content. Just enter a few keywords and the AI tool will do the rest.
Code Compliance Assessment as a Learning Problem
Sawant, Neela, Sengamedu, Srinivasan H.
Manual code reviews and static code analyzers are the traditional mechanisms to verify if source code complies with coding policies. However, these mechanisms are hard to scale. We formulate code compliance assessment as a machine learning (ML) problem that takes as input a natural language policy and code, and generates a prediction on the code's compliance, non-compliance, or irrelevance. This can help scale compliance classification and search for policies not covered by traditional mechanisms. We explore key research questions on ML model formulation, training data, and evaluation setup. The core idea is to obtain a joint code-text embedding space which preserves compliance relationships via the vector distance of code and policy embeddings. As there is no task-specific data, we re-interpret and filter commonly available software datasets with additional pre-training and pre-finetuning tasks that reduce the semantic gap. We benchmarked our approach on two listings of coding policies (CWE and CBP). This is a zero-shot evaluation as none of the policies occur in the training set. On CWE and CBP respectively, our tool Policy2Code achieves classification accuracies of (59%, 71%) and search MRR of (0.05, 0.21) compared to CodeBERT with classification accuracies of (37%, 54%) and MRR of (0.02, 0.02). In a user study, 24% of Policy2Code detections were accepted compared to 7% for CodeBERT.
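The search results above are reported as mean reciprocal rank (MRR). As a reminder of how that metric is usually computed, the sketch below averages 1/rank of the first relevant result over a set of queries, counting 0 when nothing relevant is returned. The policy ids and rankings are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Set


def mean_reciprocal_rank(rankings: List[List[str]],
                         relevant: List[Set[str]]) -> float:
    """Average over queries of 1 / (rank of the first relevant item), or 0 if none."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item in rel:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


# Three hypothetical code snippets, each with a ranked list of retrieved policies.
rankings = [
    ["policy_a", "policy_b", "policy_c"],  # relevant policy at rank 2
    ["policy_x", "policy_y", "policy_z"],  # relevant policy at rank 1
    ["policy_p", "policy_q"],              # relevant policy not retrieved
]
relevant = [{"policy_b"}, {"policy_x"}, {"policy_missing"}]

mrr = mean_reciprocal_rank(rankings, relevant)
```

Here the reciprocal ranks are 1/2, 1, and 0, giving an MRR of 0.5. Read the same way, an MRR of 0.21 means the first relevant policy appears, on average in reciprocal terms, around rank five.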
Apple could lose $15B if DOJ forces Google to stop paying to be iPhone's default search engine
Apple stands to lose up to $15 billion a year if the Justice Department forces Google to stop paying the company to be the default search engine on all iPhones - as regulators question the legality of the longtime arrangement. Anytime iPhone users open a web browser to enter a search query, it defaults to Google. Even though anyone can change this setting, almost no one does, resulting in a huge amount of traffic (and ad revenue) to Google from over a billion iPhone users worldwide. Analysts from Bernstein estimated that Google's payment to Apple would increase to $15 billion in 2021 and as high as $18-$20 billion this year, reports 9to5Mac. The contracts are the basis of the DOJ's antitrust case against the California-based company, which began in the closing days of the Trump administration and won't head to trial until sometime in 2023. Last year, Apple's total gross profit was over $152 billion - so losing the Google payments would shave at least 10% off.
MICO: Selective Search with Mutual Information Co-training
Wang, Zhanyu, Zhang, Xiao, Yun, Hyokun, Teo, Choon Hui, Chilimbi, Trishul
In contrast to traditional exhaustive search, in which a query is run against all the documents, selective search first clusters documents into several groups so that a query is executed within only one group or a few groups. Selective search is designed to reduce the latency and computation in modern large-scale search systems. In this study, we propose MICO, a Mutual Information CO-training framework for selective search with minimal supervision using the search logs. After training, MICO not only clusters the documents, but also routes unseen queries to the relevant clusters for efficient retrieval. In our empirical experiments, MICO significantly improves the performance on multiple metrics of selective search and outperforms a number of existing competitive baselines.
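The cluster-then-route idea behind selective search can be sketched as follows. This is not MICO itself: the clusters and centroids here are fixed by hand, whereas MICO learns the document clustering and the query routing from search logs. All names and vectors are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def route(query_vec: Sequence[float],
          centroids: Dict[str, Sequence[float]],
          n_clusters: int) -> List[str]:
    """Pick the n_clusters whose centroids score highest against the query."""
    ranked = sorted(centroids,
                    key=lambda cid: dot(query_vec, centroids[cid]),
                    reverse=True)
    return ranked[:n_clusters]


def selective_search(query_vec: Sequence[float],
                     centroids: Dict[str, Sequence[float]],
                     clusters: Dict[str, Dict[str, Sequence[float]]],
                     n_clusters: int = 1,
                     k: int = 2) -> List[str]:
    """Search only the documents inside the routed clusters."""
    candidates: Dict[str, Sequence[float]] = {}
    for cid in route(query_vec, centroids, n_clusters):
        candidates.update(clusters[cid])
    ranked = sorted(candidates,
                    key=lambda doc: dot(query_vec, candidates[doc]),
                    reverse=True)
    return ranked[:k]


# Hand-made clusters; a trained system would produce these assignments.
centroids = {"produce": [1.0, 0.0], "cleaning": [0.0, 1.0]}
clusters = {
    "produce": {"apple": [0.9, 0.1], "banana": [0.8, 0.0]},
    "cleaning": {"soap": [0.1, 0.9], "sponge": [0.0, 0.7]},
}
query = [0.2, 1.0]

routed = route(query, centroids, n_clusters=1)
results = selective_search(query, centroids, clusters, n_clusters=1, k=2)
```

Only the two documents in the routed cluster are scored, instead of all four. The cost is that a relevant document placed in a cluster that was not selected can never be returned, which is why the quality of clustering and routing matters.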
Share the Tensor Tea: How Databases can Leverage the Machine Learning Ecosystem
Asada, Yuki, Fu, Victor, Gandhi, Apurva, Gemawat, Advitya, Zhang, Lihao, He, Dong, Gupta, Vivek, Nosakhare, Ehi, Banda, Dalitso, Sen, Rathijit, Interlandi, Matteo
We demonstrate Tensor Query Processor (TQP): a query processor that automatically compiles relational operators into tensor programs. By leveraging tensor runtimes such as PyTorch, TQP is able to: (1) integrate with ML tools (e.g., Pandas for data ingestion, Tensorboard for visualization); (2) target different hardware (e.g., CPU, GPU) and software (e.g., browser) backends; and (3) end-to-end accelerate queries containing both relational and ML operators. TQP is generic enough to support the TPC-H benchmark, and it provides performance that is comparable to, and often better than, that of specialized CPU and GPU query processors.
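To give a feel for what expressing relational operators as tensor operations can mean, the sketch below evaluates a query equivalent to SELECT SUM(price * quantity) FROM orders WHERE quantity > 2 using only element-wise operations and a reduction over columns. Plain Python lists stand in for tensors, and the table, column names, and helper functions are invented; this is an illustration of the general idea and not TQP's actual compilation strategy or API.

```python
from typing import List


def greater_than(column: List[float], threshold: float) -> List[float]:
    """Filter predicate as a 0/1 mask, one entry per row."""
    return [1.0 if value > threshold else 0.0 for value in column]


def multiply(a: List[float], b: List[float]) -> List[float]:
    """Element-wise product of two equally long columns."""
    return [x * y for x, y in zip(a, b)]


def reduce_sum(column: List[float]) -> float:
    """Aggregation as a reduction over the whole column."""
    return sum(column)


# Hypothetical "orders" table stored column by column.
price = [2.0, 5.0, 1.5, 4.0]
quantity = [1.0, 3.0, 4.0, 2.0]

# WHERE quantity > 2 becomes a mask; rows are zeroed out rather than removed.
mask = greater_than(quantity, 2.0)
# SUM(price * quantity) becomes multiply, multiply by mask, then reduce.
revenue = reduce_sum(multiply(multiply(price, quantity), mask))
```

Because every step is an element-wise operation or a reduction, the same program shape maps directly onto the vectorized kernels of a tensor runtime, which is what lets one query plan run on different hardware backends.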
Extracting a Knowledge Base of COVID-19 Events from Social Media
Zong, Shi, Baheti, Ashutosh, Xu, Wei, Ritter, Alan
In this paper, we present a manually annotated corpus of 10,000 tweets containing public reports of five COVID-19 events, including positive and negative tests, deaths, denied access to testing, claimed cures and preventions. We designed slot-filling questions for each event type and annotated a total of 31 fine-grained slots, such as the location of events, recent travel, and close contacts. We show that our corpus can support fine-tuning BERT-based classifiers to automatically extract publicly reported events and help track the spread of a new disease. We also demonstrate that, by aggregating events extracted from millions of tweets, we achieve surprisingly high precision when answering complex queries, such as "Which organizations have employees that tested positive in Philadelphia?" We will release our corpus (with user-information removed), automatic extraction models, and the corresponding knowledge base to the research community.