Information Retrieval
Language models like GPT-3 could herald a new type of search engine
Search engines have become faster and more accurate, even as the web has exploded in size. AI is now used to rank results, and Google uses BERT to understand search queries better. Yet beneath these tweaks, all mainstream search engines still work the same way they did 20 years ago: web pages are indexed by crawlers (software that reads the web nonstop and maintains a list of everything it finds), results that match a user's query are gathered from this index, and the results are ranked. Now a team of Google researchers has published a proposal for a radical redesign that throws out the ranking approach and replaces it with a single large AI language model, such as BERT or GPT-3, or a future version of them. The idea is that instead of searching for information in a vast list of web pages, users would ask questions and have a language model trained on those pages answer them directly.
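For contrast with the language-model proposal, the conventional index, match, and rank pipeline described in this excerpt can be sketched roughly as follows. This is a minimal illustration only: the pages, URLs, and the term-count scoring are assumptions made for the example, not how any production search engine works.

```python
from collections import Counter, defaultdict

# Hypothetical toy "crawl": page URL -> page text (illustrative data only).
PAGES = {
    "a.example/bert": "bert helps search engines understand queries",
    "b.example/rank": "search engines rank web pages for queries",
    "c.example/cats": "cats sleep most of the day",
}


def build_index(pages):
    """Index step: map each term to the pages containing it, with term counts."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] += 1
    return index


def search(index, query):
    """Match step, then rank step: gather pages sharing a query term and sort them."""
    scores = Counter()
    for term in query.lower().split():
        # .get avoids creating empty entries for unseen terms.
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Assumed scoring: summed term counts, ties broken by URL for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked]


INDEX = build_index(PAGES)
RESULTS = search(INDEX, "search queries bert")
```

Here `RESULTS` lists the two pages that share terms with the query, with the page matching all three terms first. A page with no matching terms never enters the ranking at all.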
Societal Biases in Retrieved Contents: Measurement Framework and Adversarial Mitigation for BERT Rankers
Rekabsaz, Navid, Kopeinik, Simone, Schedl, Markus
Societal biases resonate in the retrieved contents of information retrieval (IR) systems, reinforcing existing stereotypes. Approaching this issue requires established measures of fairness with respect to the representation of various social groups in retrieval results, as well as methods to mitigate such biases, particularly in light of the advances in deep ranking models. In this work, we first provide a novel framework to measure the fairness in the retrieved text contents of ranking models. Introducing a ranker-agnostic measurement, the framework also enables the disentanglement of the effect on fairness of the collection from that of the rankers. To mitigate these biases, we propose AdvBert, a ranking model achieved by adapting adversarial bias mitigation for IR, which jointly learns to predict relevance and remove protected attributes. We conduct experiments on two passage retrieval collections (MSMARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking), which we extend with fairness annotations of a selected subset of queries regarding gender attributes. Our results on the MSMARCO benchmark show that (1) all ranking models are less fair in comparison with ranker-agnostic baselines, and (2) the fairness of BERT rankers significantly improves when using the proposed AdvBert models. Lastly, we investigate the trade-off between fairness and utility, showing that we can maintain the significant improvements in fairness without any significant loss in utility.
A Conversational Agent System for Dietary Supplements Use
Singh, Esha, Bompelli, Anu, Wan, Ruyuan, Bian, Jiang, Pakhomov, Serguei, Zhang, Rui
Conversational agent (CA) systems have been applied to the healthcare domain, but no such system exists to answer consumers' questions about dietary supplement (DS) use, despite the widespread use of DS. In this study, we develop the first CA system for DS use. Methods: Our CA system for DS use, developed on the MindMeld framework, consists of three components: question understanding, DS knowledge base, and answer generation. We collected and annotated 1509 questions to develop a natural language understanding module (e.g., question type classifier, named entity recognizer), which was then integrated into the MindMeld framework. The CA then queries the DS knowledge base (i.e., iDISK) and generates answers using rule-based slot filling techniques. We evaluated the algorithms of each component and the CA system as a whole. Results: CNN is the best question classifier with an F1 score of 0.81, and CRF is the best named entity recognizer with an F1 score of 0.87. The system achieves an overall accuracy of 81% and an average score of 1.82, with a succ@3 score of 76.2% and a succ@2 score of approximately 66%. Conclusion: This study develops the first CA system for DS use using the MindMeld framework and the iDISK domain knowledge base.
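The three-component flow in this abstract (question understanding, knowledge base lookup, rule-based slot filling) might look roughly like the sketch below. Everything in it is a hypothetical stand-in: the keyword rules replace the trained classifier and entity recognizer, the dictionary replaces iDISK, and none of it uses the MindMeld API.

```python
# Hypothetical stand-in for a DS knowledge base; entries are placeholders, not iDISK data.
KNOWLEDGE_BASE = {
    "supplement_x": {"uses": "example use A", "adverse_effects": "example effect B"},
}

# One answer template per question type; the slots in braces are filled at run time.
TEMPLATES = {
    "uses": "{entity} is listed for: {value}.",
    "adverse_effects": "Reported adverse effects of {entity}: {value}.",
}


def classify_question(question):
    """Toy keyword rule standing in for a trained question-type classifier."""
    return "adverse_effects" if "side effect" in question.lower() else "uses"


def recognize_entity(question):
    """Toy dictionary lookup standing in for a trained named entity recognizer."""
    lowered = question.lower()
    for name in KNOWLEDGE_BASE:
        if name.replace("_", " ") in lowered:
            return name
    return None


def answer(question):
    """Understand the question, query the knowledge base, then fill the template slots."""
    question_type = classify_question(question)
    entity = recognize_entity(question)
    if entity is None:
        return "Sorry, I could not find that supplement."
    value = KNOWLEDGE_BASE[entity][question_type]
    template = TEMPLATES[question_type]
    return template.format(entity=entity.replace("_", " "), value=value)
```

Calling `answer("What are the side effects of Supplement X?")` selects the adverse-effects template and fills its two slots from the placeholder entry. An unknown supplement falls through to the fixed apology string.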
What is Neural Search? - KDnuggets
TL;DR: Neural Search is a new approach to retrieving information using neural networks. Traditional search techniques typically meant writing rules to "understand" the data being searched and return the best results. But with neural search, developers don't need to rack their brains for these rules; the system learns the rules by itself and gets better as it goes along. Even developers who don't know machine learning can quickly build a search engine using open-source frameworks such as Jina. There is a massive amount of data on the web; how can we effectively search through it for relevant information?
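One common way to realize this idea, assumed here because the excerpt does not spell it out, is to let a neural encoder turn documents and queries into vectors and then retrieve by vector similarity instead of hand-written rules. The sketch below uses hand-set three-dimensional vectors in place of a trained encoder and does not use the Jina API. All names and numbers are hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors standing in for a trained encoder's output.
DOC_VECTORS = {
    "doc_shoes": [0.9, 0.1, 0.0],
    "doc_boots": [0.8, 0.3, 0.1],
    "doc_pasta": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vector, doc_vectors, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# A query vector pointing along the first axis, closest to the two footwear documents.
TOP = nearest([1.0, 0.0, 0.0], DOC_VECTORS)
```

In a real system the vectors would come from a model trained on the data, which is where the "learns the rules by itself" part would enter. The retrieval step itself stays this simple.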
MS MARCO: Benchmarking Ranking Models in the Large-Data Regime
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Campos, Daniel, Lin, Jimmy
Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field. However, the goal is not simply to identify which run is "best", achieving the top score. The goal is to move the field forward by developing new robust techniques that work in many different settings and are adopted in research and practice. This paper uses the MS MARCO and TREC Deep Learning Track as our case study, comparing it to the case of TREC ad hoc ranking in the 1990s. We show how the design of the evaluation effort can encourage or discourage certain outcomes, and raise questions about the internal and external validity of results. We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls. We summarize the progress of the effort so far, and describe our desired end state of "robust usefulness", along with steps that might be required to get us there.
Why killing your content marketing makes the most sense - Search Engine Watch
The problem is, simply put, out of control. Just because a company or individual can create and distribute content on a platform, doesn't mean they should. I've had the opportunity to analyze content marketing strategies from huge brands, desperately trying to build audiences online leveraging content marketing. In almost every case, each one made the same mistake. When an organization decides to fund a content marketing strategy, the initial stages are always exciting.
A Unified Transferable Model for ML-Enhanced DBMS
Wu, Ziniu, Yang, Peilun, Yu, Pei, Zhu, Rong, Han, Yuxing, Li, Yaliang, Lian, Defu, Zeng, Kai, Zhou, Jingren
Recently, the database management system (DBMS) community has witnessed the power of machine learning (ML) solutions for DBMS tasks. Despite their promising performance, these existing solutions can hardly be considered satisfactory. First, these ML-based methods in DBMS are not effective enough because they are optimized on each specific task, and cannot explore or understand the intrinsic connections between tasks. Second, the training process has serious limitations that hinder their practicality, because they need to retrain the entire model from scratch for a new DB. Moreover, for each retraining, they require an excessive amount of training data, which is very expensive to acquire and unavailable for a new DB. We propose to explore the transferability of the ML methods both across tasks and across DBs to tackle these fundamental drawbacks. In this paper, we propose a unified model, MTMLF, that uses a multi-task training procedure to capture the transferable knowledge across tasks and a pretrain-finetune procedure to distill the transferable meta knowledge across DBs. We believe this paradigm is more suitable for cloud DB services, and has the potential to revolutionize the way ML is used in DBMS. Furthermore, to demonstrate the predictive power and viability of MTMLF, we provide a concrete and very promising case study on query optimization tasks. Last but not least, we discuss several concrete research opportunities along this line of work.
Retrieving Complex Tables with Multi-Granular Graph Representation Learning
Wang, Fei, Sun, Kexuan, Chen, Muhao, Pujara, Jay, Szekely, Pedro
The task of natural language table retrieval (NLTR) seeks to retrieve semantically relevant tables based on natural language queries. Existing learning systems for this task often treat tables as plain text based on the assumption that tables are structured as dataframes. However, tables can have complex layouts which indicate diverse dependencies between subtable structures, such as nested headers. As a result, queries may refer to different spans of relevant content that is distributed across these structures. Moreover, such systems fail to generalize to novel scenarios beyond those seen in the training set. Prior methods are still distant from a generalizable solution to the NLTR problem, as they fall short in handling complex table layouts or queries over multiple granularities. To address these issues, we propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning. In our framework, a table is first converted into a tabular graph, with cell nodes, row nodes and column nodes to capture content at different granularities. Then the tabular graph is input to a Graph Transformer model that can capture both table cell content and the layout structures. To enhance the robustness and generalizability of the model, we further incorporate a self-supervised pre-training task based on graph-context matching. Experimental results on two benchmarks show that our method leads to significant improvements over the current state-of-the-art systems. Further experiments demonstrate promising performance of our method on cross-dataset generalization, and enhanced capability of handling complex tables and fulfilling diverse query intents. Code and data are available at https://github.com/FeiWang96/GTR.
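The tabular graph described in this abstract, with cell, row, and column nodes, could be built along the lines below. This is a sketch under stated assumptions: it handles only a plain rectangular table, it links each cell to its row node and its column node, and the node naming is hypothetical. The exact edge types and the handling of nested headers in GTR are not given in the abstract.

```python
def table_to_graph(table):
    """Return (nodes, edges) for a rectangular table given as a list of rows.

    Nodes are keyed by tuples: ("row", r), ("col", c), and ("cell", r, c).
    Cell nodes carry their text; row and column nodes carry None.
    """
    nodes = {}
    edges = []
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0

    # Coarse-granularity nodes: one per row and one per column.
    for r in range(n_rows):
        nodes[("row", r)] = None
    for c in range(n_cols):
        nodes[("col", c)] = None

    # Fine-granularity nodes: one per cell, linked to its row and column (assumed edges).
    for r, row in enumerate(table):
        for c, text in enumerate(row):
            cell = ("cell", r, c)
            nodes[cell] = text
            edges.append((cell, ("row", r)))
            edges.append((cell, ("col", c)))
    return nodes, edges


# Placeholder 2x2 table: one header row and one data row.
TABLE = [["header_a", "header_b"], ["value_1", "value_2"]]
NODES, EDGES = table_to_graph(TABLE)
```

For this 2x2 placeholder table the result has eight nodes (two rows, two columns, four cells) and eight edges. A graph model could then pass information between granularities along those edges.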
One Model to Rule them All: Towards Zero-Shot Learning for Databases
Hilprecht, Benjamin, Binnig, Carsten
In this paper, we present our vision of so called zero-shot learning for databases which is a new learning approach for database components. Zero-shot learning for databases is inspired by recent advances in transfer learning of models such as GPT-3 and can support a new database out-of-the box without the need to train a new model. As a first concrete contribution in this paper, we show the feasibility of zero-shot learning for the task of physical cost estimation and present very promising initial results. Moreover, as a second contribution we discuss the core challenges related to zero-shot learning for databases and present a roadmap to extend zero-shot learning towards many other tasks beyond cost estimation or even beyond classical database systems and workloads. And unfortunately, the training data collection needs to be repeated for every new database that needs to be supported. To reduce the high cost of training data collection, reinforcement learning (RL) has been used to execute training queries [10, 17, 18, 34] in a more targeted manner (i.e., letting the RL agent decide which queries to execute next). However, even with reinforcement learning still a large amount of training queries needs to be executed for learning a model. Moreover, training the model is not a one-time effort since similar to workload-driven approaches the learning procedure needs to be repeated for every new database at hand. A different direction that has thus been proposed to avoid the expensive training data collection by running queries on a new database are so called data-driven approaches [11, 31, 32] that learn
MathBERT: A Pre-Trained Model for Mathematical Formula Understanding
Peng, Shuai, Yuan, Ke, Gao, Liangcai, Tang, Zhi
Large-scale pre-trained models like BERT have achieved great success in various Natural Language Processing (NLP) tasks, while it is still a challenge to adapt them to math-related tasks. Current pre-trained models neglect the structural features and the semantic correspondence between a formula and its context. To address these issues, we propose a novel pre-trained model, namely MathBERT, which is jointly trained with mathematical formulas and their corresponding contexts. In addition, in order to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict the masked formula substructures extracted from the Operator Tree (OPT), which is the semantic structural representation of formulas. We conduct various experiments on three downstream tasks to evaluate the performance of MathBERT, including mathematical information retrieval, formula topic classification and formula headline generation. Experimental results demonstrate that MathBERT significantly outperforms existing methods on all three tasks. Moreover, we qualitatively show that this pre-trained model effectively captures the semantic-level structural information of formulas. To the best of our knowledge, MathBERT is the first pre-trained model for mathematical formula understanding.