Information Retrieval
Apple Is Quietly Working On Its Own Search Engine To Take On Google
Apple may be stealthily developing its own search engine, as Google faces a lawsuit from U.S. antitrust authorities over the search giant's agreements with companies to be their default search tool. In iOS 14, the newest operating system update for the iPhone, Apple has started showing its own search results and direct links to websites when users search from their home screen. iOS 14 no longer uses Google for many of its search functions. The search window that appears when iPhone users swipe right now compiles Apple-generated search suggestions rather than Google results. Earlier this week, the U.S. Department of Justice said in a landmark lawsuit that Google is monopolizing the search space by entering into multibillion-dollar deals with mobile companies like Apple and Motorola, and network carriers like AT&T and Verizon, to be the default search engine on devices.
Introduction to Machine Learning
Most readers will be familiar with the concept of web page ranking: the process of submitting a query to a search engine, which then finds web pages relevant to the query and returns them in order of relevance. See, e.g., the figure below for an example of the query results for "Machine Learning". That is, the search engine returns a sorted list of web pages given a query. To achieve this goal, a search engine needs to 'know' which pages are relevant and which pages match the query.
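The query-to-sorted-list behaviour described above can be illustrated with a minimal sketch. The scoring rule here (counting occurrences of query terms in each page) is an assumption chosen for brevity, not how any real search engine decides relevance, and the names `score`, `rank`, and `pages` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def score(query: str, page_text: str) -> int:
    """Toy relevance score: total occurrences of the query's terms in the page."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def rank(query: str, pages: Dict[str, str]) -> List[str]:
    """Return ids of matching pages, most relevant first (ties broken by id)."""
    scored = [(score(query, text), page_id) for page_id, text in pages.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Pages with a score of zero do not match the query at all, so drop them.
    return [page_id for s, page_id in scored if s > 0]


# Hypothetical three-page "web" used only for illustration.
pages = {
    "a": "machine learning is the study of learning algorithms",
    "b": "cooking recipes for pasta",
    "c": "a machine shop sells tools",
}
results = rank("machine learning", pages)
```

A learned ranker would replace `score` with a model trained to predict relevance, while the sorting step would stay the same.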
Active Classification with Uncertainty Comparison Queries
Noisy pairwise comparison feedback has been incorporated to improve the overall query complexity of interactively learning binary classifiers. The positivity comparison oracle is used to provide feedback on which of a pair of data points is more likely to be positive. Because it is impossible to infer accurate labels using this oracle alone without knowing the classification threshold, existing methods still rely on the traditional explicit labeling oracle, which directly answers the label of a given data point. Existing methods sort all data points and use the explicit labeling oracle to find the classification threshold. These methods, however, have two drawbacks: (1) they need unnecessary sorting for label inference; (2) quicksort is naively adapted to noisy feedback, which negatively affects practical performance. In order to avoid this inefficiency and acquire information about the classification threshold, we propose a new pairwise comparison oracle concerning uncertainties. This oracle receives two data points as input and answers which one has higher uncertainty. We then propose an efficient adaptive labeling algorithm using the proposed oracle and the positivity comparison oracle. In addition, we address the situation where the labeling budget is insufficient compared to the dataset size, which can be dealt with by plugging the proposed algorithm into an active learning algorithm. Furthermore, we confirm the feasibility of the proposed oracle and the performance of the proposed algorithm theoretically and empirically.
A Clarifying Question Selection System from NTES_ALONG in Convai3 Challenge
This paper presents the participation of the NTES_ALONG team in the ClariQ challenge at the Search-oriented Conversational AI (SCAI) EMNLP workshop in 2020. The challenge asks for a complete conversational information retrieval system that can understand and generate clarification questions. We propose a clarifying question selection system which consists of response understanding, candidate question recalling and clarifying question ranking. We fine-tune a RoBERTa model to understand users' responses and use an enhanced BM25 model to recall the candidate questions. In the clarifying question ranking stage, we reconstruct the training dataset and propose two models based on ELECTRA. Finally, we ensemble the models by summing up their output probabilities and choose the question with the highest probability as the clarification question. Experiments show that our ensemble ranking model outperforms in the document relevance task and achieves the best recall@[20,30] metrics in the question relevance task.
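The final ensembling step (summing the models' output probabilities and picking the highest-scoring question) can be sketched as follows. This is only an illustration under stated assumptions: the function `select_question`, the candidate questions, and the probability values are hypothetical, and the two probability lists merely stand in for the outputs of the two ranking models.

```python
from typing import List, Sequence, Tuple


def select_question(
    candidates: Sequence[str],
    model_probs: Sequence[Sequence[float]],
) -> Tuple[str, List[float]]:
    """Sum per-model probabilities for each candidate and return the best one.

    model_probs holds one probability list per model, each aligned with
    candidates (model_probs[m][i] is model m's score for candidates[i]).
    """
    totals = [
        sum(probs[i] for probs in model_probs) for i in range(len(candidates))
    ]
    best_index = max(range(len(candidates)), key=lambda i: totals[i])
    return candidates[best_index], totals


# Hypothetical recalled candidates and made-up model outputs.
candidates = [
    "Are you looking for a specific brand?",
    "Do you want reviews or prices?",
    "Which city are you in?",
]
model_a = [0.2, 0.7, 0.4]
model_b = [0.6, 0.5, 0.3]
chosen, totals = select_question(candidates, [model_a, model_b])
```

Summing probabilities gives the same ordering as averaging them, so the choice of question does not depend on which of the two is used.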
QBSUM: a Large-Scale Query-Based Document Summarization Dataset from Real-world Applications
Zhao, Mingjun, Yan, Shengli, Liu, Bang, Zhong, Xinwang, Hao, Qian, Chen, Haolan, Niu, Di, Long, Bowei, Guo, Weidong
Query-based document summarization aims to extract or generate a summary of a document which directly answers or is relevant to the search query. It is an important technique that can be beneficial to a variety of applications such as search engines, document-level machine reading comprehension, and chatbots. Currently, datasets designed for query-based summarization are few in number, and existing datasets are also limited in both scale and quality. Moreover, to the best of our knowledge, there is no publicly available dataset for Chinese query-based document summarization. In this paper, we present QBSUM, a high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization. We also propose multiple unsupervised and supervised solutions to the task and demonstrate their high-speed inference and superior performance via both offline experiments and online A/B tests. The QBSUM dataset is released in order to facilitate future advancement of this research field.
Query Complexity of k-NN based Mode Estimation
Singhal, Anirudh, Pirojiwala, Subham, Karamchandani, Nikhil
Motivated by the mode estimation problem of an unknown multivariate probability density function, we study the problem of identifying the point with the minimum k-th nearest neighbor distance for a given dataset of n points. We study the case where the pairwise distances are a priori unknown, but we have access to an oracle which we can query to get noisy information about the distance between any pair of points. For two natural oracle models, we design a sequential learning algorithm, based on the idea of confidence intervals, which adaptively decides which queries to send to the oracle and is able to correctly solve the problem with high probability. We derive instance-dependent upper bounds on the query complexity of our proposed scheme and also demonstrate significant improvement over the performance of other baselines via extensive numerical evaluations.
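To make the target quantity concrete, the sketch below finds the point with the minimum k-th nearest neighbor distance when all pairwise distances are known exactly. This is a noiseless brute-force baseline, not the sequential, confidence-interval-based algorithm from the abstract; the function names and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kth_nn_distance(points: Sequence[Point], i: int, k: int) -> float:
    """Distance from points[i] to its k-th nearest neighbor (k >= 1)."""
    dists: List[float] = sorted(
        math.dist(points[i], other)
        for j, other in enumerate(points)
        if j != i
    )
    return dists[k - 1]


def mode_estimate(points: Sequence[Point], k: int) -> int:
    """Index of the point whose k-th nearest neighbor is closest."""
    return min(range(len(points)), key=lambda i: kth_nn_distance(points, i, k))


# Hypothetical data: a tight cluster near the origin plus two outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 0.0)]
idx = mode_estimate(points, k=2)
```

This baseline evaluates all n(n-1) distances, which is the cost an adaptive querying scheme tries to reduce when each distance must be obtained from a noisy oracle.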
A Survey of Embedding Space Alignment Methods for Language and Knowledge Graphs
Kalinowski, Alexander, An, Yuan
The purpose of this survey is to explore the core techniques and categorizations of methods for aligning low-dimensional embedding spaces. Projecting sparse, high-dimensional data sets into compact, lower-dimensional spaces allows not only for a significant reduction in storage space, but also builds dense representations with many applications. These embedding spaces have become a staple in representation learning ever since their heralded application to natural language in a technique called word2vec, and have replaced traditional machine learning features as easy-to-build, high-quality representations of the source objects. There has been a wealth of study around techniques for embedding objects, such as images, natural language and knowledge graphs, and many research agendas focused on mapping one embedding space to another, either for the purpose of aligning and unifying to a common space, applications to joint downstream tasks or ease of transfer learning. In order to fully leverage these dense representations and translate them across domains and problem spaces, techniques for establishing alignments between them must be developed and understood.
Chile's New Interdisciplinary Institute for Foundational Research on Data
The Millennium Institute for Foundational Research on Data (IMFD) started its operations in June 2018, funded by the Millennium Science Initiative of the Chilean National Agency of Research and Development. IMFD is a joint initiative led by Universidad de Chile and Universidad Católica de Chile, with the participation of five other Chilean universities: Universidad de Concepción, Universidad de Talca, Universidad Técnica Federico Santa María, Universidad Diego Portales, and Universidad Adolfo Ibáñez. IMFD aims to be a reference center in Latin America related to state-of-the-art research on the foundational problems with data, as well as its applications to tackling diverse issues ranging from scientific challenges to complex social problems. As tasks of this kind are interdisciplinary by nature, IMFD gathers a large number of researchers in several areas that include traditional computer science areas such as data management, Web science, algorithms and data structures, privacy and verification, information retrieval, data mining, machine learning, and knowledge representation, as well as some areas from other fields, including statistics, political science, and communication studies. IMFD currently hosts 36 researchers, seven postdoctoral fellows, and more than 100 students.
Keyphrase Extraction with Dynamic Graph Convolutional Networks and Diversified Inference
Zhang, Haoyu, Long, Dingkun, Xu, Guangwei, Xie, Pengjun, Huang, Fei, Wang, Ji
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document. Recently, the Sequence-to-Sequence (Seq2Seq) based generative framework has been widely used in the KE task, and it has obtained competitive performance on various benchmarks. The main challenges of Seq2Seq methods lie in acquiring an informative latent document representation and better modeling the compositionality of the target keyphrase set, both of which directly affect the quality of generated keyphrases. In this paper, we propose to adopt Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously. Concretely, we explore integrating dependency trees with GCN for latent representation learning. Moreover, the graph structure in our model is dynamically modified during the learning process according to the generated keyphrases. As a result, our approach is able to explicitly learn the relations within the keyphrase collection and guarantee the information interchange between encoder and decoder in both directions. Extensive experiments on various KE benchmark datasets demonstrate the effectiveness of our approach.
Google Paid Apple Billions To Dominate Search On iPhones, Justice Department Says
The Justice Department says Google CEO Sundar Pichai (left) met privately with Apple chief Tim Cook in 2018 to discuss how their two companies could collaborate. Buried on page 36 of the Justice Department lawsuit accusing Google of abusing its monopoly power is this remarkable figure: $8 billion to $12 billion. That's the hefty sum Google allegedly paid Apple for one of the most prized pieces of real estate in the world of online search: default status on iPhones and all other Apple devices. Justice Department investigators say Apple, which does not have its own search engine, hammered out a multiyear deal making Google the default search engine on all iPhones and other Apple products.