Information Retrieval


Google is threatening to pull its search engine out of Australia

Washington Post - Technology News

Google and Facebook have been in a long-running fight with Australian politicians, regulators and media companies over whether they should pay news organizations for showing their stories in search results. The battle reached a new level of intensity when a Google executive threatened to pull out of the country during testimony before an Australian Senate committee.


Google's threat to withdraw its search engine from Australia is chilling to anyone who cares about democracy | Peter Lewis

The Guardian

Google's testimony to an Australian Senate committee on Friday, threatening to withdraw its search services from Australia, is chilling to anyone who cares about democracy. It marks the latest escalation in the globally significant effort to regulate the way the big tech platforms use news content to drive their advertising businesses, an arrangement that has had a catastrophic impact on news media across the world. The news media bargaining code, which would require Google and Facebook to negotiate a fair price for the use of news content, is the product of an 18-month process driven by the competition regulator. That legislation is currently before the Australian parliament, where a Senate committee is taking final submissions from interested parties. The Google bombshell makes explicit what had been a slowly escalating threat: that a binding code would not be tenable.


Google threatens to withdraw search engine from Australia

BBC News

The tech giant says it will remove its main search function from Australia if the country passes a proposed law.


DuckDuckGo search engine increased its traffic by 62% in 2020 as users seek privacy

USATODAY - Tech Top Stories

DuckDuckGo, a search engine focused on privacy, increased its average number of daily searches by 62% in 2020 as users sought alternatives that limit data tracking. The search engine, founded in 2008, processed nearly 23.7 billion search queries in 2020, according to its traffic page. On Jan. 11, DuckDuckGo reached its highest number of search queries in a single day, with a total of 102,251,307. DuckDuckGo does not track user searches or share personal data with third-party companies. "People are coming to us because they want more privacy, and it's generally spreading through word of mouth," Kamyl Bazbaz, DuckDuckGo vice president of communications, told USA TODAY.


Graph integration of structured, semistructured and unstructured data for data journalism

arXiv.org Artificial Intelligence

Consider a journalist asking which elected officials are connected to which companies. Such a query can currently be answered only at a high cost in human effort, by inspecting, e.g., a JSON list of Assemblée elected officials (available from NosDeputes.fr) and manually connecting the names with those found in a national registry of companies. This considerable effort may still miss connections that could be found if one added information about politicians' and business people's spouses, information sometimes available in public knowledge bases such as DBPedia, or in journalists' notes. No single query language can be used on such heterogeneous data; instead, we study methods to query the corpus by specifying some keywords and asking for all the connections that exist, in one or across several data sources, between these keywords. This problem has been studied under the name of keyword search over structured data, in particular for relational databases [49, 27], XML documents [24, 33], and RDF graphs [30, 16]. However, most of these works assume a single source of data, in which connections among nodes are clearly identified. When authors have considered several data sources [31], they still assumed that each query answer comes from a single data source. In contrast, the ConnectionLens system [10] answers keyword search queries over arbitrary combinations of datasets and heterogeneous data models, independently produced by actors unaware of each other's existence.
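
As a concrete illustration of the cross-source connection search described above, here is a minimal sketch in Python using toy data and the networkx library; it is not the ConnectionLens implementation, which handles many data models and far richer extraction. Entities from two hypothetical sources are merged into one graph, and a query returns the shortest connection between nodes matching two keywords.

```python
# Minimal sketch of keyword-connection search over a heterogeneous graph,
# in the spirit of ConnectionLens (toy data, not the paper's system).
import networkx as nx

G = nx.Graph()

# Source 1: elected officials (e.g., extracted from a JSON export).
officials = [{"name": "A. Dupont", "party": "X"}]
for o in officials:
    G.add_node(o["name"], kind="person", source="officials.json")

# Source 2: company registry (e.g., extracted from a CSV of filings).
registry = [{"company": "Acme SARL", "director": "A. Dupont"}]
for r in registry:
    G.add_node(r["company"], kind="company", source="registry.csv")
    G.add_edge(r["company"], r["director"], label="director")

def keyword_connections(graph, kw1, kw2):
    """Return the shortest path between nodes whose labels contain the keywords."""
    hits1 = [n for n in graph if kw1.lower() in str(n).lower()]
    hits2 = [n for n in graph if kw2.lower() in str(n).lower()]
    paths = []
    for a in hits1:
        for b in hits2:
            try:
                paths.append(nx.shortest_path(graph, a, b))
            except nx.NetworkXNoPath:
                pass
    return min(paths, key=len) if paths else None

print(keyword_connections(G, "dupont", "acme"))
# ['A. Dupont', 'Acme SARL'] -- a connection spanning the two merged sources
```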


Nested Named Entity Recognition with Partially-Observed TreeCRFs

arXiv.org Artificial Intelligence

Named entity recognition (NER) is a well-studied task in natural language processing. However, the widely used sequence-labeling framework struggles to detect entities with nested structures. In this work, we view nested NER as constituency parsing with partially-observed trees and model it with partially-observed TreeCRFs. Specifically, we view all labeled entity spans as observed nodes in a constituency tree, and other spans as latent nodes. With the TreeCRF we achieve a uniform way to jointly model the observed and the latent nodes. To compute the probability of partial trees with partial marginalization, we propose a variant of the Inside algorithm, the Masked Inside algorithm, that supports different inference operations for different nodes (evaluation for the observed, marginalization for the latent, and rejection for nodes incompatible with the observed) with an efficient parallelized implementation, thus significantly speeding up training and inference. Experiments show that our approach achieves state-of-the-art (SOTA) F1 scores on the ACE2004 and ACE2005 datasets, and shows performance comparable to SOTA models on the GENIA dataset. Our implementation is available at https://github.com/FranxYao/Partially-Observed-TreeCRFs.
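
To make the Masked Inside idea concrete, here is a minimal sketch for the simplified case of unlabeled binary trees over a fixed sentence, with toy span scores; the paper's version handles labeled spans and runs batched on GPU. In this unlabeled setting, rejection (a -inf mask on spans that cross an observed span) is sufficient to force the observed spans into every surviving tree, and latent spans are marginalized as usual.

```python
# Minimal sketch of the "Masked Inside" idea for partially-observed TreeCRFs
# (unlabeled binary trees, toy scores; not the paper's parallel implementation).
import math

NEG_INF = float("-inf")

def logsumexp(xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def inside(n, score, mask):
    """Inside algorithm over spans (i, j), 0 <= i < j <= n.
    score[(i, j)]: log potential of span; mask[(i, j)]: 0.0 (keep) or -inf (reject)."""
    chart = {}
    for i in range(n):  # width-1 spans (leaves)
        chart[(i, i + 1)] = score[(i, i + 1)] + mask[(i, i + 1)]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            splits = [chart[(i, k)] + chart[(k, j)] for k in range(i + 1, j)]
            chart[(i, j)] = score[(i, j)] + mask[(i, j)] + logsumexp(splits)
    return chart[(0, n)]

# Toy example: 3 words, span (0, 2) is an observed entity.
n = 3
score = {(i, j): 0.0 for i in range(n) for j in range(i + 1, n + 1)}
observed = {(0, 2)}

def crosses(a, b):
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

# Reject any span crossing an observed span; everything else stays latent.
mask = {s: (NEG_INF if any(crosses(s, o) for o in observed) else 0.0)
        for s in score}

log_z_partial = inside(n, score, mask)                   # trees containing (0, 2)
log_z_full = inside(n, score, {s: 0.0 for s in score})   # all binary trees
print("log p(observed spans) =", log_z_partial - log_z_full)
# -0.693... = log(1/2): one of the two binary trees over 3 words contains (0, 2)
```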


Distant-Supervised Slot-Filling for E-Commerce Queries

arXiv.org Artificial Intelligence

Slot-filling refers to the task of annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). These characteristics can then be used by a search engine to return results that better match the query's product intent. Traditional methods for slot-filling require training data with ground-truth slot annotations. However, generating such labeled data, especially in e-commerce, is expensive and time-consuming because the number of slots grows as new products are added. In this paper, we present distant-supervised probabilistic generative models that require no manual annotation. The proposed approaches leverage readily available historical query logs and the purchases those queries led to, and also exploit co-occurrence information among the slots to identify intended product characteristics. We evaluate our approaches by considering how they affect retrieval performance, as well as how well they classify the slots. In terms of retrieval, our approaches achieve better ranking performance (up to 156% improvement) over Okapi BM25. Moreover, our approach that leverages co-occurrence information outperforms the one that does not on both the retrieval and slot-classification tasks.
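
A minimal sketch of the distant-supervision step may help: query tokens are aligned to the attribute values of the product the query led to, producing noisy slot labels without manual annotation. The toy data and exact-match rule below are illustrative assumptions; the paper's probabilistic generative models are considerably more elaborate.

```python
# Minimal sketch of distant supervision from query logs and purchases
# (toy data; the paper's generative models go well beyond this rule).
query_log = [
    ("nike red running shoes", {"brand": "Nike", "color": "Red",
                                "product_type": "Running Shoes"}),
]

def distant_labels(query, purchased_attrs):
    """Label each query token with the slot whose attribute value contains it."""
    labels = []
    for tok in query.split():
        slot = next((s for s, v in purchased_attrs.items()
                     if tok.lower() in v.lower().split()), "O")
        labels.append((tok, slot))
    return labels

for q, attrs in query_log:
    print(distant_labels(q, attrs))
# [('nike', 'brand'), ('red', 'color'),
#  ('running', 'product_type'), ('shoes', 'product_type')]
```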


Discriminative Pre-training for Low Resource Title Compression in Conversational Grocery

arXiv.org Artificial Intelligence

The ubiquity of smart voice assistants has made conversational shopping commonplace. This is especially true for low-consideration segments like grocery. A central problem in conversational grocery is the automatic generation of short product titles that can be read out quickly during a conversation. Several supervised models proposed in the literature leverage manually labeled datasets and additional product features to generate short titles automatically. However, obtaining large amounts of labeled data is expensive, and most grocery item pages are not as feature-rich as those in other categories. To address this problem we propose a pre-training based solution that makes use of unlabeled data to learn contextual product representations, which can then be fine-tuned to obtain better title compression even in a low-resource setting. We use a self-attentive BiLSTM encoder network with a time-distributed softmax layer for the title compression task. We overcome the vocabulary-mismatch problem by using a hybrid embedding layer that combines pre-trained word embeddings with trainable character-level convolutions. We pre-train this network as a discriminator on a replaced-token detection task over a large number of unlabeled grocery product titles. Finally, we fine-tune this network, without any modifications, on a small labeled dataset for the title compression task. Experiments on Walmart's online grocery catalog show our model achieves performance comparable to state-of-the-art models like BERT and XLNet. When fine-tuned on all of the available training data, our model attains an F1 score of 0.8558, trailing the best-performing model, BERT-Base, by only 2.78% and XLNet by only 0.28%, while using 55 times fewer parameters than either. Further, when allowed to fine-tune on only 5% of the training data, our model outperforms BERT-Base by 24.3% in F1 score.
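
The pre-training step resembles ELECTRA-style replaced-token detection. The sketch below, with a toy vocabulary and random (rather than generator-produced) replacements, trains a BiLSTM discriminator to flag corrupted tokens in unlabeled product titles; the paper's hybrid word/character embeddings and self-attention are omitted for brevity, so treat this as an assumed simplification.

```python
# Minimal sketch of replaced-token detection pre-training with a BiLSTM
# discriminator (toy vocabulary and sizes, not the paper's configuration).
import random
import torch
import torch.nn as nn

vocab = ["<pad>", "organic", "whole", "milk", "1", "gallon", "banana", "fresh"]
stoi = {w: i for i, w in enumerate(vocab)}

def corrupt(tokens, rate=0.3):
    """Randomly replace tokens; return corrupted ids and 0/1 'replaced' labels.
    (Toy: a random replacement may coincide with the original token.)"""
    ids, labels = [], []
    for t in tokens:
        if random.random() < rate:
            ids.append(random.randrange(1, len(vocab)))  # random replacement
            labels.append(1.0)
        else:
            ids.append(stoi[t])
            labels.append(0.0)
    return torch.tensor([ids]), torch.tensor([labels])

class Discriminator(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, 1)  # per-token "was replaced?" logit

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.head(h).squeeze(-1)

model = Discriminator(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

title = "organic whole milk 1 gallon".split()
for step in range(100):  # toy pre-training loop over one unlabeled title
    ids, labels = corrupt(title)
    loss = loss_fn(model(ids), labels)
    opt.zero_grad(); loss.backward(); opt.step()
# The pre-trained encoder would then be fine-tuned with a per-token softmax
# (keep / drop) on the small labeled title-compression set.
```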


An End-to-End Solution for Named Entity Recognition in eCommerce Search

arXiv.org Artificial Intelligence

Named entity recognition (NER) is a critical step in modern search query understanding. In the domain of eCommerce, identifying the key entities, such as brand and product type, can help a search engine retrieve relevant products and therefore offer an engaging shopping experience. Recent research shows promising results on shared benchmark NER tasks using deep learning methods, but there are still unique challenges in industry regarding domain knowledge, training data, and model production. This paper demonstrates an end-to-end solution to address these challenges. The core of our solution is a novel model training framework, "TripleLearn", which iteratively learns from three separate training datasets instead of one training set, as is traditionally done. Using this approach, the best model lifts the F1 score from 69.5 to 93.3 on the holdout test data. In our offline experiments, TripleLearn improved model performance compared to traditional training approaches that use a single set of training data. Moreover, in an online A/B test, we saw significant improvements in user engagement and revenue conversion. The model has been live on homedepot.com for more than 9 months, boosting search conversions and revenue. Beyond our application, the TripleLearn framework, as well as the end-to-end process, is model-independent and problem-independent, so it can be generalized to more industrial applications, especially in eCommerce, where similar data foundations and problems exist.
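
The abstract does not detail how TripleLearn alternates among its three training sets, but the overall pattern can be sketched as follows; the round structure, the example set semantics, and the train_epoch/evaluate helpers are placeholders rather than the paper's specification.

```python
# Minimal sketch of the "TripleLearn" training pattern described above:
# iterate over three separate training sets instead of a single one.

def train_epoch(model, dataset):
    """Placeholder: one pass of gradient updates over `dataset`."""
    return model

def evaluate(model, dev_set):
    """Placeholder: compute F1 on a held-out set."""
    return 0.0

def triple_learn(model, datasets, dev_set, rounds=3):
    """`datasets` holds three training sets, e.g. (curated, augmented, feedback)."""
    best_model, best_f1 = model, float("-inf")
    for _ in range(rounds):
        for ds in datasets:                # one pass per dataset, in sequence
            model = train_epoch(model, ds)
        f1 = evaluate(model, dev_set)      # keep the best checkpoint across rounds
        if f1 > best_f1:
            best_model, best_f1 = model, f1
    return best_model
```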