Information Retrieval
BERT based patent novelty search by training claims to their own description
Freunek, Michael, Bodmer, André
In this paper we present a method to concatenate patent claims to their own description. By applying this method, BERT trains suitable descriptions for claims. Such a trained BERT (claim-to-description-BERT) could be able to identify novelty-relevant descriptions for patents. In addition, we introduce a new scoring scheme, relevance scoring or novelty scoring, to process the output of BERT in a meaningful way. We tested the method on patent applications by training BERT on the first claims of patents and corresponding descriptions. BERT's output has been processed according to the relevance score and the results compared with the cited X documents in the search reports. The test showed that BERT has scored some of the cited X documents as highly relevant.
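A minimal sketch of how claim/description training pairs might be assembled for a BERT-style sentence-pair classifier. The abstract does not specify the input format or how negative examples are chosen, so pairing each claim with another patent's description as a negative is an assumption made here for illustration. `build_pairs` and the sample texts are hypothetical; a BERT tokenizer would typically join each pair as a two-segment input.

```python
import random
from typing import Dict, List, Tuple


def build_pairs(
    patents: Dict[str, Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Return (claim, description, label) triples.

    `patents` maps a patent id to (first_claim, description).
    Label 1 pairs a claim with its own description. Label 0 pairs it with
    a description from a different patent; this negative sampling scheme
    is an assumption, not something stated in the abstract.
    """
    rng = random.Random(seed)
    ids = sorted(patents)
    pairs = []
    for pid in ids:
        claim, description = patents[pid]
        pairs.append((claim, description, 1))
        other = rng.choice([i for i in ids if i != pid])
        pairs.append((claim, patents[other][1], 0))
    return pairs


# Hypothetical toy corpus with two patents.
patents = {
    "P1": ("A valve comprising a spring.", "The valve uses a coil spring."),
    "P2": ("A lamp comprising an LED.", "The lamp contains an LED array."),
}
pairs = build_pairs(patents)
```

With two patents this yields four triples, two positive and two negative. The relevance scoring described in the abstract is not reproduced here, since its definition is not given.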
Brave Search is a privacy-first search engine
Browser privacy is a big deal, as Google and other companies use your search data to serve you ads while you surf the web. While most users accept that tradeoff, others believe strongly in maintaining their own data privacy. If you're one of these, Brave Software can help. On Wednesday, the company said it's launching a search engine to compete with Google and Bing, with privacy as its first priority. Brave is buying Tailcat, an open search engine, and will add it to what it's calling Brave Search, a forthcoming search engine.
Brave is developing its own privacy-focused search engine
Privacy-focused browser Brave is working on its own search engine. It has bought Tailcat, an open-source engine created by a team who worked on the defunct anti-tracking browser and search engine Cliqz, to power Brave Search. The company will allow others to use Brave Search tech to build their own search engines. Brave says the search engine will provide an alternative to Google Search and Chrome. It's developing Brave Search using the same principles as its browser, which now has more than 25 million monthly active users.
Data Augmentation for Abstractive Query-Focused Multi-Document Summarization
Pasunuru, Ramakanth, Celikyilmaz, Asli, Galley, Michel, Xiong, Chenyan, Zhang, Yizhe, Bansal, Mohit, Gao, Jianfeng
The progress in Query-focused Multi-Document Summarization (QMDS) has been limited by the lack of sufficient large-scale, high-quality training datasets. We present two QMDS training datasets, which we construct using two data augmentation methods: (1) transferring the commonly used single-document CNN/Daily Mail summarization dataset to create the QMDSCNN dataset, and (2) mining search-query logs to create the QMDSIR dataset. These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries. To cover both these real summary and query aspects, we build abstractive end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets. We also introduce new hierarchical encoders that enable a more efficient encoding of the query together with multiple documents. Empirical results demonstrate that our data augmentation and encoding methods outperform baseline models on automatic metrics, as well as on human evaluations along multiple attributes.
Making Enterprise Search Personal - Coruzant Technologies
Knowledge management providers are now looking to build systems that are more tailored to the needs of their customers. In technical parlance, this is known as the behavioral model for information retrieval system design. With these models, users search for a product or service, and the results often include related offerings that are better matched to the user's intent. Homing in on this kind of personalization is at the crux of the new experience economy of customer service and the forefront of Enterprise Search advancements. One of the key requirements for forward-looking knowledge management is the capacity to extract data from the typically hundreds or thousands of data silos scattered throughout a company and crawl them to create meaningful insights.
Building a Complete AI Based Search Engine with Elasticsearch, Kubeflow and Katib
Building search systems is hard. Preparing them to work with machine learning is really hard. Developing a complete search engine framework integrated with AI is really, really hard. In this post, we'll build a search engine from scratch and discuss how to further optimize results by adding a machine learning layer using Kubeflow and Katib. This new layer will be capable of retrieving results that take the user's context into account, and it is the main focus of this article. As we'll see, thanks to Kubeflow and Katib, the final result is quite simple, efficient and easy to maintain. To understand the concepts in practice, we'll implement the system hands-on. As it's built on top of Kubernetes, you can use any infrastructure you like (given appropriate adaptations).
Approximate Knowledge Graph Query Answering: From Ranking to Binary Classification
van Bakel, Ruud, Aleksiev, Teodor, Daza, Daniel, Alivanistos, Dimitrios, Cochez, Michael
Large, heterogeneous datasets are characterized by missing or even erroneous information. This is more evident when they are the product of community effort or automatic fact extraction methods from external sources, such as text. A special case of the aforementioned phenomenon can be seen in knowledge graphs, where this mostly appears in the form of missing or incorrect edges and nodes. Structured querying on such incomplete graphs will result in incomplete sets of answers, even if the correct entities exist in the graph, since one or more edges needed to match the pattern are missing. To overcome this problem, several algorithms for approximate structured query answering have been proposed. Inspired by modern Information Retrieval metrics, these algorithms produce a ranking of all entities in the graph, and their performance is further evaluated based on how high in this ranking the correct answers appear. In this work we take a critical look at this way of evaluation. We argue that performing a ranking-based evaluation is not sufficient to assess methods for complex query answering. To solve this, we introduce Message Passing Query Boxes (MPQB), which takes binary classification metrics back into use and shows the effect this has on the recently proposed query embedding method MPQE.
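The difference between ranking-based and classification-based evaluation can be illustrated with a small sketch. The entity names, scores, and the 0.5 acceptance threshold below are made up for illustration and do not come from the paper; the functions are hypothetical helpers, not the MPQB or MPQE implementation.

```python
from typing import Dict, Set, Tuple


def hits_at_k(scores: Dict[str, float], answers: Set[str], k: int) -> float:
    """Fraction of correct answers that appear among the k top-scored entities."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & answers) / len(answers)


def precision_recall_f1(
    predicted: Set[str], answers: Set[str]
) -> Tuple[float, float, float]:
    """Binary classification metrics over an explicit predicted answer set."""
    true_pos = len(predicted & answers)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(answers) if answers else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical scores for five entities; "a" and "b" are the correct answers.
scores = {"a": 0.9, "b": 0.8, "c": 0.79, "d": 0.78, "e": 0.1}
answers = {"a", "b"}

# Ranking view: both answers are ranked on top, so hits@2 is perfect.
ranking_quality = hits_at_k(scores, answers, k=2)

# Classification view: an assumed threshold of 0.5 also accepts "c" and "d".
predicted = {entity for entity, s in scores.items() if s >= 0.5}
precision, recall, f1 = precision_recall_f1(predicted, answers)
```

In this toy case the ranking metric is perfect while precision is only 0.5, because the ranking says nothing about where the answer set ends. That is the kind of gap a ranking-only evaluation cannot expose.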
Unsupervised Meta Learning for One Shot Title Compression in Voice Commerce
Product title compression for voice and mobile commerce is a well-studied problem with several supervised models proposed so far. However, these models have two major limitations: they are not designed to generate compressions dynamically based on cues at inference time, and they do not transfer well to different categories at test time. To address these shortcomings, we model title compression as a meta-learning problem in which we ask: can we learn a title compression model given only one example compression? We adopt an unsupervised approach to meta-training by proposing an automatic task generation algorithm that models the observed label generation process as the outcome of four unobserved processes. We create parameterized approximations to each of these four latent processes to get a principled way of generating random compression rules, which are treated as different tasks. For our main meta learner, we use two models, M1 and M2. M1 is a task-agnostic embedding generator whose output feeds into M2, which is a task-specific label generator. We pre-train M1 on a novel unsupervised segment rank prediction task that allows us to treat M1 as a segment generator that also learns to rank segments during the meta-training process. Our experiments on 16000 crowd-generated meta-test examples show that our unsupervised meta-training regime is able to acquire a learning algorithm for different tasks after seeing only one example for each task. Further, we show that our model, trained end to end as a black-box meta learner, outperforms non-parametric approaches. Our best model obtains an F1 score of 0.8412, beating the baseline by a large margin of 25 F1 points.
Scaling Creative Inspiration with Fine-Grained Functional Facets of Product Ideas
Hope, Tom, Tamari, Ronen, Kang, Hyeonsu, Hershcovich, Daniel, Chan, Joel, Kittur, Aniket, Shahaf, Dafna
Web-scale repositories of products, patents and scientific papers offer an opportunity for creating automated systems that scour millions of ideas and assist users in discovering inspirations and solutions. Yet the common representation of ideas is in the form of raw textual descriptions, lacking important structure that is required for supporting creative innovation. Prior work has pointed to the importance of functional structure -- capturing the mechanisms and purposes of inventions -- for allowing users to discover structural connections across ideas and creatively adapt existing technologies. However, the use of functional representations was either coarse and limited in expressivity, or dependent on curated knowledge bases with poor coverage and significant manual effort from users. To help bridge this gap and unlock the potential of large-scale idea mining, we propose a novel computational representation that automatically breaks up products into fine-grained functional facets. We train a model to extract these facets from a challenging real-world corpus of invention descriptions, and represent each product as a set of facet embeddings. We design similarity metrics that support granular matching between functional facets across ideas, and use them to build a novel functional search capability that enables expressive queries for mechanisms and purposes. We construct a graph capturing hierarchical relations between purposes and mechanisms across an entire corpus of products, and use the graph to help problem-solvers explore the design space around a focal problem and view related problem perspectives. In empirical user studies, our approach leads to a significant boost in search accuracy and in the quality of creative inspirations, outperforming strong baselines and state-of-the-art representations of product texts by 50-60%.
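One way to picture matching between sets of facet embeddings is sketched below. The abstract does not define its similarity metrics, so the choice here (average, over query facets, of the best cosine match among a product's facets) is an assumption for illustration only, and `facet_set_similarity` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def facet_set_similarity(
    query_facets: List[Vector], product_facets: List[Vector]
) -> float:
    """Average, over query facets, of the best cosine match among product facets."""
    if not query_facets or not product_facets:
        return 0.0
    best = [max(cosine(q, p) for p in product_facets) for q in query_facets]
    return sum(best) / len(best)


# Toy 2-D "embeddings": the query has one facet (say, a purpose).
query = [(1.0, 0.0)]
product_a = [(1.0, 0.0), (0.0, 1.0)]  # one facet matches the query exactly
product_b = [(0.0, 1.0)]              # no facet points in the query direction

score_a = facet_set_similarity(query, product_a)
score_b = facet_set_similarity(query, product_b)
```

Scoring per facet rather than per whole document means a product can match on one purpose or mechanism even when the rest of its description is unrelated.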
Information Ranking Using Optimum-Path Forest
Ascenção, Nathalia Q., Afonso, Luis C. S., Colombo, Danilo, Oliveira, Luciano, Papa, João P.
The task of learning to rank has been widely studied by the machine learning community, mainly due to its use and great importance in information retrieval, data mining, and natural language processing. Therefore, ranking accurately and learning to rank are crucial tasks. Context-Based Information Retrieval systems have been of great importance to reduce the effort of finding relevant data. Such systems have evolved by using machine learning techniques to improve their results, but they are mainly dependent on user feedback. Although information retrieval has been addressed in different works along with classifiers based on Optimum-Path Forest (OPF), these have so far not been applied to the learning-to-rank task. Therefore, the main contribution of this work is to evaluate classifiers based on Optimum-Path Forest in such a context. Experiments were performed considering the image retrieval and ranking scenarios, and the performance of OPF-based approaches was compared to the well-known SVM-Rank pairwise technique and a baseline based on distance calculation. The experiments showed competitive results concerning precision and outperformed traditional techniques in terms of computational load.
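A minimal sketch of what a distance-based ranking baseline can look like: candidates are ordered by their distance to the query's feature vector, and the ranking is scored with precision at k. The abstract does not say which distance or features were used, so Euclidean distance, the toy vectors, and the helper names are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Set


def rank_by_distance(
    query: Sequence[float], items: Dict[str, Sequence[float]]
) -> List[str]:
    """Order item ids from nearest to farthest from the query vector."""
    return sorted(items, key=lambda item_id: math.dist(query, items[item_id]))


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Share of the first k ranked items that are relevant."""
    return sum(1 for item_id in ranking[:k] if item_id in relevant) / k


# Hypothetical 2-D feature vectors for four images.
query = (0.0, 0.0)
items = {
    "img1": (0.1, 0.0),
    "img2": (1.0, 1.0),
    "img3": (0.0, 0.2),
    "img4": (3.0, 0.0),
}
relevant = {"img1", "img2"}

ranking = rank_by_distance(query, items)
p_at_2 = precision_at_k(ranking, relevant, k=2)
```

Such a baseline needs no training, which makes it a natural reference point for learned rankers such as SVM-Rank or the OPF-based classifiers.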