AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Nested Named Entity Recognition with Partially-Observed TreeCRFs

Fu, Yao, Tan, Chuanqi, Chen, Mosha, Huang, Songfang, Huang, Fei

arXiv.org Artificial Intelligence, Dec-15-2020

Named entity recognition (NER) is a well-studied task in natural language processing. However, the widely used sequence labeling framework has difficulty detecting entities with nested structures. In this work, we view nested NER as constituency parsing with partially-observed trees and model it with partially-observed TreeCRFs. Specifically, we view all labeled entity spans as observed nodes in a constituency tree, and other spans as latent nodes. With the TreeCRF we achieve a uniform way to jointly model the observed and the latent nodes. To compute the probability of partial trees with partial marginalization, we propose a variant of the Inside algorithm, the Masked Inside algorithm, that supports different inference operations for different nodes (evaluation for the observed, marginalization for the latent, and rejection for nodes incompatible with the observed) with an efficient parallelized implementation, thus significantly speeding up training and inference. Experiments show that our approach achieves state-of-the-art (SOTA) F1 scores on the ACE2004 and ACE2005 datasets and shows performance comparable to SOTA models on the GENIA dataset. Our approach is implemented at: https://github.com/FranxYao/Partially-Observed-TreeCRFs.
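The three node operations named in the abstract (evaluation, marginalization, rejection) can be illustrated with a small, non-parallel sketch of an inside recursion over binary trees. This is not the authors' implementation, which is available at the repository linked above. The function name `masked_inside`, the `score` callback, and the choice to let latent spans marginalize over every label are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that ignores -inf entries."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))

def crosses(i, j, a, b):
    """True if span [i, j) partially overlaps span [a, b)."""
    return i < a < j < b or a < i < b < j

def masked_inside(n, num_labels, score, observed=None):
    """Log-sum of scores of all labeled binary trees over n tokens
    that are consistent with the observed spans.

    score(i, j, label) -> log potential of span [i, j) with that label
                          (hypothetical callback, assumed for this sketch)
    observed           -> dict mapping (start, end) to the observed label
    """
    observed = observed or {}
    alpha = {}
    for width in range(1, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            if any(crosses(i, j, a, b) for (a, b) in observed):
                # Rejection: the span is incompatible with an observed span.
                alpha[i, j] = NEG_INF
                continue
            if (i, j) in observed:
                # Evaluation: only the observed label contributes.
                node = score(i, j, observed[i, j])
            else:
                # Marginalization: sum over all labels (a simplification).
                node = logsumexp([score(i, j, l) for l in range(num_labels)])
            if width == 1:
                alpha[i, j] = node
            else:
                splits = [alpha[i, k] + alpha[k, j] for k in range(i + 1, j)]
                alpha[i, j] = node + logsumexp(splits)
    return alpha[0, n]

# Probability of a partial tree = exp(masked score - unmasked score).
flat = lambda i, j, label: 0.0  # every potential equals 1
log_z = masked_inside(3, 2, flat)
log_partial = masked_inside(3, 2, flat, observed={(0, 2): 0})
prob = math.exp(log_partial - log_z)
```

With uniform potentials on three tokens and two labels, the unconstrained sum covers 64 labeled trees and the constrained sum covers 16, so `prob` is 0.25.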

computational linguistic, proceedings, treecrf, (14 more...)

arXiv.org Artificial Intelligence

2012.08478

Country:

Asia > Indonesia > Java > Jakarta > Jakarta (0.04)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.91)

Distant-Supervised Slot-Filling for E-Commerce Queries

Manchanda, Saurav, Sharma, Mohit, Karypis, George

arXiv.org Artificial Intelligence, Dec-15-2020

Slot-filling refers to the task of annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). These characteristics can then be used by a search engine to return results that better match the query's product intent. Traditional methods for slot-filling require the availability of training data with ground truth slot-annotation information. However, generating such labeled data, especially in e-commerce, is expensive and time-consuming because the number of slots increases as new products are added. In this paper, we present distant-supervised probabilistic generative models that require no manual annotation. The proposed approaches leverage the readily available historical query logs and the purchases that these queries led to, and also exploit co-occurrence information among the slots in order to identify intended product characteristics. We evaluate our approaches by considering how they affect retrieval performance, as well as how well they classify the slots. In terms of retrieval, our approaches achieve better ranking performance (up to 156%) over Okapi BM25. Moreover, our approach that leverages co-occurrence information leads to better performance than the one that does not on both the retrieval and slot classification tasks.

candidate slot-set, product category, query, (13 more...)

arXiv.org Artificial Intelligence

2012.08134

Country:

North America > United States > Minnesota (0.04)
North America > United States > Maryland > Montgomery County > Gaithersburg (0.04)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.83)

Industry:

Information Technology > Services > e-Commerce Services (0.72)
Retail (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(3 more...)

Discriminative Pre-training for Low Resource Title Compression in Conversational Grocery

Mukherjee, Snehasish, Sayapaneni, Phaniram, Subramanya, Shankar

arXiv.org Artificial Intelligence, Dec-12-2020

The ubiquity of smart voice assistants has made conversational shopping commonplace. This is especially true for low consideration segments like grocery. A central problem in conversational grocery is the automatic generation of short product titles that can be read out fast during a conversation. Several supervised models have been proposed in the literature that leverage manually labeled datasets and additional product features to generate short titles automatically. However, obtaining large amounts of labeled data is expensive and most grocery item pages are not as feature-rich as other categories. To address this problem we propose a pre-training based solution that makes use of unlabeled data to learn contextual product representations which can then be fine-tuned to obtain better title compression even in a low resource setting. We use a self-attentive BiLSTM encoder network with a time distributed softmax layer for the title compression task. We overcome the vocabulary mismatch problem by using a hybrid embedding layer that combines pre-trained word embeddings with trainable character level convolutions. We pre-train this network as a discriminator on a replaced-token detection task over a large number of unlabeled grocery product titles. Finally, we fine-tune this network, without any modifications, with a small labeled dataset for the title compression task. Experiments on Walmart's online grocery catalog show our model achieves performance comparable to state-of-the-art models like BERT and XLNet. When fine-tuned on all of the available training data, our model attains an F1 score of 0.8558, which lags the best-performing model, BERT-Base, by only 2.78% and XLNet by only 0.28%, while using 55 times fewer parameters than both. Further, when allowed to fine-tune on only 5% of the training data, our model outperforms BERT-Base by 24.3% in F1 score.
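To make the replaced-token detection pre-training task concrete, here is a minimal sketch of how one training example could be built from an unlabeled product title. The abstract does not say how replacement tokens are chosen, so uniform random sampling from a vocabulary is an assumption, as are the function name `make_rtd_example` and the `replace_prob` parameter.

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=None):
    """Corrupt a token sequence and label each position.

    Returns (corrupted_tokens, labels), where labels[i] is 1 if the token
    at position i was replaced and 0 if it is the original token.
    Replacements are sampled uniformly from `vocab` (an assumption).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        # Exclude the original token so a "replacement" always differs.
        candidates = [v for v in vocab if v != tok]
        if candidates and rng.random() < replace_prob:
            corrupted.append(rng.choice(candidates))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# Hypothetical title and vocabulary, for illustration only.
title = "organic whole milk one gallon".split()
vocab = ["organic", "whole", "milk", "one", "gallon", "bread", "eggs", "pack"]
corrupted, labels = make_rtd_example(title, vocab, replace_prob=0.5, seed=0)
```

A discriminator is then trained to predict `labels` from `corrupted`, one binary decision per token, which requires no manual annotation.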

computational linguistic, proceedings, product title, (11 more...)

arXiv.org Artificial Intelligence

2012.06943

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
(18 more...)

Genre:

Overview (0.94)
Research Report (0.70)

Industry: Retail (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

An End-to-End Solution for Named Entity Recognition in eCommerce Search

Cheng, Xiang, Bowden, Mitchell, Bhange, Bhushan Ramesh, Goyal, Priyanka, Packer, Thomas, Javed, Faizan

arXiv.org Artificial Intelligence, Dec-10-2020

Named entity recognition (NER) is a critical step in modern search query understanding. In the domain of eCommerce, identifying the key entities, such as brand and product type, can help a search engine retrieve relevant products and therefore offer an engaging shopping experience. Recent research shows promising results on shared benchmark NER tasks using deep learning methods, but there are still unique challenges in the industry regarding domain knowledge, training data, and model production. This paper demonstrates an end-to-end solution to address these challenges. The core of our solution is a novel model training framework "TripleLearn" which iteratively learns from three separate training datasets, instead of one training set as is traditionally done. Using this approach, the best model lifts the F1 score from 69.5 to 93.3 on the holdout test data. In our offline experiments, TripleLearn improved the model performance compared to traditional training approaches which use a single set of training data. Moreover, in the online A/B test, we see significant improvements in user engagement and revenue conversion. The model has been live on homedepot.com for more than 9 months, boosting search conversions and revenue. Beyond our application, this TripleLearn framework, as well as the end-to-end process, is model-independent and problem-independent, so it can be generalized to more industrial applications, especially to the eCommerce industry which has similar data foundations and problems.

product type, query, training data, (14 more...)

arXiv.org Artificial Intelligence

2012.07553

Country:

North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.05)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Services > e-Commerce Services (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Risk-based Adaptive Deep Learning for Entity Resolution

Chen, Qun, Chen, Zhaoqiang, Nafa, Youcef, Duan, Tianyi, Li, Zhanhuai

arXiv.org Artificial Intelligence, Dec-10-2020

State-of-the-art performance on entity resolution (ER) has been achieved by deep learning. However, deep models are usually trained on large quantities of accurately labeled training data, and cannot be easily tuned towards a target workload. Unfortunately, in real scenarios, there may not be sufficient labeled training data, and even worse, their distribution is usually more or less different from the target workload even when they come from the same domain. To alleviate these limitations, this paper proposes a novel risk-based approach to tune a deep model towards a target workload by its particular characteristics. Built on the recent advances in risk analysis for ER, the proposed approach first trains a deep model on labeled training data, and then fine-tunes it by minimizing its estimated misprediction risk on unlabeled target data. Our theoretical analysis shows that risk-based adaptive training can correct the label status of a mispredicted instance with a fairly good chance. We have also empirically validated the efficacy of the proposed approach on real benchmark data by a comparative study. Our extensive experiments show that it can considerably improve the performance of deep models. Furthermore, in the scenario of distribution misalignment, it can similarly outperform the state-of-the-art alternative of transfer learning by considerable margins. Using ER as a test case, we demonstrate that risk-based adaptive training is a promising approach potentially applicable to various challenging classification tasks.

denote, probability, training data, (16 more...)

arXiv.org Artificial Intelligence

2012.03513

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.71)

Information Retrieval Document Search Using Vector Space Model in R

#artificialintelligence, Dec-9-2020, 11:30:38 GMT

Now calculate the cosine similarity between each document and each query. For each query, sort the cosine similarity scores for all the documents and take the three documents with the highest scores.
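The ranking step described here can be sketched in a few lines. The linked article works in R; this sketch uses Python and raw term counts rather than any particular weighting scheme, both of which are assumptions, as are the helper names `term_vector`, `cosine`, and `top_k`.

```python
import math
from collections import Counter

def term_vector(text, vocab):
    """Raw term-count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query, docs, k=3):
    """Return (doc_index, score) for the k documents most similar to query."""
    vocab = sorted({t for text in docs + [query] for t in text.lower().split()})
    q_vec = term_vector(query, vocab)
    scored = [(cosine(q_vec, term_vector(d, vocab)), i) for i, d in enumerate(docs)]
    # Highest score first; ties broken by the lower document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]

# Hypothetical toy corpus, for illustration only.
docs = ["apple banana", "banana cherry", "dog cat", "apple apple"]
ranking = top_k("apple", docs, k=3)
```

Because cosine similarity normalizes by vector length, the document `"apple apple"` scores 1.0 against the query `"apple"`, ahead of `"apple banana"` at roughly 0.71.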

query, term document matrix, vector, (14 more...)

#artificialintelligence

Country:

North America > United States > Illinois > Cook County > Chicago (0.05)
North America > United States > Hawaii > Honolulu County > Honolulu (0.05)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Law (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.45)

Airbus AI Introduces Natural Language QA System for Flight Crews

#artificialintelligence, Dec-6-2020, 01:10:50 GMT

Airbus AI researchers have developed a system that uses natural language understanding to improve question answering (QA) performance when flight crews search for aircraft operating information. The aerospace industry relies on technical documents such as Aircraft Operating Manuals (AOM), Aircraft Operating Instructions, and particularly Flight Crew Operating Manuals (FCOM) to guide flight crews on aircraft operations under normal, abnormal, and emergency conditions. FCOMs are issued by aircraft manufacturers and cover system descriptions, procedures, techniques, and performance data. They are the references used to develop standard operating procedures to improve safety and efficiency. Most government aviation administrations have authorized the use of tablet computers by commercial carrier pilots and flight crews to access FCOM information. The Airbus AI researchers note, however, that existing electronic flight bag (EFB) systems used for this purpose are in practice little more than PDF viewers with keyword search functionality.

dialogue engine, flight crew, introduce natural language qa system, (8 more...)

#artificialintelligence

Country: Asia > China (0.08)

Industry:

Transportation > Air (1.00)
Government > Military > Air Force (1.00)
Consumer Products & Services > Travel (1.00)
Aerospace & Defense (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.39)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.39)

End-to-End QA on COVID-19: Domain Adaptation with Synthetic Training

Reddy, Revanth Gangi, Iyer, Bhavani, Sultan, Md Arafat, Zhang, Rong, Sil, Avi, Castelli, Vittorio, Florian, Radu, Roukos, Salim

arXiv.org Artificial Intelligence, Dec-2-2020

End-to-end question answering (QA) requires both information retrieval (IR) over a large document collection and machine reading comprehension (MRC) on the retrieved passages. Recent work has successfully trained neural IR systems using only supervised question answering (QA) examples from open-domain datasets. However, despite impressive performance on Wikipedia, neural IR lags behind traditional term matching approaches such as BM25 in more specific and specialized target domains such as COVID-19. Furthermore, given little or no labeled data, effective adaptation of QA systems can also be challenging in such target domains. In this work, we explore the application of synthetically generated QA examples to improve performance on closed-domain retrieval and MRC. We combine our neural IR and MRC systems and show significant improvements in end-to-end QA on the CORD-19 collection over a state-of-the-art open-domain QA baseline.
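For readers unfamiliar with the term-matching baseline mentioned here, the following is a minimal sketch of BM25 scoring. It uses one common variant of the formula (a smoothed, non-negative IDF) with whitespace tokenization and default parameters `k1=1.5` and `b=0.75`; these choices and the function name `bm25_scores` are assumptions, not the configuration used in the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25.

    k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for d in tokenized for t in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, for illustration only.
docs = [
    "covid vaccine trial results",
    "stock market news",
    "covid covid symptoms",
]
scores = bm25_scores("covid symptoms", docs)
```

Documents that share no terms with the query score zero, which is the limitation that motivates neural retrieval; the abstract's point is that in specialized domains this simple matching can still be the stronger baseline.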

covid-19, dataset, proceedings, (14 more...)

arXiv.org Artificial Intelligence

2012.01414

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Champaign County > Urbana (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report > Experimental Study (0.68)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

ClimaText: A Dataset for Climate Change Topic Detection

Varini, Francesco S., Boyd-Graber, Jordan, Ciaramita, Massimiliano, Leippold, Markus

arXiv.org Artificial Intelligence, Dec-1-2020

Climate change communication in the mass media and other textual sources may affect and shape public perception. Extracting climate change information from these sources is an important task, e.g., for filtering content and e-discovery, sentiment analysis, automatic summarization, question-answering, and fact-checking. However, automating this process is a challenge, as climate change is a complex, fast-moving, and often ambiguous topic with scarce resources for popular text-based AI tasks. In this paper, we introduce ClimaText, a dataset for sentence-based climate change topic detection, which we make publicly available. We explore different approaches to identify the climate change topic in various text sources. We find that popular keyword-based models are not adequate for such a complex and evolving task. Context-based algorithms like BERT (Devlin et al., 2018) can detect, in addition to many trivial cases, a variety of complex and implicit topic patterns. Nevertheless, our analysis reveals great potential for improvement in several directions, such as capturing the discussion on indirect effects of climate change. Hence, we hope this work can serve as a good starting point for further research on this topic.

climate change, climatext, dataset, (16 more...)

arXiv.org Artificial Intelligence

2012.00483

Country:

Europe > Switzerland > Zürich > Zürich (0.15)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(3 more...)

Genre: Overview (0.94)

Industry:

Law (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.87)

Search Engine Optimization Complete Specialization Course

#artificialintelligence, Nov-27-2020, 19:55:11 GMT

Welcome to the world's best specialized SEO course ever. This is the only course in the world where you will also learn about the technicalities of SEO and how to handle them. The content of this course is based on real-world practices and checklists used by professionals in the SEO world. How to get a job in SEO? How to start your own digital marketing company?

engine optimization complete specialization course, search engine optimization

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry: Education (0.74)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.50)
