AITopics | Information Retrieval

Information Retrieval

Our accustomed systems for retrieving particular pieces of information no longer meet the needs of many people. Computerized databases have made it easier to search traditional indexes of print publications, but the process still usually requires time-consuming serial searching of one database after another, followed by separate searches for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where the information being sought is spread across widely different formats, and the universe of knowledge is simply too big and growing too rapidly for searching to succeed at human speed.

SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval

Formal, Thibault, Lassance, Carlos, Piwowarski, Benjamin, Clinchant, Stéphane

arXiv.org Artificial Intelligence, Sep-21-2021

In neural Information Retrieval (IR), ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. Meanwhile, there has been a growing interest in learning sparse representations for documents and queries that could inherit the desirable properties of bag-of-words models, such as the exact matching of terms and the efficiency of inverted indexes. Introduced recently, the SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches. In this paper, we build on SPLADE and propose several significant improvements in terms of effectiveness and/or efficiency. More specifically, we modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation. We also report results on the BEIR benchmark. Overall, SPLADE is considerably improved with more than 9% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
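To make the appeal of sparse representations concrete, here is a minimal sketch of how term-weight vectors can be scored through an inverted index. It is not the SPLADE model itself: the term weights are made-up toy values (a real system would learn them), and the names `build_inverted_index`, `search`, `DOCS`, and `QUERY` are hypothetical. The point is only that a query touches just the posting lists of its own non-zero terms.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:  # zero weights are simply not stored (sparsity)
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, k=10):
    """Score documents by the dot product of sparse query and document vectors."""
    scores = defaultdict(float)
    for term, query_weight in query_vector.items():
        # Only documents that contain the term are visited (exact matching).
        for doc_id, doc_weight in index.get(term, ()):
            scores[doc_id] += query_weight * doc_weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy, hand-written weights (assumption: stand-ins for learned term weights).
DOCS = {
    "d1": {"sparse": 1.0, "retrieval": 2.0},
    "d2": {"dense": 1.5, "retrieval": 1.0},
}
QUERY = {"sparse": 1.0, "retrieval": 0.5}

INDEX = build_inverted_index(DOCS)
RESULTS = search(INDEX, QUERY)
```

Here `RESULTS` ranks `d1` above `d2`, because `d1` matches both query terms while `d2` matches only one.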

information retrieval, representation, retrieval, (11 more...)

arXiv.org Artificial Intelligence

2109.10086

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > Canada (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

Hou, Zhijian, Ngo, Chong-Wah, Chan, Wing Kwong

arXiv.org Artificial Intelligence, Sep-21-2021

This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel CONtextual QUery-awarE Ranking (CONQUER) model for effective moment localization and ranking. CONQUER explores query context for multi-modal fusion and representation learning in two different steps. The first step derives fusion weights for the adaptive combination of multi-modal video content. The second step performs bi-directional attention to tightly couple video and query as a single joint representation for moment localization. As query context is fully engaged in video representation learning, from feature fusion to transformation, the resulting feature is user-centered and has a larger capacity in capturing multi-modal signals specific to query. We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos, to investigate the potential advantages of fusing video and query online as a joint representation for moment retrieval.

conquer, query, video, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3474085.3475281

2109.10016

Country:

Asia > China > Hong Kong (0.04)
Asia > Singapore (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Poland (0.04)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.70)

Generating Local Maps of Science using Deep Bibliographic Coupling

Candel, Gaëlle, Naccache, David

arXiv.org Artificial Intelligence, Sep-21-2021

Bibliographic and co-citation coupling are two analytical methods widely used to measure the degree of similarity between scientific papers. These approaches are intuitive, easy to put into practice, and computationally cheap. Moreover, they have been used to generate maps of science, which make it possible to visualize interactions between research fields. Nonetheless, these methods do not work unless two papers share a common reference, which limits their usefulness for pairs of papers with no direct connection. In this work, we propose to extend bibliographic coupling to the deep neighborhood by using graph diffusion methods. This method allows defining similarity between any two papers, making it possible to generate a local map of science that highlights field organization.
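As a rough illustration of the two classic measures and of why they break down without a shared reference, the sketch below counts shared references (bibliographic coupling) and shared citing papers (co-citation) on a toy citation graph. The last function is a deliberately crude stand-in for the deep-neighborhood idea: it compares the sets of papers reachable within a few citation hops. It is not the graph diffusion method of the paper, and all names and data (`REFS`, `deep_coupling`, and so on) are hypothetical.

```python
def bibliographic_coupling(refs, a, b):
    """Number of references that papers a and b have in common."""
    return len(refs.get(a, set()) & refs.get(b, set()))


def cocitation(refs, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)


def deep_references(refs, paper, depth):
    """Papers reachable from `paper` by following at most `depth` citation hops."""
    seen, frontier = set(), {paper}
    for _ in range(depth):
        # Expand one hop, skipping papers that were already reached.
        frontier = {r for p in frontier for r in refs.get(p, set())} - seen
        seen |= frontier
    seen.discard(paper)  # a citation cycle must not make a paper its own reference
    return seen


def deep_coupling(refs, a, b, depth=2):
    """Simplified multi-hop overlap (assumption: unweighted, not true diffusion)."""
    return len(deep_references(refs, a, depth) & deep_references(refs, b, depth))


# Toy citation graph: each paper maps to the set of papers it cites.
REFS = {
    "A": {"C"},
    "B": {"D"},
    "C": {"E"},
    "D": {"E"},
    "E": set(),
    "F": {"C", "D"},
}
```

Papers `A` and `B` share no direct reference, so their bibliographic coupling is zero, yet both reach `E` within two hops, so `deep_coupling(REFS, "A", "B")` is non-zero. A diffusion-based method would additionally weight such indirect links, typically giving less weight to longer paths.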

generating local map, keyword, similarity, (13 more...)

arXiv.org Artificial Intelligence

2109.10007

Country: Asia > Taiwan (0.04)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
(5 more...)

Data Insights for Everyone -- The Semantic Layer to the Rescue

#artificialintelligence, Sep-20-2021, 21:11:35 GMT

What is a semantic layer? That's a good question, but let's first explain semantics. The way that I explained it to my data science students years ago was like this. In the early days of web search engines, those engines were primarily keyword search engines. If you knew the right keywords to search and if the content providers also used the same keywords on their website, then you could type the words into your favorite search engine and find the content you needed.

data science modeler, search engine, semantic layer, (10 more...)

#artificialintelligence

Country: North America > United States > Texas (0.06)

Industry: Education (0.35)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.99)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.71)

Language Identification with a Reciprocal Rank Classifier

Widdows, Dominic, Brew, Chris

arXiv.org Artificial Intelligence, Sep-20-2021

Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of the rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter the macro-averaged F1-score of a conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast, the macro F1-score of RRC drops only from 93.1% to 90.6%. These classifiers are compared with those from fastText and langid. The RRC performs better than these established systems in most experiments, especially on short Wikipedia texts and Twitter. The RRC classifier can be improved for particular domains and conversational situations by adding words to the ranked lists. Using new terms learned from such conversations, we demonstrate a further 7.9% increase in accuracy of sample message classification, and 1.7% increase for conversation classification. Surprisingly, this made results on Twitter data slightly worse. The RRC classifier is available as an open source Python package (https://github.com/LivePersonInc/lplangid).
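The key idea, that the reciprocal of a word's rank in a frequency table serves as an additive feature score, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the API of the lplangid package: it uses word ranks only (the abstract also mentions character frequencies), the two tiny corpora are invented, and the names `build_ranks`, `rrc_score`, and `classify` are hypothetical.

```python
from collections import Counter


def build_ranks(corpus_tokens):
    """Map each word to its 1-based rank in a frequency table (most frequent = 1)."""
    counts = Counter(corpus_tokens)
    # Sort by descending count, then alphabetically so ranks are deterministic.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def rrc_score(tokens, ranks):
    """Sum of reciprocal ranks of the tokens found in one language's table."""
    return sum(1.0 / ranks[token] for token in tokens if token in ranks)


def classify(text, rank_tables):
    """Return the best-scoring language and the per-language scores."""
    tokens = text.lower().split()
    scores = {lang: rrc_score(tokens, ranks) for lang, ranks in rank_tables.items()}
    return max(scores, key=scores.get), scores


# Toy training data (assumption: real tables would come from large corpora).
TOY_CORPORA = {
    "en": "the cat and the dog and the bird".split(),
    "fr": "le chat et le chien et le oiseau".split(),
}
RANK_TABLES = {lang: build_ranks(tokens) for lang, tokens in TOY_CORPORA.items()}
```

Because scores are plain sums over per-language word lists, adapting to a new domain amounts to adding words to a ranked list, which matches the abstract's remark that the classifier can be improved in this way.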

classifier, information retrieval, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2109.09862

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
Asia > Thailand > Chiang Mai > Chiang Mai (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Information Technology > Services (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.34)

The Case for Claim Difficulty Assessment in Automatic Fact Checking

Singh, Prakhar, Das, Anubrata, Li, Junyi Jessy, Lease, Matthew

arXiv.org Artificial Intelligence, Sep-20-2021

Fact-checking is the process (human, automated, or hybrid) by which claims (i.e., purported facts) are evaluated for veracity. In this article, we raise an issue that has received little attention in prior work - that some claims are far more difficult to fact-check than others. We discuss the implications this has for both practical fact-checking and research on automated fact-checking, including task formulation and dataset design. We report a manual analysis undertaken to explore factors underlying varying claim difficulty and categorize several distinct types of difficulty. We argue that prediction of claim difficulty is a missing component of today's automated fact-checking architectures, and we describe how this difficulty prediction task might be split into a set of distinct subtasks.

claim difficulty, natural language processing, proceedings, (11 more...)

arXiv.org Artificial Intelligence

2109.09689

Country:

North America > United States > Wisconsin (0.05)
North America > United States > Texas > Travis County > Austin (0.04)
Africa > South Africa (0.04)

Genre: Research Report (0.40)

Industry:

Media > News (1.00)
Law (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
(2 more...)

Technology:

Information Technology > Communications > Social Media (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Design meets artificial intelligence to create new visual search engine

#artificialintelligence, Sep-17-2021, 01:30:10 GMT

Novel methods of searching the nation's gallery, library and museum collections could soon be revolutionized by a visual search platform designed in collaboration with Northumbria University. As the sector worldwide moves towards presenting collections online, the Deep Discoveries project was launched to explore ways of creating a computer vision search platform that can identify and match images across digitized collections on a national scale. The expertise of Dr. Jo Briggs and Associate Professor Jamie Steane, from Northumbria School of Design, was enlisted to help deliver the collaboration between The National Archives, the University of Surrey and the V&A Museum. Rather than typing a keyword into an empty search box, visual search uses a query image and computer vision artificial intelligence (AI) to match similar images from across digitized collections based on properties such as color, pattern and shape. The Northumbria design team--made up of Jo, Jamie and talented graduate Andy Cain--joined the project at a later stage to help with information sharing and developing the user experience.

artificial intelligence, create new visual search engine, design meet artificial intelligence, (8 more...)

#artificialintelligence

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation

Han, Yuxing, Wu, Ziniu, Wu, Peizhi, Zhu, Rong, Yang, Jingyi, Tan, Liang Wei, Zeng, Kai, Cong, Gao, Qin, Yanzhao, Pfadler, Andreas, Qian, Zhengping, Zhou, Jingren, Li, Jiangneng, Cui, Bin

arXiv.org Artificial Intelligence, Sep-15-2021

Cardinality estimation (CardEst) plays a significant role in generating high-quality query plans for a query optimizer in DBMS. In the last decade, an increasing number of advanced CardEst methods (especially ML-based) have been proposed with outstanding estimation accuracy and inference latency. However, there exists no study that systematically evaluates the quality of these methods and answers the fundamental question: to what extent can these methods improve the performance of a query optimizer in real-world settings, which is the ultimate goal of a CardEst method? In this paper, we comprehensively and systematically compare the effectiveness of CardEst methods in a real DBMS. We establish a new benchmark for CardEst, which contains a new complex real-world dataset STATS and a diverse query workload STATS-CEB. We integrate several of the most representative CardEst methods into an open-source database system, PostgreSQL, and comprehensively evaluate their true effectiveness in improving query plan quality, as well as other important aspects affecting their applicability, ranging from inference latency, model size, and training time to update efficiency and accuracy. We obtain a number of key findings for the CardEst methods under different data and query settings. Furthermore, we find that the widely used estimation accuracy metric (Q-Error) cannot distinguish the importance of different sub-plan queries during query optimization and thus cannot truly reflect the query plan quality generated by CardEst methods. Therefore, we propose a new metric, P-Error, to evaluate the performance of CardEst methods, which overcomes the limitation of Q-Error and is able to reflect the overall end-to-end performance of CardEst methods. We have made all of the benchmark data and evaluation code publicly available at https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark.
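The abstract does not define Q-Error, so the sketch below assumes the commonly used definition: the larger of the two ratios between estimated and true cardinality, with both values clamped to at least 1. The function name `q_error`, the `floor` parameter, and the example cardinalities are hypothetical. It is meant only to show why a ratio-based metric treats a small and a large sub-plan alike.

```python
def q_error(estimated, true, floor=1.0):
    """Ratio-based estimation error; 1.0 means a perfect estimate.

    Assumption: both cardinalities are clamped to `floor` so that empty
    results do not cause a division by zero.
    """
    est = max(float(estimated), floor)
    tru = max(float(true), floor)
    return max(est / tru, tru / est)


# Two hypothetical sub-plan queries, each misestimated by a factor of 10.
SMALL_SCAN = q_error(estimated=10, true=100)
LARGE_JOIN = q_error(estimated=1_000_000, true=10_000_000)
```

Both values come out equal, even though misjudging the ten-million-row sub-plan is far more likely to change which plan the optimizer picks. This is the kind of limitation the abstract gives as motivation for P-Error; P-Error itself is not sketched here because the abstract does not specify it.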

cardest method, estimation, query, (15 more...)

arXiv.org Artificial Intelligence

2109.05877

Country:

Asia > Middle East > UAE (0.05)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

What are the attackers doing now? Automating cyber threat intelligence extraction from text on pace with the changing threat landscape: A survey

Rahman, Md Rayhanur, Mahdavi-Hezaveh, Rezvan, Williams, Laurie

arXiv.org Artificial Intelligence, Sep-14-2021

Cybersecurity researchers have contributed to the automated extraction of cyber threat intelligence (CTI) from textual sources, such as threat reports and online articles, where cyberattack strategies, procedures, and tools are described. The goal of this article is to help cybersecurity researchers understand the current techniques used for CTI extraction from text through a survey of relevant studies in the literature. We systematically collect "CTI extraction from text"-related studies from the literature and categorize the CTI extraction purposes. We propose a CTI extraction pipeline abstracted from these studies. We identify the data sources, techniques, and CTI sharing formats utilized in the context of the proposed pipeline. Our work finds ten types of extraction purposes, such as extraction of indicators of compromise, TTPs (tactics, techniques, and procedures of attack), and cybersecurity keywords. We also identify seven types of textual sources for CTI extraction; textual data obtained from hacker forums, threat reports, social media posts, and online news articles have been used by almost 90% of the studies. Natural language processing, along with both supervised and unsupervised machine learning techniques such as named entity recognition, topic modelling, dependency parsing, supervised classification, and clustering, is used for CTI extraction. We observe technical challenges in these studies related to obtaining clean, labelled data that could assure replication, validation, and further extension of the studies. Because the studies we find focus on CTI extraction from text, we advocate for building upon the current CTI extraction work to help cybersecurity practitioners with proactive decision making, such as threat prioritization and automated threat modelling, so as to utilize knowledge from past cybersecurity incidents.

information retrieval, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3571726

2109.06808

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.14)
North America > United States > Utah (0.04)
North America > United States > Virginia (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (1.00)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(3 more...)

Online Learning of Optimally Diverse Rankings

Magureanu, Stefan, Proutiere, Alexandre, Isaksson, Marcus, Zhang, Boxun

arXiv.org Machine Learning, Sep-13-2021

Search engines answer users' queries by listing relevant items (e.g. documents, songs, products, web pages, ...). These engines rely on algorithms that learn to rank items so as to present an ordered list maximizing the probability that it contains a relevant item. The main challenge in the design of learning-to-rank algorithms stems from the fact that queries often have different meanings for different users. In the absence of any contextual information about the query, one often has to adhere to the diversity principle, i.e., to return a list covering the various possible topics or meanings of the query. To formalize this learning-to-rank problem, we propose a natural model where (i) items are categorized into topics, (ii) users find items relevant only if they match the topic of their query, and (iii) the engine is not aware of the topic of an arriving query, nor of the frequency at which queries related to various topics arrive, nor of the topic-dependent click-through-rates of the items. For this problem, we devise LDR (Learning Diverse Rankings), an algorithm that efficiently learns the optimal list based on users' feedback only. We show that after $T$ queries, the regret of LDR scales as $O((N-L)\log(T))$ where $N$ is the number of all items. We further establish that this scaling cannot be improved, i.e., LDR is order optimal. Finally, using numerical experiments on both artificial and real-world data, we illustrate the superiority of LDR compared to existing learning-to-rank algorithms.

algorithm, publication date, query, (14 more...)

arXiv.org Machine Learning

doi: 10.1145/3154490, 10.1145/3219617.3219637

2109.05899

Country: Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.64)

Industry:

Media > Music (0.46)
Leisure & Entertainment (0.46)
Education > Educational Setting > Online (0.42)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.46)
(2 more...)
