AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

A Computational Inflection for Scientific Discovery

Communications of the ACMJul-31-2023, 22:21:30 GMT

We leverage research in natural language processing (NLP), information retrieval, data mining, and human-computer interaction (HCI) and draw concepts from multiple disciplines. For example, efforts in metascience focus on sociological factors that influence the evolution of science,17 such as analyses of information silos that impede mutual understanding and interaction,38 of macro-scale ramifications of the rapid growth in scholarly publications,4 and of current metrics for measuring impact5--work enabled by digitization of scholarly corpora. Metascience research makes important observations about human biases (desideratum 2) but generally does not engage in building computational interventions to augment researchers (desideratum 1). Conversely, work in literature-based discovery33 mines information from literature to generate new predictions (for example, functions of materials or drug targets) but is typically done in isolation from cognitive considerations; however, these techniques have great promise in being used as part of human-augmentation systems. Other work uses machines to automate aspects of science.

information, knowledge, representation, (17 more...)

Communications of the ACM

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Washington > King County > Redmond (0.04)
North America > United States > Illinois > Cook County > Evanston (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scientific Discovery (0.83)

Add feedback

Defense of Adversarial Ranking Attack in Text Retrieval: Benchmark and Baseline via Detection

Chen, Xuanang, He, Ben, Sun, Le, Sun, Yingfei

arXiv.org Artificial IntelligenceJul-31-2023

Neural ranking models (NRMs) have undergone significant development and have become integral components of information retrieval (IR) systems. Unfortunately, recent research has unveiled the vulnerability of NRMs to adversarial document manipulations, potentially exploited by malicious search engine optimization practitioners. While progress in adversarial attack strategies aids in identifying the potential weaknesses of NRMs before their deployment, the defensive measures against such attacks, like the detection of adversarial documents, remain inadequately explored. To mitigate this gap, this paper establishes a benchmark dataset to facilitate the investigation of adversarial ranking defense and introduces two types of detection tasks for adversarial documents. A comprehensive investigation of the performance of several detection baselines is conducted, which involve examining the spamicity, perplexity, and linguistic acceptability, and utilizing supervised classifiers. Experimental results demonstrate that a supervised classifier can effectively mitigate known attacks, but it performs poorly against unseen attacks. Furthermore, such classifier should avoid using query text to prevent learning the classification on relevance, as it might lead to the inadvertent discarding of relevant documents.

information retrieval, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2307.16816

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Austria > Vienna (0.14)
(12 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Lexically-Accelerated Dense Retrieval

Kulkarni, Hrishikesh, MacAvaney, Sean, Goharian, Nazli, Frieder, Ophir

arXiv.org Artificial IntelligenceJul-31-2023

Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve these gains, dense retrieval approaches typically require an exhaustive search over the document collection, making them considerably more expensive at query-time than conventional lexical approaches. Several techniques aim to reduce this computational overhead by approximating the results of a full dense retriever. Although these approaches reasonably approximate the top results, they suffer in terms of recall -- one of the key advantages of dense retrieval. We introduce 'LADR' (Lexically-Accelerated Dense Retrieval), a simple-yet-effective approach that improves the efficiency of existing dense retrieval models without compromising on retrieval effectiveness. LADR uses lexical retrieval techniques to seed a dense retrieval exploration that uses a document proximity graph. We explore two variants of LADR: a proactive approach that expands the search space to the neighbors of all seed documents, and an adaptive approach that selectively searches the documents with the highest estimated relevance in an iterative fashion. Through extensive experiments across a variety of dense retrieval models, we find that LADR establishes a new dense retrieval effectiveness-efficiency Pareto frontier among approximate k nearest neighbor techniques. Further, we find that when tuned to take around 8ms per query in retrieval latency on our hardware, LADR consistently achieves both precision and recall that are on par with an exhaustive search on standard benchmarks.

information retrieval, ladr, machine learning, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3539618.3591715

2307.16779

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > District of Columbia > Washington (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)

Add feedback

Tackling Query-Focused Summarization as A Knowledge-Intensive Task: A Pilot Study

Zhang, Weijia, Vakulenko, Svitlana, Rajapakse, Thilina, Xu, Yumo, Kanoulas, Evangelos

arXiv.org Artificial IntelligenceJul-31-2023

Query-focused summarization (QFS) requires generating a summary given a query using a set of relevant documents. However, such relevant documents should be annotated manually and thus are not readily available in realistic scenarios. To address this limitation, we tackle the QFS task as a knowledge-intensive (KI) task without access to any relevant documents. Instead, we assume that these documents are present in a large-scale knowledge corpus and should be retrieved first. To explore this new setting, we build a new dataset (KI-QFS) by adapting existing QFS datasets. In this dataset, answering the query requires document retrieval from a knowledge corpus. We construct three different knowledge corpora, and we further provide relevance annotations to enable retrieval evaluation. Finally, we benchmark the dataset with state-of-the-art QFS models and retrieval-enhanced models. The experimental results demonstrate that QFS models perform significantly worse on KI-QFS compared to the original QFS task, indicating that the knowledge-intensive setting is much more challenging and offers substantial room for improvement. We believe that our investigation will inspire further research into addressing QFS in more realistic scenarios.

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2112.07536

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Taiwan > Taiwan Province > Taipei (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(11 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

A Pre-trained Data Deduplication Model based on Active Learning

Liu, Xinyao, Du, Shengdong, Lv, Fengmao, Xue, Hongtao, Hu, Jie, Li, Tianrui

arXiv.org Artificial IntelligenceJul-30-2023

In the era of big data, the issue of data quality has become increasingly prominent. One of the main challenges is the problem of duplicate data, which can arise from repeated entry or the merging of multiple data sources. These "dirty data" problems can significantly limit the effective application of big data. To address the issue of data deduplication, we propose a pre-trained deduplication model based on active learning, which is the first work that utilizes active learning to address the problem of deduplication at the semantic level. The model is built on a pre-trained Transformer and fine-tuned to solve the deduplication problem as a sequence to classification task, which firstly integrate the transformer with active learning into an end-to-end architecture to select the most valuable data for deduplication model training, and also firstly employ the R-Drop method to perform data augmentation on each round of labeled data, which can reduce the cost of manual labeling and improve the model's performance. Experimental results demonstrate that our proposed model outperforms previous state-of-the-art (SOTA) for deduplicated data identification, achieving up to a 28% improvement in Recall score on benchmark datasets.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2308.00721

Country:

Oceania > Australia > Western Australia > Perth (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
(2 more...)

Add feedback

Integrated Private Data Trading Systems for Data Marketplaces

Li, Weidong, Zhang, Mengxiao, Zhang, Libo, Liu, Jiamou

arXiv.org Artificial IntelligenceJul-30-2023

In the digital age, data is a valuable commodity, and data marketplaces offer lucrative opportunities for data owners to monetize their private data. However, data privacy is a significant concern, and differential privacy has become a popular solution to address this issue. Private data trading systems (PDQS) facilitate the trade of private data by determining which data owners to purchase data from, the amount of privacy purchased, and providing specific aggregation statistics while protecting the privacy of data owners. However, existing PDQS with separated procurement and query processes are prone to over-perturbation of private data and lack trustworthiness. To address this issue, this paper proposes a framework for PDQS with an integrated procurement and query process to avoid excessive perturbation of private data. We also present two instances of this framework, one based on a greedy approach and another based on a neural network. Our experimental results show that both of our mechanisms outperformed the separately conducted procurement and query mechanism under the same budget regarding accuracy.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.16317

Country:

Asia > China (0.04)
Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
South America > Peru (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.49)
(2 more...)

Add feedback

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Park, Jongwoo, Singh, Apoorv, Bankiti, Varun

arXiv.org Artificial IntelligenceJul-28-2023

3D visual perception tasks based on multi-camera images are essential for autonomous driving systems. Latest work in this field performs 3D object detection by leveraging multi-view images as an input and iteratively enhancing object queries (object proposals) by cross-attending multi-view features. However, individual backbone features are not updated with multi-view features and it stays as a mere collection of the output of the single-image backbone network. Therefore we propose 3M3D: A Multi-view, Multi-path, Multi-representation for 3D Object Detection where we update both multi-view features and query features to enhance the representation of the scene in both fine panoramic view and coarse global view. Firstly, we update multi-view features by multi-view axis self-attention. It will incorporate panoramic information in the multi-view features and enhance understanding of the global scene. Secondly, we update multi-view features by self-attention of the ROI (Region of Interest) windows which encodes local finer details in the features. It will help exchange the information not only along the multi-view axis but also along the other spatial dimension. Lastly, we leverage the fact of multi-representation of queries in different domains to further boost the performance. Here we use sparse floating queries along with dense BEV (Bird's Eye View) queries, which are later post-processed to filter duplicate detections. Moreover, we show performance improvements on nuScenes benchmark dataset on top of our baselines.

artificial intelligence, information retrieval query processing, natural language, (15 more...)

arXiv.org Artificial Intelligence

2302.08231

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.05)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (0.51)

Industry:

Information Technology (0.49)
Transportation > Ground > Road (0.35)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.35)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.34)

Add feedback

Fraunhofer SIT at CheckThat! 2023: Tackling Classification Uncertainty Using Model Souping on the Example of Check-Worthiness Classification

Frick, Raphael, Vogel, Inna, Choi, Jeong-Eun

arXiv.org Artificial IntelligenceJul-27-2023

This paper describes the second-placed approach developed by the Fraunhofer SIT team in the CLEF-2023 CheckThat! lab Task 1B for English. Given a text snippet from a political debate, the aim of this task is to determine whether it should be assessed for check-worthiness. Detecting check-worthy statements aims to facilitate manual fact-checking efforts by prioritizing the claims that fact-checkers should consider first. It can also be considered as primary step of a fact-checking system. Our best-performing method took advantage of an ensemble classification scheme centered on Model Souping. When applied to the English data set, our submitted model achieved an overall F1 score of 0.878 and was ranked as the second-best model in the competition.

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2307.02377

Country:

Europe > Switzerland (0.05)
Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.04)
Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
(4 more...)

Genre: Research Report (0.65)

Industry:

Government (0.94)
Media > News (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.74)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Towards Answering Climate Questionnaires from Unstructured Climate Reports

Spokoyny, Daniel, Laud, Tanmay, Corringham, Tom, Berg-Kirkpatrick, Taylor

arXiv.org Artificial IntelligenceJul-27-2023

The topic of Climate Change (CC) has received limited attention in NLP despite its urgency. Activists and policymakers need NLP tools to effectively process the vast and rapidly growing unstructured textual climate reports into structured form. To tackle this challenge we introduce two new large-scale climate questionnaire datasets and use their existing structure to train self-supervised models. We conduct experiments to show that these models can learn to generalize to climate disclosures of different organizations types than seen during training. We then use these models to help align texts from unstructured climate documents to the semi-structured questionnaires in a human pilot study. Finally, to support further NLP research in the climate domain we introduce a benchmark of existing climate text classification datasets to better evaluate and compare existing models.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2301.04253

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Iowa (0.04)
(5 more...)

Genre:

Questionnaire & Opinion Survey (0.94)
Research Report (0.64)

Industry:

Law > Environmental Law (1.00)
Government (1.00)
Energy (1.00)
Law > Statutes (0.67)

Technology:

Information Technology > Communications > Social Media (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)
(2 more...)

Add feedback

Unsupervised extraction of local and global keywords from a single text

Aleksanyan, Lida, Allahverdyan, Armen E.

arXiv.org Artificial IntelligenceJul-26-2023

We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. As compared to existing methods (such as e.g. YAKE) our method has three advantages. First, it is significantly more effective at extracting keywords from long texts. Second, it allows inference of two types of keywords: local and global. Third, it uncovers basic themes in texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with previous knowledge of texts from our database of classical literary works (the agreement between annotators is from moderate to substantial). Our results are supported via human-independent arguments based on the average length of extracted content words and on the average number of nouns in extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2307.14005

Country:

Asia > Russia (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

Add feedback