AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Grounding Language Models for Visual Entity Recognition

Xiao, Zilin, Gong, Ming, Cascante-Bonilla, Paola, Zhang, Xingyao, Wu, Jie, Ordonez, Vicente

arXiv.org Artificial Intelligence, Feb-28-2024

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
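The abstract's idea of retrieved candidates "removing invalid decoding paths" can be illustrated with a prefix trie. The following is a minimal sketch, not AutoVER's implementation: the token lists, the `score_fn` stand-in for a language model, and all function names are hypothetical, and greedy search is assumed for simplicity.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict prefix trie over the candidate token sequences."""
    root: dict = {}
    for tokens in candidates:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: dict, score_fn: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding that only considers tokens the trie allows next.

    Assumes the trie was built from at least one candidate.
    """
    node, output = trie, []
    while True:
        allowed = list(node.keys())  # every other token is an invalid path
        best = max(allowed, key=lambda tok: score_fn(output, tok))
        if best == END:
            return output
        output.append(best)
        node = node[best]


# Toy stand-in for model scores: context-independent token preferences.
TOKEN_SCORES: Dict[str, float] = {
    "eiffel": 5.0,  # highest score, but not part of any retrieved candidate
    "golden": 2.0,
    "gate": 1.5,
    "retriever": 1.0,
    "bridge": 1.0,
}


def toy_score(prefix: List[str], token: str) -> float:
    return TOKEN_SCORES.get(token, 0.0)


retrieved = [
    ["golden", "retriever"],
    ["golden", "gate", "bridge"],
    ["labrador", "retriever"],
]
decoded = constrained_greedy_decode(build_trie(retrieved), toy_score)
print(decoded)
```

Although "eiffel" has the highest toy score, it never appears in the output because no retrieved candidate starts with it; the decoder can only produce one of the candidate sequences.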

grounding language model, language model, recognition, (10 more...)

arXiv.org Artificial Intelligence

2402.18695

Country:

Europe > Austria (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > New York > New York County > New York City (0.04)
(7 more...)

Genre: Research Report (0.82)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Transportation > Air (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

UniRetriever: Multi-task Candidates Selection for Various Context-Adaptive Conversational Retrieval

Wang, Hongru, Xue, Boyang, Zhou, Baohang, Wang, Rui, Mi, Fei, Wang, Weichao, Wang, Yasheng, Wong, Kam-Fai

arXiv.org Artificial Intelligence, Feb-28-2024

Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework that functions as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attend to the relevant context from the long dialogue and retrieve suitable candidates with a simple dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.
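The dot-product retrieval step of a dual-encoder setup can be sketched as follows. This is an illustration only: the vectors are hand-written placeholders for what a dialogue encoder and a candidate encoder would output, and the candidate labels and function names are hypothetical rather than taken from UniRetriever.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(
    context_vec: List[float], candidate_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the dialogue context, best first."""
    scored = [(name, dot(context_vec, vec)) for name, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder embeddings; a real system would compute these with encoders.
context = [1.0, 0.0, 2.0]
candidates = {
    "persona: likes hiking": [1.0, 0.0, 1.0],
    "knowledge: trail map": [0.0, 1.0, 2.0],
    "response: see you": [1.0, 1.0, 0.0],
}

ranked = rank_candidates(context, candidates)
for name, score in ranked:
    print(f"{score:.1f}  {name}")
```

Because scoring is a single dot product, candidate vectors for persona, knowledge, and response pools can be precomputed once and reused across dialogue turns; only the context vector changes.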

computational linguistic, selection, selection task, (14 more...)

arXiv.org Artificial Intelligence

2402.16261

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > Dominican Republic (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

Web crawler strategies for web pages under robot.txt restriction

Vyas, Piyush, Chauhan, Akhilesh, Mandge, Tushar, Hardikar, Surbhi

arXiv.org Artificial Intelligence, Feb-28-2024

Nearly everyone today knows the World Wide Web and works over the Internet daily. In this paper, we introduce how search engines handle the keywords that users enter to find something. A search engine uses different search algorithms to provide convenient results to the net surfer. Net surfers go with the top search results, but how did those web pages get higher ranks in search engines? How did the search engine get all those web pages into its database? This paper answers these basic questions. Web crawlers working for search engines and the robots exclusion protocol rules for web crawlers are also addressed. Webmasters use different restriction directives in the robots.txt file to instruct web crawlers; some basic formats of robots.txt are also presented in this paper.
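How a crawler consults robots exclusion rules can be shown with Python's standard `urllib.robotparser` module. The sketch below is not taken from the paper: the rules, the crawler names, and the `example.com` URLs are made up for illustration, and the file is parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is kept out of /private/,
# and a crawler identifying itself as "BadBot" is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of downloading them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

allowed_public = parser.can_fetch("ExampleCrawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleCrawler", "https://example.com/private/data.html")
allowed_badbot = parser.can_fetch("BadBot", "https://example.com/index.html")

print(allowed_public, allowed_private, allowed_badbot)
```

A polite crawler would call `can_fetch` before every request and skip any URL for which it returns `False`.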

crawler, engine, search engine, (11 more...)

arXiv.org Artificial Intelligence

2308.04689

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.15)
Asia > India (0.06)
Africa > Mali (0.05)

Genre: Research Report (0.50)

Industry: Information Technology (0.47)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining > Web Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Unlocking the potential of entity-centric knowledge graphs: transforming healthcare and beyond

AIHub, Feb-27-2024, 16:30:38 GMT

Knowledge graphs (KGs) have become a cornerstone in organizing and utilizing information across various domains, from enhancing search engines to improving recommendation systems. KGs comprise nodes (entities) and edges (relations) that depict the knowledge within a specific field or a collection of domains. The potential of KGs to enable intricate reasoning and inference has been investigated across various endeavors, encompassing tasks such as information retrieval and knowledge discovery. While KGs have come a long way, representing knowledge effectively remains a formidable challenge, especially in complex fields like healthcare and biomedicine. This article highlights our recent publication "Representation Learning for Person or Entity-centric Knowledge Graphs: An Application in Healthcare" (presented at K-CAP 2023) and explores the concept of entity-centric knowledge graphs, a relatively uncharted territory in the KG landscape, but one that holds immense promise in reshaping how we organize, access, and leverage data.

entity-centric knowledge graph, knowledge graph, representation, (12 more...)

AIHub

Country: Europe (0.05)

Genre: Research Report > New Finding (0.47)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.31)
Health & Medicine > Health Care Providers & Services (0.30)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

LinkNER: Linking Local Named Entity Recognition Models to Large Language Models using Uncertainty

Zhang, Zhen, Zhao, Yuhua, Gao, Hang, Hu, Mengting

arXiv.org Artificial Intelligence, Feb-27-2024

Named Entity Recognition (NER) serves as a fundamental task in natural language understanding, bearing direct implications for web content analysis, search engines, and information retrieval systems. Fine-tuned NER models exhibit satisfactory performance on standard NER benchmarks. However, due to limited fine-tuning data and lack of knowledge, they perform poorly on unseen entity recognition. As a result, the usability and reliability of NER models in web-related applications are compromised. In contrast, Large Language Models (LLMs) like GPT-4 possess extensive external knowledge, but research indicates that they lack specialty for NER tasks. Furthermore, non-public and large-scale weights make tuning LLMs difficult. To address these challenges, we propose a framework that combines small fine-tuned models with LLMs (LinkNER) and an uncertainty-based linking strategy called RDC that enables fine-tuned models to complement black-box LLMs, achieving better performance. We experiment with both standard NER test sets and noisy social media datasets. LinkNER enhances NER task performance, notably surpassing SOTA models in robustness tests. We also quantitatively analyze the influence of key components like uncertainty estimation methods, LLMs, and in-context learning on diverse NER tasks, offering specific web-related recommendations.
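One generic way to link a local model to an LLM by uncertainty is to defer whenever the local label distribution has high entropy. The sketch below assumes that approach; it is not the paper's RDC strategy, whose details the abstract does not give. The threshold value, the label probabilities, and the `llm_fallback` stub are all hypothetical.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def link_prediction(
    label_probs: Dict[str, float],
    threshold: float,
    llm_fallback: Callable[[], str],
) -> Tuple[str, str]:
    """Return (label, source): keep the local label if confident, else defer."""
    if entropy(label_probs.values()) > threshold:
        return llm_fallback(), "llm"
    return max(label_probs, key=label_probs.get), "local"


def stub_llm() -> str:
    # Stand-in for a call to a black-box LLM; returns a fixed label here.
    return "ORG"


confident = {"PER": 0.97, "ORG": 0.02, "LOC": 0.01}   # low entropy
uncertain = {"PER": 0.40, "ORG": 0.35, "LOC": 0.25}   # high entropy

result_confident = link_prediction(confident, threshold=0.5, llm_fallback=stub_llm)
result_uncertain = link_prediction(uncertain, threshold=0.5, llm_fallback=stub_llm)
print(result_confident, result_uncertain)
```

With this design the LLM is only queried for spans the local model is unsure about, which keeps the number of black-box calls proportional to the difficulty of the input.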

dataset, linkner, llm, (15 more...)

arXiv.org Artificial Intelligence

2402.10573

Country:

Asia > Singapore > Central Region > Singapore (0.05)
Asia > China > Tianjin Province > Tianjin (0.05)
North America > Canada > Manitoba (0.05)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Corpus-Steered Query Expansion with Large Language Models

Lei, Yibin, Cao, Yu, Zhou, Tianyi, Shen, Tao, Yates, Andrew

arXiv.org Artificial Intelligence, Feb-27-2024

Recent studies demonstrate that query expansions generated by large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.
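The corpus-steered part of this pipeline (retrieve documents, pick pivotal sentences, append them to the query) can be sketched in a few lines. This is a rough illustration under strong assumptions: simple term overlap stands in for both the initial retriever and the LLM's relevance judgment, and the corpus, parameter names, and defaults are invented for the example.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, text: str) -> int:
    """Number of distinct query terms that also occur in the text."""
    return len(tokenize(query) & tokenize(text))


def expand_query(
    query: str, corpus: List[str], top_docs: int = 2, top_sentences: int = 2
) -> str:
    """Append the most query-related sentences from the top documents."""
    # Initial retrieval: term overlap stands in for a real retriever.
    ranked = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)[:top_docs]
    sentences = [
        s.strip()
        for doc in ranked
        for s in re.split(r"(?<=[.!?])\s+", doc)
        if s.strip()
    ]
    # Sentence selection: term overlap stands in for an LLM relevance judgment.
    pivotal = sorted(sentences, key=lambda s: overlap(query, s), reverse=True)
    pivotal = [s for s in pivotal[:top_sentences] if overlap(query, s) > 0]
    return " ".join([query] + pivotal)


corpus = [
    "BM25 is a ranking function. It scores documents by term frequency.",
    "Cats sleep a lot. Dogs bark.",
    "Query expansion adds terms to a query. Expansion can use feedback documents.",
]
query = "query expansion feedback"
expanded = expand_query(query, corpus)
print(expanded)
```

Because the added text is copied from the corpus rather than generated, the expansion cannot introduce facts the corpus does not contain, which is the motivation the abstract gives for steering by the corpus.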

csqe, expansion, llm, (14 more...)

arXiv.org Artificial Intelligence

2402.18031

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.04)
Asia > Singapore (0.04)
South America (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.84)

Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey

Le, Dinh-Viet-Toan, Bigo, Louis, Keller, Mikaela, Herremans, Dorien

arXiv.org Artificial Intelligence, Feb-27-2024

Several adaptations of Transformer models have been developed in various domains since their breakthrough in Natural Language Processing (NLP). This trend has spread into the field of Music Information Retrieval (MIR), including studies processing music data. However, the practice of leveraging NLP tools for symbolic music data is not novel in MIR. Music has been frequently compared to language, as they share several similarities, including sequential representations of text and music. These analogies are also reflected through similar tasks in MIR and NLP. This survey reviews NLP methods applied to symbolic music generation and information retrieval studies following two axes. We first propose an overview of representations of symbolic music adapted from natural language sequential representations. Such representations are designed by considering the specificities of symbolic music. These representations are then processed by models. Such models, possibly originally developed for text and adapted for symbolic music, are trained on various tasks. We describe these models, in particular deep learning models, through different prisms, highlighting music-specialized mechanisms. We finally present a discussion surrounding the effective use of NLP tools for symbolic music data. This includes technical issues regarding NLP methods and fundamental differences between text and music, which may open several doors for further research into more effectively adapting NLP tools to symbolic MIR.

music, representation, symbolic music, (11 more...)

arXiv.org Artificial Intelligence

2402.17467

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Italy > Tuscany > Florence (0.04)
(23 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.45)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

Guo, Jiafeng, Zhou, Changjiang, Zhang, Ruqing, Chen, Jiangui, de Rijke, Maarten, Fan, Yixing, Cheng, Xueqi

arXiv.org Artificial Intelligence, Feb-26-2024

Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers. Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance. However, most existing research on KILTs, including CorpusBrain, has predominantly focused on a static document collection, overlooking the dynamic nature of real-world scenarios, where new documents are continuously being incorporated into the source corpus. To address this gap, it is crucial to explore the capability of retrieval models to effectively handle the dynamic retrieval scenario inherent in KILTs. In this work, we first introduce the continual document learning (CDL) task for KILTs and build a novel benchmark dataset named KILT++ based on the original KILT dataset for evaluation. Then, we conduct a comprehensive study over the use of pre-trained CorpusBrain on KILT++. Unlike the promising results in the stationary scenario, CorpusBrain is prone to catastrophic forgetting in the dynamic scenario, hence hampering the retrieval performance. To alleviate this issue, we propose CorpusBrain++, a continual generative pre-training framework. Empirical results demonstrate the significant effectiveness and remarkable efficiency of CorpusBrain++ in comparison to both traditional and generative IR methods.

dataset, docid, retrieval performance, (15 more...)

arXiv.org Artificial Intelligence

2402.16767

Country:

Africa > South Africa (0.69)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Beijing > Beijing (0.04)
(4 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology (0.93)
Government > Regional Government > Africa Government > South Africa Government (0.47)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

IR2: Information Regularization for Information Retrieval

Wang, Jianyou, Wang, Kaicheng, Wang, Xiaoyue, Cao, Weili, Paturi, Ramamohan, Bergen, Leon

arXiv.org Artificial Intelligence, Feb-25-2024

Effective information retrieval (IR) in settings with limited training data, particularly for complex queries, remains a challenging task. This paper introduces IR2, Information Regularization for Information Retrieval, a technique for reducing overfitting during synthetic data generation. This approach, representing a novel application of regularization techniques in synthetic data creation for IR, is tested on three recent IR tasks characterized by complex queries: DORIS-MAE, ArguAna, and WhatsThatBook. Experimental results indicate that our regularization techniques not only outperform previous synthetic query generation methods on the tasks considered but also reduce cost by up to 50%. Furthermore, this paper categorizes and explores three regularization methods at different stages of the query synthesis pipeline (input, prompt, and output), each offering varying degrees of performance improvement compared to models where no regularization is applied. This provides a systematic approach for optimizing synthetic data generation in data-limited, complex-query IR scenarios. All code, prompts and synthetic data are available at https://github.com/Info-Regularization/Information-Regularization.

information retrieval, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2402.162

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Evaluating Robustness of Generative Search Engine on Adversarial Factual Questions

Hu, Xuming, Li, Xiaochuan, Chen, Junzhe, Li, Yinghui, Li, Yangning, Li, Xiaoguang, Wang, Yasheng, Liu, Qun, Wen, Lijie, Yu, Philip S., Guo, Zhijiang

arXiv.org Artificial Intelligence, Feb-25-2024

Generative search engines have the potential to transform how people seek information online, but responses generated by existing search engines backed by large language models (LLMs) may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in a realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat, across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment.

generative search engine, language model, search engine, (13 more...)

arXiv.org Artificial Intelligence

2403.12077

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > Los Angeles County > Burbank (0.04)
Asia > Singapore (0.04)
(23 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Media > Music (1.00)
Media > Film (1.00)
Leisure & Entertainment (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
