AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Taxonomy-guided Semantic Indexing for Academic Paper Search

Kang, SeongKu, Zhang, Yunyi, Jiang, Pengcheng, Lee, Dongha, Han, Jiawei, Yu, Hwanjo

arXiv.org Artificial Intelligence, Oct-24-2024

Academic paper search is an essential task for efficient literature discovery and scientific advancement. While dense retrieval has advanced various ad-hoc searches, it often struggles to match the underlying academic concepts between queries and documents, which is critical for paper search. To enable effective academic concept matching for paper search, we propose the Taxonomy-guided Semantic Indexing (TaxoIndex) framework. TaxoIndex extracts key concepts from papers and organizes them as a semantic index guided by an academic taxonomy, and then leverages this index as foundational knowledge to identify academic concepts and link queries and documents. As a plug-and-play framework, TaxoIndex can be flexibly employed to enhance existing dense retrievers. Extensive experiments show that TaxoIndex brings significant improvements, even with highly limited training data, and greatly enhances interpretability.

information retrieval, large language model, machine learning, (20 more...)

2410.19218

Country:

North America > United States > Illinois (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Health Misinformation in Social Networks: A Survey of IT Approaches

Papanikou, Vasiliki, Papadakos, Panagiotis, Karamanidou, Theodora, Stavropoulos, Thanos G., Pitoura, Evaggelia, Tsaparas, Panayiotis

arXiv.org Artificial Intelligence, Oct-24-2024

The spread of misinformation online, most commonly known as fake news, is an important issue that has become more pronounced in the last two decades due to the prevalence of social media. Platforms like Twitter, Reddit, and Facebook have been commonly identified as the main channels for propagating misinformation and have been criticized for not addressing the conditions that permit the circulation and amplification of false information [32]. Such misinformation includes false claims and non-fact-checked news items that originate from sources of questionable credibility [113]. The problem of misinformation becomes critical when it pertains to healthcare and health issues, since it puts lives and public health at risk. One of the first cases of widely spread misinformation in the medical domain is the falsehood that the MMR vaccine (Measles, Mumps, Rubella) causes autism [109]. The falsehood originated from a fraudulent article titled "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" published in the prestigious Lancet journal in 1998 [171, 197]. This study turned tens of thousands of parents against the vaccine, and as a result, in 2020, many countries, including the United Kingdom, Greece, Venezuela, and Brazil, lost their measles elimination status. In 2010, twelve years after publishing this study, the Lancet retracted the paper [203].

information retrieval, large language model, machine learning, (20 more...)

2410.1867

Country:

Europe > United Kingdom (0.47)
South America > Venezuela (0.24)
South America > Brazil (0.24)
(20 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area > Vaccines (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(6 more...)

Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval

Hwang, Dae Yon, Taha, Bilal, Pande, Harshit, Nechaev, Yaroslav

arXiv.org Artificial Intelligence, Oct-24-2024

Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new domains, languages, and newly-released use cases that lack historical query traffic from existing users. For such cases, it is common to use query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of the UDL across diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The developed code for reproducibility is included in https://github.com/eoduself/UDL

arxiv preprint arxiv, dataset, query, (13 more...)

2410.18385

Country:

North America > Canada > Ontario > Toronto (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States (0.04)
(7 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Mapping the Media Landscape: Predicting Factual Reporting and Political Bias Through Web Interactions

Sánchez-Cortés, Dairazalia, Burdisso, Sergio, Villatoro-Tello, Esaú, Motlicek, Petr

arXiv.org Artificial Intelligence, Oct-23-2024

Bias assessment of news sources is paramount for professionals, organizations, and researchers who rely on truthful evidence for information gathering and reporting. While certain bias indicators are discernible from content analysis, descriptors like political bias and fake news pose greater challenges. In this paper, we propose an extension to a recently presented news media reliability estimation method that focuses on modeling outlets and their longitudinal web interactions. Concretely, we assess the classification performance of four reinforcement learning strategies on a large news media hyperlink graph. Our experiments, targeting two challenging bias descriptors, factual reporting and political bias, showed a significant performance improvement at the source media level. Additionally, we validate our methods on the CLEF 2023 CheckThat! Lab challenge, outperforming the reported results in both the F1-score and the official MAE metric. Furthermore, we contribute by releasing the largest annotated dataset of news source media, categorized with factual reporting and political bias labels. Our findings suggest that profiling news media sources based on their hyperlink interactions over time is feasible, offering a bird's-eye view of evolving media landscapes.

information retrieval, machine learning, reinforcement learning, (18 more...)

doi: 10.1007/978-3-031-71736-9_7

2410.17655

Country:

Europe > Switzerland (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.37)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes

Schopf, Tim, Blatzheim, Alexander, Machner, Nektarios, Matthes, Florian

arXiv.org Artificial Intelligence, Oct-21-2024

Scientific document classification is a critical task and often involves many classes. However, collecting human-labeled data for many classes is expensive and usually leads to label-scarce scenarios. Moreover, recent work has shown that sentence embedding model fine-tuning for few-shot classification is efficient, robust, and effective. In this work, we propose FusionSent (Fusion-based Sentence Embedding Fine-tuning), an efficient and prompt-free approach for few-shot classification of scientific documents with many classes. FusionSent uses available training examples and their respective label texts to contrastively fine-tune two different sentence embedding models. Afterward, the parameters of both fine-tuned models are fused to combine the complementary knowledge from the separate fine-tuning steps into a single model. Finally, the resulting sentence embedding model is frozen to embed the training instances, which are then used as input features to train a classification head. Our experiments show that FusionSent significantly outperforms strong baselines by an average of 6.0 F1 points across multiple scientific document classification datasets. In addition, we introduce a new dataset for multi-label classification of scientific documents, which contains 203,961 scientific articles and 130 classes from the arXiv category taxonomy. Code and data are available at https://github.com/sebischair/FusionSent.
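
To make the parameter-fusion step in the abstract above concrete, here is a minimal sketch. The abstract does not say how the two fine-tuned models are fused, so the sketch assumes plain element-wise interpolation of same-shaped parameters; the function name `fuse_parameters`, the `weight` argument, and the dictionary-of-lists parameter layout are all hypothetical and are not taken from the FusionSent code.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def fuse_parameters(model_a: Params, model_b: Params, weight: float = 0.5) -> Params:
    """Element-wise interpolation of two parameter sets (assumed fusion rule)."""
    if model_a.keys() != model_b.keys():
        raise ValueError("both models must share the same parameter names")
    fused: Params = {}
    for name, values_a in model_a.items():
        values_b = model_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # weight = 0.5 is a plain average of the two fine-tuned models.
        fused[name] = [
            weight * a + (1.0 - weight) * b for a, b in zip(values_a, values_b)
        ]
    return fused


if __name__ == "__main__":
    a = {"w": [1.0, 3.0]}
    b = {"w": [3.0, 5.0]}
    print(fuse_parameters(a, b))
```

In this reading, the fused parameters would then be loaded into a single frozen embedding model whose outputs feed the classification head.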

classification, information retrieval, machine learning, (21 more...)

2410.0577

Country:

North America > United States > New York > New York County > New York City (0.05)
Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(16 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)

A Survey of Conversational Search

Mo, Fengran, Mao, Kelong, Zhao, Ziliang, Qian, Hongjin, Chen, Haonan, Cheng, Yiruo, Li, Xiaoxi, Zhu, Yutao, Dou, Zhicheng, Nie, Jian-Yun

arXiv.org Artificial Intelligence, Oct-20-2024

As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.

large language model, machine learning, natural language, (16 more...)

2410.15576

Country:

Asia > Singapore (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
(26 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Health & Medicine (1.00)
Banking & Finance > Trading (0.67)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.46)

Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

Lv, Qitan, Wang, Jie, Chen, Hanzhu, Li, Bin, Zhang, Yongdong, Wu, Feng

arXiv.org Artificial Intelligence, Oct-19-2024

Generation of plausible but incorrect factual information, often termed hallucination, has attracted significant research interest. Retrieval-augmented language model (RALM), which enhances models with up-to-date knowledge, emerges as a promising method to reduce hallucination. However, existing RALMs may instead exacerbate hallucination when retrieving lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method to focus on different granularity-level key texts, thereby avoiding getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, recaller applies a knowledge graph to extract potential key entities in a given context. Second, scorer measures the importance of each entity by calculating its contextual weight. Finally, selector selects high contextual weight entities with a dynamic threshold algorithm and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to a superior performance of over 30% in the F1 score metric. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.
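
The recaller, scorer, and selector pipeline described above can be sketched roughly as follows. This is an illustration under loud assumptions, not the COFT implementation: the recaller here is a lookup against a fixed entity set instead of a knowledge graph, the contextual weight is approximated by relative mention frequency, and the "dynamic threshold" is taken to be the mean weight of the recalled entities. All function names are hypothetical.

```python
import re
from typing import Dict, List, Set


def recall_entities(context: str, known_entities: Set[str]) -> List[str]:
    """Stand-in recaller: keep known entities that occur in the context."""
    lowered = context.lower()
    return sorted(e for e in known_entities if e.lower() in lowered)


def score_entities(context: str, entities: List[str]) -> Dict[str, float]:
    """Stand-in scorer: share of all entity mentions that belong to each entity."""
    lowered = context.lower()
    counts = {e: lowered.count(e.lower()) for e in entities}
    total = sum(counts.values()) or 1
    return {e: c / total for e, c in counts.items()}


def select_entities(weights: Dict[str, float]) -> List[str]:
    """Stand-in selector: the threshold adapts to the context (mean weight)."""
    if not weights:
        return []
    threshold = sum(weights.values()) / len(weights)
    return sorted(e for e, w in weights.items() if w >= threshold)


def highlight_sentences(context: str, selected: List[str]) -> List[str]:
    """Sentence-level highlighting: keep sentences that mention a selected entity."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    return [s for s in sentences if any(e.lower() in s.lower() for e in selected)]


if __name__ == "__main__":
    text = "Marie Curie won the Nobel Prize. Curie studied radium. Paris is a city."
    entities = recall_entities(text, {"Curie", "Paris", "Berlin"})
    chosen = select_entities(score_entities(text, entities))
    print(highlight_sentences(text, chosen))
```

Only the sentence granularity is shown; paragraph- and word-level highlighting would apply the same selection at a coarser or finer split.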

coft, large language model, machine learning, (16 more...)

2410.15116

Country:

North America > United States (0.93)
Asia > China (0.04)
Europe > Russia (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.48)

Industry:

Education (1.00)
Energy > Power Industry > Utilities > Nuclear (0.68)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.92)

Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining

Meghwani, Hansa

arXiv.org Artificial Intelligence, Oct-18-2024

Ranking consistently emerges as a primary focus in information retrieval research. Retrieval and ranking models serve as the foundation for numerous applications, including web search, open domain QA, enterprise domain QA, and text-based recommender systems. Typically, these models undergo training on triplets consisting of binary relevance assignments, comprising one positive and one negative passage. However, their utilization involves a context where a significantly more nuanced understanding of relevance is necessary, especially when re-ranking a large pool of potentially relevant passages. Although collecting positive examples through user feedback like impressions or clicks is straightforward, identifying suitable negative pairs from a vast pool of possibly millions or even billions of documents poses a greater challenge. Generating a substantial number of negative pairs is often necessary to maintain the high quality of the model. Several approaches have been suggested in the literature to tackle the issue of selecting suitable negative pairs from an extensive corpus. This study focuses on explaining the crucial role of hard negatives in the training process of cross-encoder models, specifically aiming to explain the performance gains observed with hard negative sampling compared to random sampling. We have developed a robust hard negative mining technique for efficient training of cross-encoder re-rank models on an enterprise dataset which has domain-specific context. We provide a novel perspective to enhance retrieval models, ultimately influencing the performance of advanced LLM systems like Retrieval-Augmented Generation (RAG) and Reasoning and Action Agents (ReAct). The proposed approach demonstrates that learning both similarity and dissimilarity simultaneously with cross-encoders improves performance of retrieval systems.
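
As a generic illustration of what "hard negative" selection means in the abstract above, the sketch below picks the non-relevant documents that score closest to the query, as opposed to sampling negatives at random. It uses a single cosine-similarity scorer over toy vectors; the abstract does not describe the paper's ensemble procedure, so none of this should be read as that method. The function names and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    query: Vector, docs: Dict[str, Vector], positives: Set[str], k: int = 2
) -> List[str]:
    """Return the k non-positive documents most similar to the query."""
    candidates = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in docs.items()
        if doc_id not in positives
    ]
    # Highest similarity first; ties broken by document id for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in candidates[:k]]


if __name__ == "__main__":
    docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.5, 0.5]}
    print(mine_hard_negatives([1.0, 0.0], docs, positives={"d1"}, k=2))
```

Each mined document id would be paired with the query and a positive passage to form a training triplet for the re-ranker.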

information retrieval, large language model, machine learning, (21 more...)

2411.02404

Country:

North America > United States (0.04)
Asia > India (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry:

Information Technology (0.67)
Materials > Metals & Mining (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Efficiently Computing Susceptibility to Context in Language Models

Liu, Tianyu, Du, Kevin, Sachan, Mrinmaya, Cotterell, Ryan

arXiv.org Artificial Intelligence, Oct-18-2024

One strength of modern language models is their ability to incorporate information from a user-input context when answering queries. However, they are not equally sensitive to the subtle changes to that context. To quantify this, Du et al. (2024) gives an information-theoretic metric to measure such sensitivity. Their metric, susceptibility, is defined as the degree to which contexts can influence a model's response to a query at a distributional level. However, exactly computing susceptibility is difficult and, thus, Du et al. (2024) falls back on a Monte Carlo approximation. Due to the large number of samples required, the Monte Carlo approximation is inefficient in practice. As a faster alternative, we propose Fisher susceptibility, an efficient method to estimate the susceptibility based on Fisher information. Empirically, we validate that Fisher susceptibility is comparable to Monte Carlo estimated susceptibility across a diverse set of query domains despite being 70× faster. Exploiting the improved efficiency, we apply Fisher susceptibility to analyze factors affecting the susceptibility of language models. We observe that larger models are as susceptible as smaller ones.

large language model, machine learning, susceptibility, (18 more...)

2410.14361

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Ireland (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.56)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.34)

SwaQuAD-24: QA Benchmark Dataset in Swahili

Kondoro, Alfred Malengo

arXiv.org Artificial Intelligence, Oct-18-2024

This paper proposes the creation of a Swahili Question Answering (QA) benchmark dataset, aimed at addressing the underrepresentation of Swahili in natural language processing (NLP). Drawing from established benchmarks like SQuAD, GLUE, KenSwQuAD, and KLUE, the dataset will focus on providing high-quality, annotated question-answer pairs that capture the linguistic diversity and complexity of Swahili. The dataset is designed to support a variety of applications, including machine translation, information retrieval, and social services like healthcare chatbots. Ethical considerations, such as data privacy, bias mitigation, and inclusivity, are central to the dataset's development. Additionally, the paper outlines future expansion plans to include domain-specific content, multimodal integration, and broader crowdsourcing efforts. The Swahili QA dataset aims to foster technological innovation in East Africa and provide an essential resource for NLP research and applications in low-resource languages. The East Africa region boasts a rich Swahili linguistic heritage, with the language being spoken by millions across the region [1]. Tanzania promoted Swahili to national language status in favour of other ethnic languages as part of efforts to foster national unity.

artificial intelligence, information retrieval, natural language, (13 more...)

2410.14289

Country:

Africa > East Africa (0.46)
Africa > Tanzania (0.25)
Asia > South Korea > Seoul > Seoul (0.04)
(3 more...)

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (1.00)
Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
