AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Query Complexity of Clustering with Side Information

Arya Mazumdar, Barna Saha

Neural Information Processing SystemsOct-2-2024, 15:35:36 GMT

Suppose, we are given a set of n elements to be clustered into k (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, "do two elements u and v belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we provide a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and give strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. However, obtaining an ideal similarity function is extremely challenging due to ambiguity in data representation, poor data quality etc., and this is one of the primary reasons that makes clustering hard.

algorithm, query, side information, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > Nevada > Clark County > Las Vegas (0.05)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(3 more...)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

PairDistill: Pairwise Relevance Distillation for Dense Retrieval

Huang, Chao-Wei, Chen, Yun-Nung

arXiv.org Artificial IntelligenceOct-2-2024

Effective information retrieval (IR) from vast datasets relies on advanced techniques to extract relevant information in response to queries. Recent advancements in dense retrieval have showcased remarkable efficacy compared to traditional sparse retrieval methods. To further enhance retrieval performance, knowledge distillation techniques, often leveraging robust cross-encoder rerankers, have been extensively explored. However, existing approaches primarily distill knowledge from pointwise rerankers, which assign absolute relevance scores to documents, thus facing challenges related to inconsistent comparisons. This paper introduces Pairwise Relevance Distillation (PairDistill) to leverage pairwise reranking, offering fine-grained distinctions between similarly relevant documents to enrich the training of dense retrieval models. Our experiments demonstrate that PairDistill outperforms existing methods, achieving new state-of-the-art results across multiple benchmarks. This highlights the potential of PairDistill in advancing dense retrieval techniques effectively. Our source code and trained models are released at https://github.com/MiuLab/PairDistill

distillation, retrieval, retriever, (14 more...)

arXiv.org Artificial Intelligence

2410.01383

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Texas (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)

Add feedback

Conversational Exploratory Search of Scholarly Publications Using Knowledge Graphs

Schneider, Phillip, Matthes, Florian

arXiv.org Artificial IntelligenceOct-1-2024

Traditional search methods primarily depend on string matches, while semantic search targets concept-based matches by recognizing underlying intents and contextual meanings of search terms. Semantic search is particularly beneficial for discovering scholarly publications where differences in vocabulary between users' search terms and document content are common, often yielding irrelevant search results. Many scholarly search engines have adopted knowledge graphs to represent semantic relations between authors, publications, and research concepts. However, users may face challenges when navigating these graphical search interfaces due to the complexity and volume of data, which impedes their ability to discover publications effectively. To address this problem, we developed a conversational search system for exploring scholarly publications using a knowledge graph. We outline the methodical approach for designing and implementing the proposed system, detailing its architecture and functional components. To assess the system's effectiveness, we employed various performance metrics and conducted a human evaluation with 40 participants, demonstrating how the conversational interface compares against a graphical interface with traditional text search. The findings from our evaluation provide practical insights for advancing the design of conversational search systems.

conversational interface, interface, publication, (14 more...)

arXiv.org Artificial Intelligence

2410.00427

Country:

Asia > China > Hong Kong (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Education > Educational Setting (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(5 more...)

Add feedback

T-KAER: Transparency-enhanced Knowledge-Augmented Entity Resolution Framework

Li, Lan, Fang, Liri, Liu, Yiren, Torvik, Vetle I., Ludaescher, Bertram

arXiv.org Artificial IntelligenceSep-30-2024

Entity resolution (ER) is the process of determining whether two representations refer to the same real-world entity and plays a crucial role in data curation and data cleaning. Recent studies have introduced the KAER framework, aiming to improve pre-trained language models by augmenting external knowledge. However, identifying and documenting the external knowledge that is being augmented and understanding its contribution to the model's predictions have received little to no attention in the research community. This paper addresses this gap by introducing T-KAER, the Transparency-enhanced Knowledge-Augmented Entity Resolution framework. To enhance transparency, three Transparency-related Questions (T-Qs) have been proposed: T-Q(1): What is the experimental process for matching results based on data inputs? T-Q(2): Which semantic information does KAER augment in the raw data inputs? T-Q(3): Which semantic information of the augmented data inputs influences the predictions? To address the T-Qs, T-KAER is designed to improve transparency by documenting the entity resolution processes in log files. In experiments, a citation dataset is used to demonstrate the transparency components of T-KAER. This demonstration showcases how T-KAER facilitates error analysis from both quantitative and qualitative perspectives, providing evidence on "what" semantic information is augmented and "why" the augmented knowledge influences predictions differently.

doduo, information, sherlock, (14 more...)

arXiv.org Artificial Intelligence

2410.00218

Country: North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

QAEncoder: Towards Aligned Representation Learning in Question Answering System

Wang, Zhengren, Yu, Qinhan, Wei, Shida, Li, Zhiyu, Xiong, Feiyu, Wang, Xiaoxing, Niu, Simin, Liang, Hao, Zhang, Wentao

arXiv.org Artificial IntelligenceSep-30-2024

Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. Motivated by our conical distribution hypothesis, which posits that potential queries and documents form a cone-like structure in the embedding space, we introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments on fourteen embedding models across six languages and eight datasets validate QAEncoder's alignment capability, which offers a plug-and-play solution that seamlessly integrates with existing RAG architectures and training-based methods.

dataset, qae hyb, query, (14 more...)

arXiv.org Artificial Intelligence

2409.20434

Country:

North America > United States (0.46)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry:

Government > Space Agency (0.46)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Cross-Domain Keyword Extraction with Keyness Patterns

Zhou, Dongmei, Tang, Xuri

arXiv.org Artificial IntelligenceSep-27-2024

Domain dependence and annotation subjectivity pose challenges for supervised keyword extraction. Based on the premises that second-order keyness patterns are existent at the community level and learnable from annotated keyword extraction datasets, this paper proposes a supervised ranking approach to keyword extraction that ranks keywords with keyness patterns consisting of independent features (such as sublanguage domain and term length) and three categories of dependent features -- heuristic features, specificity features, and representavity features. The approach uses two convolutional-neural-network based models to learn keyness patterns from keyword datasets and overcomes annotation subjectivity by training the two models with bootstrap sampling strategy. Experiments demonstrate that the approach not only achieves state-of-the-art performance on ten keyword datasets in general supervised keyword extraction with an average top-10-F-measure of 0.316 , but also robust cross-domain performance with an average top-10-F-measure of 0.346 on four datasets that are excluded in the training process. Such cross-domain robustness is attributed to the fact that community-level keyness patterns are limited in number and temperately independent of language domains, the distinction between independent features and dependent features, and the sampling training strategy that balances excess risk and lack of negative training data.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2409.18724

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Media (0.67)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Incorporating Precedents for Legal Judgement Prediction on European Court of Human Rights Cases

Santosh, T. Y. S. S., Elganayni, Mohamed Hesham, Sójka, Stanisław, Grabmair, Matthias

arXiv.org Artificial IntelligenceSep-27-2024

Inspired by the legal doctrine of stare decisis, which leverages precedents (prior cases) for informed decision-making, we explore methods to integrate them into LJP models. To facilitate precedent retrieval, we train a retriever with a fine-grained relevance signal based on the overlap ratio of alleged articles between cases. We investigate two strategies to integrate precedents: direct incorporation at inference via label interpolation based on case proximity and during training via a precedent fusion module using a stacked-cross attention model. We employ joint training of the retriever and LJP models to address latent space divergence between them. Our experiments on LJP tasks from the ECHR jurisdiction reveal that integrating precedents during training coupled with joint training of the retriever and LJP model, outperforms models without precedents or with precedents incorporated only at inference, particularly benefiting sparser articles.

precedent, representation, retriever, (16 more...)

arXiv.org Artificial Intelligence

2409.18644

Country:

North America > United States (0.28)
Europe > Switzerland (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Law > Civil Rights & Constitutional Law (0.52)
Government > Regional Government (0.46)
Law > International Law (0.41)
Government > Intergovernmental Programs (0.41)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Add feedback

Open-World Evaluation for Retrieving Diverse Perspectives

Chen, Hung-Ting, Choi, Eunsol

arXiv.org Artificial IntelligenceSep-26-2024

We study retrieving a set of documents that covers various perspectives on a complex and contentious question (e.g., will ChatGPT do more harm than good?). We curate a Benchmark for Retrieval Diversity for Subjective questions (BERDS), where each example consists of a question and diverse perspectives associated with the question, sourced from survey questions and debate websites. On this data, retrievers paired with a corpus are evaluated to surface a document set that contains diverse perspectives. Our framing diverges from most retrieval tasks in that document relevancy cannot be decided by simple string matches to references. Instead, we build a language model based automatic evaluator that decides whether each retrieved document contains a perspective. This allows us to evaluate the performance of three different types of corpus (Wikipedia, web snapshot, and corpus constructed on the fly with retrieved pages from the search engine) paired with retrievers. Retrieving diverse documents remains challenging, with the outputs from existing retrievers covering all perspectives on only 33.74% of the examples. We further study the impact of query expansion and diversity-focused reranking approaches and analyze retriever sycophancy. Together, we lay the foundation for future studies in retrieval diversity handling complex queries.

computational linguistic, corpus, retriever, (15 more...)

arXiv.org Artificial Intelligence

2409.1811

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(17 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
Information Technology (0.69)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.91)

Add feedback

Enhancing Structured-Data Retrieval with GraphRAG: Soccer Data Case Study

Sepasdar, Zahra, Gautam, Sushant, Midoglu, Cise, Riegler, Michael A., Halvorsen, Pål

arXiv.org Artificial IntelligenceSep-26-2024

Extracting meaningful insights from large and complex datasets poses significant challenges, particularly in ensuring the accuracy and relevance of retrieved information. Traditional data retrieval methods such as sequential search and index-based retrieval often fail when handling intricate and interconnected data structures, resulting in incomplete or misleading outputs. To overcome these limitations, we introduce Structured-GraphRAG, a versatile framework designed to enhance information retrieval across structured datasets in natural language queries. Structured-GraphRAG utilizes multiple knowledge graphs, which represent data in a structured format and capture complex relationships between entities, enabling a more nuanced and comprehensive retrieval of information. This graph-based approach reduces the risk of errors in language model outputs by grounding responses in a structured format, thereby enhancing the reliability of results. We demonstrate the effectiveness of Structured-GraphRAG by comparing its performance with that of a recently published method using traditional retrieval-augmented generation. Our findings show that Structured-GraphRAG significantly improves query processing efficiency and reduces response times. While our case study focuses on soccer data, the framework's design is broadly applicable, offering a powerful tool for data analysis and enhancing language model applications across various structured domains.

dataset, information, node, (12 more...)

arXiv.org Artificial Intelligence

2409.1758

Country:

Europe > Norway > Eastern Norway > Oslo (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Europe > United Kingdom > England > Dorset > Bournemouth (0.04)
(3 more...)

Genre: Research Report > New Finding (0.86)

Industry: Leisure & Entertainment > Sports > Soccer (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.90)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

Liang, Lei, Sun, Mengshu, Gui, Zhengke, Zhu, Zhongshu, Jiang, Zhouyu, Zhong, Ling, Qu, Yuan, Zhao, Peilong, Bo, Zhongpu, Yang, Jin, Xiong, Huaidong, Yuan, Lin, Xu, Jun, Wang, Zaoyang, Zhang, Zhiqiang, Zhang, Wen, Chen, Huajun, Chen, Wenguang, Zhou, Jun

arXiv.org Artificial IntelligenceSep-26-2024

The recently developed retrieval-augmented generation (RAG) technology has enabled the efficient construction of domain-specific applications. However, it also has limitations, including the gap between vector similarity and the relevance of knowledge reasoning, as well as insensitivity to knowledge logic, such as numerical values, temporal relations, expert rules, and others, which hinder the effectiveness of professional knowledge services. In this work, we introduce a professional domain knowledge service framework called Knowledge Augmented Generation (KAG). KAG is designed to address the aforementioned challenges with the motivation of making full use of the advantages of knowledge graph(KG) and vector retrieval, and to improve generation and reasoning performance by bidirectionally enhancing large language models (LLMs) and KGs through five key aspects: (1) LLM-friendly knowledge representation, (2) mutual-indexing between knowledge graphs and original chunks, (3) logical-form-guided hybrid reasoning engine, (4) knowledge alignment with semantic reasoning, and (5) model capability enhancement for KAG. We compared KAG with existing RAG methods in multihop question answering and found that it significantly outperforms state-of-the-art methods, achieving a relative improvement of 19.6% on hotpotQA and 33.5% on 2wiki in terms of F1 score. We have successfully applied KAG to two professional knowledge Q&A tasks of Ant Group, including E-Government Q&A and E-Health Q&A, achieving significant improvement in professionalism compared to RAG methods. Furthermore, we will soon natively support KAG on the opensource KG engine OpenSPG, allowing developers to more easily build rigorous knowledge decision-making or convenient information retrieval services. This will facilitate the localized development of KAG, enabling developers to build domain knowledge services with higher accuracy and efficiency.

knowledge, reasoning, retrieval, (13 more...)

arXiv.org Artificial Intelligence

2409.13731

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
(15 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Government (1.00)
Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback