AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Inducing Diversity in Differentiable Search Indexing

Phatak, Abhijeet, Sachdev, Jayant, Rosario, Sean D, Kirti, Swati, Tripathy, Chittaranjan

arXiv.org Artificial Intelligence, Feb-4-2025

Differentiable Search Indexing (DSI) is a recent paradigm for information retrieval which uses a transformer-based neural network architecture as the document index to simplify the retrieval process. A differentiable index has many advantages, enabling modifications, updates or extensions to the index. In this work, we explore balancing relevance and novel information content (diversity) for training DSI systems inspired by Maximal Marginal Relevance (MMR), and show the benefits of our approach over naive DSI training. We present quantitative and qualitative evaluations of relevance and diversity measures obtained using our method on the NQ320K and MSMARCO datasets in comparison to naive DSI. With our approach, it is possible to achieve diversity without any significant impact on relevance. Since we induce diversity while training DSI, the trained model has learned to diversify while being relevant. This obviates the need for a post-processing step to induce diversity in the recall set, as typically performed using MMR. Our approach will be useful for Information Retrieval problems where both relevance and diversity are important, such as sub-topic retrieval. Our work can also be easily extended to incremental DSI settings, which would enable fast updates to the index while retrieving a diverse recall set.
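For orientation, the MMR post-processing step that the abstract says is typically applied to a recall set can be sketched as below. This is the classic greedy MMR re-ranking, not the authors' training procedure; the function name `mmr_rerank`, the trade-off parameter `lam`, and the toy relevance and similarity values are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def mmr_rerank(
    relevance: Dict[str, float],
    similarity: Dict[Tuple[str, str], float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedily pick k document ids, trading relevance against redundancy."""

    def sim(a: str, b: str) -> float:
        # Similarity is treated as symmetric; unknown pairs count as 0.
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def score(doc: str) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((sim(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: d1 and d2 are near-duplicates, d3 is different.
relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
similarity = {("d1", "d2"): 0.95, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}

print(mmr_rerank(relevance, similarity, k=2, lam=0.5))
print(mmr_rerank(relevance, similarity, k=2, lam=1.0))
```

With `lam=1.0` the selection reduces to plain relevance ranking; lowering `lam` penalizes documents that resemble ones already selected.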

information retrieval, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2502.02788

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
North America > United States > California > Sacramento County > Sacramento (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

COVE: COntext and VEracity prediction for out-of-context images

Tonglet, Jonathan, Thiem, Gabriel, Gurevych, Iryna

arXiv.org Artificial Intelligence, Feb-3-2025

Images taken out of their context are the most prevalent form of multimodal misinformation. Debunking them requires (1) providing the true context of the image and (2) checking the veracity of the image's caption. However, existing automated fact-checking methods fail to tackle both objectives explicitly. In this work, we introduce COVE, a new method that predicts first the true COntext of the image and then uses it to predict the VEracity of the caption. COVE beats the SOTA context prediction model on all context items, often by more than five percentage points. It is competitive with the best veracity prediction models on synthetic data and outperforms them on real-world data, showing that it is beneficial to combine the two tasks sequentially. Finally, we conduct a human study that reveals that the predicted context is a reusable and interpretable artifact to verify new out-of-context captions for the same image. Our code and data are made available.

caption, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2502.01194

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Singapore (0.04)
(20 more...)

Genre: Research Report (0.64)

Industry: Media > News (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Query Brand Entity Linking in E-Commerce Search

Liu, Dong, Nag, Sreyashi

arXiv.org Artificial Intelligence, Feb-3-2025

In this work, we address the brand entity linking problem for e-commerce search queries. The entity linking task is done by either i) a two-stage process consisting of entity mention detection followed by entity disambiguation or ii) an end-to-end linking approach that directly fetches the target entity given the input text. The task presents unique challenges: queries are extremely short (averaging 2.4 words), lack natural language structure, and must handle a massive space of unique brands. We present a two-stage approach combining named-entity recognition with matching, and a novel end-to-end solution using extreme multi-class classification.

... Western brand name written in its original form versus its representation in Asian scripts), (ii) different surface forms for the same brand (e.g., abbreviations versus full names) and (iii) identifying brand relationships between parent and sub-brands (e.g., a parent company and its product line brands). Therefore, in addition to recognizing the brand names mentioned in the query, it is also important to link them to the corresponding global brand entity. It would be valuable to unify the concept of brand across different e-commercial stores in a single namespace, i.e., brand entity (identity to each brand itself). Each brand entity is unique across languages, stores and surface forms. As part of this effort, we aim to recognize ...

brand entity, brand name, query, (14 more...)

arXiv.org Artificial Intelligence

2502.01555

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > District of Columbia > Washington (0.04)
(5 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)

Predicting potentially unfair clauses in Chilean terms of services with natural language processing

Loeffler, Christoffer, Freile, Andrea Martínez, Pizarro, Tomás Rey

arXiv.org Artificial Intelligence, Feb-2-2025

This study addresses the growing concern of information asymmetry in consumer contracts, exacerbated by the proliferation of online services with complex Terms of Service that are rarely even read. Although automatic analysis methods are being researched, the problem is aggravated by the general focus on English-language Machine Learning approaches and on major jurisdictions, such as the European Union. We introduce a new methodology and a substantial dataset addressing this gap. We propose a novel annotation scheme with four categories and a total of 20 classes, and apply it to 50 online Terms of Service used in Chile. Our evaluation of transformer-based models highlights how factors like language- and/or domain-specific pre-training, few-shot sample size, and model architecture affect the detection and classification of potentially abusive clauses. Results show a large variability in performance for the different tasks and models, with the highest macro-F1 scores for the detection task ranging from 79% to 89% and micro-F1 scores up to 96%, while macro-F1 scores for the classification task range from 60% to 70% and micro-F1 scores from 64% to 80%. Notably, this is the first Spanish-language multi-label classification dataset for legal clauses, applying Chilean law and offering a comprehensive evaluation of Spanish-language models in the legal domain. Our work lays the ground for future research in method development for rarely considered legal analysis and potentially leads to practical applications to support consumers in Chile and Latin America as a whole.
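The gap between the macro-F1 and micro-F1 figures quoted above is typical of imbalanced label sets. The sketch below shows how the two averages are computed. It assumes a simplified single-label setting (the paper's dataset is multi-label), and the label names and toy predictions are hypothetical, not taken from the study.

```python
from collections import Counter
from typing import List, Tuple


def macro_micro_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Return (macro-F1, micro-F1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1 scores, so rare classes weigh equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical imbalanced example: 8 "fair" clauses, 2 "unfair" clauses,
# with one "unfair" clause misclassified as "fair".
y_true = ["fair"] * 8 + ["unfair"] * 2
y_pred = ["fair"] * 8 + ["fair", "unfair"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1={macro:.3f} micro-F1={micro:.3f}")
```

A single error on the rare class lowers macro-F1 much more than micro-F1, which is one reason the two scores can diverge as they do in the reported results.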

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2502.00865

Country:

North America > Central America (0.24)
South America > Brazil (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Government > Regional Government > Europe Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.95)

Improving DBMS Scheduling Decisions with Fine-grained Performance Prediction on Concurrent Queries -- Extended

Wu, Ziniu, Markakis, Markos, Liu, Chunwei, Chen, Peter Baile, Narayanaswamy, Balakrishnan, Kraska, Tim, Madden, Samuel

arXiv.org Artificial Intelligence, Jan-31-2025

Query scheduling is a critical task that directly impacts query performance in database management systems (DBMS). Deeply integrated schedulers, which require changes to DBMS internals, are usually customized for a specific engine and can take months to implement. In contrast, non-intrusive schedulers make coarse-grained decisions, such as controlling query admission and re-ordering query execution, without requiring modifications to DBMS internals. They require much less engineering effort and can be applied across a wide range of DBMS engines, offering immediate benefits to end users. However, most existing non-intrusive scheduling systems rely on simplified cost models and heuristics that cannot accurately model query interactions under concurrency and different system states, possibly leading to suboptimal scheduling decisions. This work introduces IconqSched, a new, principled non-intrusive scheduler that optimizes the execution order and timing of queries to improve total end-to-end runtime as experienced by the user (query queuing time plus system runtime). Unlike previous approaches, IconqSched features a novel fine-grained predictor, Iconq, which treats the DBMS as a black box and accurately estimates the system runtime of concurrently executed queries under different system states. Using these predictions, IconqSched is able to capture system runtime variations across different query mixes and system loads. It then employs a greedy scheduling algorithm to effectively determine which queries to submit and when to submit them. We compare IconqSched to other schedulers in terms of end-to-end runtime using real workload traces. On Postgres, IconqSched reduces end-to-end runtime by 16.2%-28.2% on average and 33.6%-38.9% in the tail. Similarly, on Redshift, it reduces end-to-end runtime by 10.3%-14.1% on average and 14.9%-22.2% in the tail.
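To make the idea of non-intrusive, prediction-driven admission concrete, here is a toy sketch of a greedy admission loop around a black-box runtime predictor. It is not IconqSched's algorithm: the admission rule, the `max_slowdown` threshold, the `toy_predict` cost model and the query names are all hypothetical, chosen only to illustrate the general pattern the abstract describes.

```python
from typing import Callable, Dict, List, Sequence

# A predictor maps (query, currently running queries) to a runtime estimate.
Predictor = Callable[[str, Sequence[str]], float]


def end_to_end_runtime(queue_time: float, system_runtime: float) -> float:
    """End-to-end runtime as defined in the abstract: queuing plus execution."""
    return queue_time + system_runtime


def greedy_admit(
    queue: Sequence[str],
    running: Sequence[str],
    predict: Predictor,
    max_slowdown: float = 1.5,
) -> List[str]:
    """Admit queued queries, in order, while predicted interference stays low.

    A query is submitted now only if its predicted runtime next to the
    currently running set is at most max_slowdown times its predicted
    runtime on an idle system (an assumed rule, for illustration only).
    """
    admitted: List[str] = []
    current = list(running)
    for query in queue:
        alone = predict(query, [])
        concurrent = predict(query, current)
        if concurrent <= max_slowdown * alone:
            admitted.append(query)
            current.append(query)  # later decisions see the added load
    return admitted


# Hypothetical cost model: each concurrent query adds 30% to the runtime.
BASE_RUNTIME: Dict[str, float] = {"q1": 10.0, "q2": 4.0, "q3": 2.0}


def toy_predict(query: str, running: Sequence[str]) -> float:
    return BASE_RUNTIME[query] * (1 + 0.3 * len(running))


print(greedy_admit(["q1", "q2", "q3"], [], toy_predict))
```

Queries that are not admitted stay in the queue and accumulate queuing time, so a real scheduler has to weigh that delay against the predicted slowdown from running concurrently.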

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.16256

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Greater London > London (0.05)
North America > United States > New York > New York County > New York City (0.04)
(10 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.67)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Comprehensive Survey on Legal Summarization: Challenges and Future Directions

Akter, Mousumi, Çano, Erion, Weber, Erik, Dobler, Dennis, Habernal, Ivan

arXiv.org Artificial Intelligence, Jan-29-2025

The constant engagement with extensive written materials is fundamental and immensely time-consuming [104]. Legal professionals often spend hours, if not days, combing through documents to find precedents or relevant cases that could be pivotal to their current cases. This laborious process is a significant part of the workload of legal professionals like lawyers and judges, taking up lots of time that could be invested otherwise. Automatic summarization tools could help to condense lengthy legal documents into concise summaries, helping to save both time and costs. Moreover, integrating advanced Natural Language Processing (NLP) techniques into legal research holds significant promise for democratizing access to legal information. Figure 1 shows the general pipeline for legal summarization. Compared to other domains, legal texts present unique challenges that distinguish them from other document types. Legal documents tend to be longer and more detailed than those from other domains.

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2501.1783

Country:

Europe > Germany (0.14)
Oceania > Australia (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(34 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law > Litigation (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Government > Regional Government > Europe Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Reqo: A Robust and Explainable Query Optimization Cost Model

Chang, Baoming, Kamali, Amin, Kantere, Verena

arXiv.org Artificial Intelligence, Jan-28-2025

In recent years, there has been a growing interest in using machine learning (ML) in query optimization to select more efficient plans. Existing learning-based query optimizers use certain model architectures to convert tree-structured query plans into representations suitable for downstream ML tasks. As the design of these architectures significantly impacts cost estimation, we propose a tree model architecture based on Bidirectional Graph Neural Networks (Bi-GNN) aggregated by Gated Recurrent Units (GRUs) to achieve more accurate cost estimates. The inherent uncertainty of data and model parameters also leads to inaccurate cost estimates, resulting in suboptimal plans and less robust query performance. To address this, we implement a novel learning-to-rank cost model that effectively quantifies the uncertainty in cost estimates using approximate probabilistic ML. This model adaptively integrates quantified uncertainty with estimated costs and learns from comparing pairwise plans, achieving more robust performance. In addition, we propose the first explainability technique specifically designed for learning-based cost models. This technique explains the contribution of any subgraphs in the query plan to the final predicted cost, which can be integrated and trained with any learning-based cost model to significantly boost the model's explainability. By incorporating these innovations, we propose a cost model for a Robust and Explainable Query Optimizer, Reqo, that improves the accuracy, robustness, and explainability of cost estimation, outperforming state-of-the-art approaches in all three dimensions.

contribution, cost model, query plan, (15 more...)

arXiv.org Artificial Intelligence

2501.17414

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > New York > New York County > New York City (0.04)
(9 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Survey: Understand the challenges of Machine Learning Experts using Named Entity Recognition Tools

Freund, Florian, Tamla, Philippe, Hemmje, Matthias

arXiv.org Artificial Intelligence, Jan-27-2025

This paper presents a survey based on Kasunic's survey research methodology to identify the criteria used by Machine Learning (ML) experts to evaluate Named Entity Recognition (NER) tools and frameworks. Comparison and selection of NER tools and frameworks is a critical step in leveraging NER for Information Retrieval to support the development of Clinical Practice Guidelines. In addition, this study examines the main challenges faced by ML experts when choosing suitable NER tools and frameworks. Using Nunamaker's methodology, the article begins with an introduction to the topic, contextualizes the research, reviews the state-of-the-art in science and technology, and identifies challenges for an expert survey on NER tools and frameworks. This is followed by a description of the survey's design and implementation. The paper concludes with an evaluation of the survey results and the insights gained, ending with a summary and conclusions.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.5121/csit.2024.150208

2501.16112

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
North America > United States > New York > Richmond County > New York City (0.04)
(21 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Information Technology > Services (1.00)
Health & Medicine (1.00)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Impact and influence of modern AI in metadata management

Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong

arXiv.org Artificial Intelligence, Jan-27-2025

Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.

data mining, information retrieval, machine learning, (24 more...)

arXiv.org Artificial Intelligence

2501.16605

Country:

Oceania > Australia > Tasmania (0.04)
Europe > United Kingdom (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Industry:

Law (1.00)
Information Technology > Services (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Information Management > Metadata Management (1.00)
Information Technology > Data Science > Data Quality (1.00)
(5 more...)

3CEL: A corpus of legal Spanish contract clauses

García, Nuria Aldama, Morales, Patricia Marsà, Sánchez, David Betancur, Jiménez, Álvaro Barbero, Nieto, Marta Guerrero, Coll, Pablo Haya, Chozas, Patricia Martín, Ponsoda, Elena Montiel

arXiv.org Artificial Intelligence, Jan-27-2025

Information extraction (IE) is defined as the NLP task that deals with the identification of particular pieces of information in unstructured documents [1, 2, 3]. In other words, the main objective of IE is to spot predefined relevant information in raw text. IE includes different subtypes depending on the nature of the information to be extracted. Thus, Named Entity Recognition (NER), Co-Reference Resolution, Relation Extraction or Event Extraction are encompassed under the umbrella of IE [2]. IE encounters specific challenges, particularly with regard to data availability and the need for expert knowledge. First, access to raw data is limited depending on the target domain (e.g.

category, contract, information, (12 more...)

arXiv.org Artificial Intelligence

2501.1599

Country:

Europe > Spain > Galicia > Madrid (0.07)
North America > United States (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Europe > Greece (0.04)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
