AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Geometry-Aware Active Learning of Pattern Rankings via Choquet-Based Aggregation

Opran, Tudor Matei, Loudni, Samir

arXiv.org Artificial Intelligence, Jul-22-2025

We address the pattern explosion problem in pattern mining by proposing an interactive learning framework that combines nonlinear utility aggregation with geometry-aware query selection. Our method models user preferences through a Choquet integral over multiple interestingness measures and exploits the geometric structure of the version space to guide the selection of informative comparisons. A branch-and-bound strategy with tight distance bounds enables efficient identification of queries near the decision boundary. Experiments on UCI datasets show that our approach outperforms existing methods such as ChoquetRank, achieving better ranking accuracy with fewer user interactions.
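
The Choquet-based aggregation mentioned in the abstract can be illustrated with a minimal sketch of the discrete Choquet integral. The measure names (`support`, `lift`), the capacity values, and the two example patterns are hypothetical and chosen only for illustration; this is not the authors' implementation, and it omits the query-selection and branch-and-bound parts entirely.

```python
from typing import Dict, FrozenSet


def choquet_integral(scores: Dict[str, float],
                     capacity: Dict[FrozenSet[str], float]) -> float:
    """Discrete Choquet integral of non-negative scores w.r.t. a capacity.

    `capacity` maps every coalition (subset of criteria) to a weight in [0, 1],
    with capacity[empty set] = 0 and capacity[all criteria] = 1.
    """
    ordered = sorted(scores, key=scores.get)  # criteria by ascending score
    total, previous = 0.0, 0.0
    for i, name in enumerate(ordered):
        # Coalition of criteria whose score is at least scores[name]
        coalition = frozenset(ordered[i:])
        total += (scores[name] - previous) * capacity[coalition]
        previous = scores[name]
    return total


# Hypothetical capacity over two interestingness measures (assumed values).
capacity = {
    frozenset(): 0.0,
    frozenset({"support"}): 0.3,
    frozenset({"lift"}): 0.5,
    frozenset({"support", "lift"}): 1.0,
}

# Hypothetical patterns scored on both measures.
pattern_a = {"support": 0.8, "lift": 0.2}
pattern_b = {"support": 0.4, "lift": 0.6}

utility_a = choquet_integral(pattern_a, capacity)
utility_b = choquet_integral(pattern_b, capacity)
```

Because the single-measure weights here sum to less than the weight of the pair, the capacity rewards patterns that score reasonably on both measures, which a plain weighted sum cannot express. Under these assumed values, `pattern_b` receives the higher utility.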

data mining, information retrieval, machine learning, (20 more...)

2507.14217

Country: North America > United States > New York > New York County > New York City (0.14)

Genre: Research Report (0.40)

Industry: Education (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.67)

Let's Measure the Elephant in the Room: Facilitating Personalized Automated Analysis of Privacy Policies at Scale

Zhao, Rui, Melnychuk, Vladyslav, Zhao, Jun, Wright, Jesse, Shadbolt, Nigel

arXiv.org Artificial Intelligence, Jul-22-2025

In modern times, people have numerous online accounts, but they rarely read the Terms of Service or Privacy Policy of those sites despite claiming otherwise. This paper introduces PoliAnalyzer, a neuro-symbolic system that assists users with personalized privacy policy analysis. PoliAnalyzer uses Natural Language Processing (NLP) to extract formal representations of data usage practices from policy texts. To keep the analysis deterministic, logical inference is applied to compare user preferences with the formal privacy policy representation and produce a compliance report. To achieve this, we extend an existing formal Data Terms of Use policy language to model privacy policies as app policies and user preferences as data policies. In our evaluation using our enriched PolicyIE dataset curated by legal experts, PoliAnalyzer demonstrated high accuracy in identifying relevant data usage practices, achieving an F1-score of 90-100% across most tasks. Additionally, we demonstrate how PoliAnalyzer can model diverse user data-sharing preferences, derived from prior research as 23 user profiles, and perform compliance analysis against the top 100 most-visited websites. This analysis revealed that, on average, 95.2% of a privacy policy's segments do not conflict with the analyzed user preferences, enabling users to concentrate on understanding the 4.8% (636 / 13205) that violates preferences, significantly reducing cognitive burden. Further, we identified common practices in privacy policies that violate user expectations, such as the sharing of location data with third parties. This paper demonstrates that PoliAnalyzer can support automated personalized privacy policy analysis at scale using off-the-shelf NLP tools. This sheds light on a pathway to help individuals regain control over their data and encourage societal discussions on platform data practices to promote a fairer power dynamic.

information retrieval, large language model, natural language, (18 more...)

2507.14214

Country:

Europe (1.00)
North America > United States (0.14)

Genre: Research Report (0.82)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Graph-Structured Data Analysis of Component Failure in Autonomous Cargo Ships Based on Feature Fusion

Zhang, Zizhao, Zhao, Tianxiang, Sun, Yu, Sun, Liping, Kang, Jichuan

arXiv.org Artificial Intelligence, Jul-21-2025

To address the challenges posed by cascading reactions caused by component failures in autonomous cargo ships (ACS) and the uncertainties in emergency decision-making, this paper proposes a novel hybrid feature fusion framework for constructing a graph-structured dataset of failure modes. By employing an improved cuckoo search algorithm (HN-CSA), the literature retrieval efficiency is significantly enhanced, achieving improvements of 7.1% and 3.4% compared to the NSGA-II and CSA search algorithms, respectively. A hierarchical feature fusion framework is constructed, using Word2Vec encoding to encode subsystem/component features, BERT-KPCA to process failure modes/reasons, and Sentence-BERT to quantify the semantic association between failure impact and emergency decision-making. The dataset covers 12 systems, 1,262 failure modes, and 6,150 propagation paths. Validation results show that the GATE-GNN model achieves a classification accuracy of 0.735, comparable to existing benchmarks. Additionally, a silhouette coefficient of 0.641 indicates that the features are highly distinguishable. In the label prediction results, the Shore-based Meteorological Service System achieved an F1 score of 0.93, demonstrating high prediction accuracy. This paper not only provides a solid foundation for failure analysis in autonomous cargo ships but also offers reliable support for fault diagnosis, risk assessment, and intelligent decision-making systems. The link to the dataset is https://github.com/wojiufukele/Graph-Structured-about-CSA.

evolutionary algorithm, information retrieval, machine learning, (19 more...)

2507.13721

Country: Asia > China (0.68)

Genre: Research Report > New Finding (0.66)

Industry:

Transportation > Marine (0.90)
Transportation > Freight & Logistics Services > Shipping (0.90)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
(8 more...)

GraphTrafficGPT: Enhancing Traffic Management Through Graph-Based AI Agent Coordination

Taleb, Nabil Abdelaziz Ferhat, Rezaei, Abdolazim, Patel, Raj Atulkumar, Sookhak, Mehdi

arXiv.org Artificial Intelligence, Jul-21-2025

Large Language Models (LLMs) offer significant promise for intelligent traffic management; however, current chain-based systems like TrafficGPT are hindered by sequential task execution, high token usage, and poor scalability, making them inefficient for complex, real-world scenarios. To address these limitations, we propose GraphTrafficGPT, a novel graph-based architecture, which fundamentally redesigns the task coordination process for LLM-driven traffic applications. GraphTrafficGPT represents tasks and their dependencies as nodes and edges in a directed graph, enabling efficient parallel execution and dynamic resource allocation. The main idea behind the proposed model is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates a network of specialized agents for data retrieval, analysis, visualization, and simulation. By introducing advanced context-aware token management and supporting concurrent multi-query processing, the proposed architecture handles interdependent tasks typical of modern urban mobility environments. Experimental results demonstrate that GraphTrafficGPT reduces token consumption by 50.2% and average response latency by 19.0% compared to TrafficGPT, while supporting simultaneous multi-query execution with up to 23.0% improvement in efficiency. Large Language Models (LLMs) have changed artificial intelligence capabilities across domains by enabling natural language understanding and generation at new levels. The recent models, such as GPT-4, Claude, and Llama, can comprehend complex instructions, reason through problems, and generate coherent responses across diverse applications [1].
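
The idea of representing tasks and their dependencies as a directed graph so that independent tasks can run in parallel can be sketched with Python's standard `graphlib` module. The task names and the `parallel_batches` helper below are hypothetical; this is a generic dependency-graph illustration, not the Brain Agent or any part of the GraphTrafficGPT system.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; tasks within one batch have no unmet dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run concurrently
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch finished, unlocking successors
    return batches


# Hypothetical traffic-analysis tasks (names are assumptions for illustration).
tasks = {
    "fetch_traffic_data": set(),
    "fetch_weather_data": set(),
    "analyze_congestion": {"fetch_traffic_data", "fetch_weather_data"},
    "visualize": {"analyze_congestion"},
    "simulate": {"analyze_congestion"},
}

batches = parallel_batches(tasks)
```

In this toy graph the two fetch tasks form the first batch, the analysis runs alone in the second, and visualization and simulation share the third. A chain-based executor would instead run all five tasks one after another.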

graphtrafficgpt, large language model, machine learning, (16 more...)

2507.13511

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.89)

Industry: Transportation > Infrastructure & Services (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Data Transformation Strategies to Remove Heterogeneity

Yoo, Sangbong, Lee, Jaeyoung, Yoon, Chanyoung, Son, Geonyeong, Hong, Hyein, Seo, Seongbum, Yim, Soobin, Jung, Chanyoung, Park, Jungsoo, Kim, Misuk, Jang, Yun

arXiv.org Artificial Intelligence, Jul-18-2025

Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.

data mining, large language model, machine learning, (25 more...)

2507.12677

Country:

North America > United States > California (1.00)
Europe > Germany (0.68)
North America > Canada (0.68)
(3 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (0.92)
Information Technology (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(7 more...)

Social and Political Framing in Search Engine Results

Poudel, Amrit, Weninger, Tim

arXiv.org Artificial Intelligence, Jul-18-2025

Search engines play a crucial role in shaping public discourse by influencing how information is accessed and framed. While prior research has extensively examined various dimensions of search bias -- such as content prioritization, indexical bias, political polarization, and sources of bias -- an important question remains underexplored: how do search engines and ideologically-motivated user queries contribute to bias in search results? This study analyzes the outputs of major search engines using a dataset of political and social topics. The findings reveal that search engines not only prioritize content in ways that reflect underlying biases but also that ideologically-driven user queries exacerbate these biases, resulting in the amplification of specific narratives. Moreover, significant differences were observed across search engines in terms of the sources they prioritize. These results suggest that search engines may play a pivotal role in shaping public perceptions by reinforcing ideological divides, thereby contributing to the broader issue of information polarization.

artificial intelligence, information retrieval, natural language, (17 more...)

2507.13325

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.46)

Industry:

Media > News (1.00)
Government > Voting & Elections (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

The benefits of query-based KGQA systems for complex and temporal questions in LLM era

Alekseev, Artem, Chaichuk, Mikhail, Butko, Miron, Panchenko, Alexander, Tutubalina, Elena, Somov, Oleg

arXiv.org Artificial Intelligence, Jul-17-2025

Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System

information retrieval, large language model, machine learning, (21 more...)

doi: 10.1007/978-3-031-97141-9_29

2507.11954

Country:

North America > United States (0.46)
Europe > Russia (0.29)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Extracting Document Relations from Search Corpus by Marginalizing over User Queries

Iwamoto, Yuki, Tsunoda, Kaoru, Kaneiwa, Ken

arXiv.org Artificial Intelligence, Jul-16-2025

Understanding relationships between documents in large-scale corpora is essential for knowledge discovery and information organization. However, existing approaches rely heavily on manual annotation or predefined relationship taxonomies. We propose EDR-MQ (Extracting Document Relations by Marginalizing over User Queries), a novel framework that discovers document relationships through query marginalization. EDR-MQ is based on the insight that strongly related documents often co-occur in results across diverse user queries, enabling us to estimate joint probabilities between document pairs by marginalizing over a collection of queries. To enable this query marginalization approach, we develop Multiply Conditioned Retrieval-Augmented Generation (MC-RAG), which employs conditional retrieval where subsequent document retrievals depend on previously retrieved content. By observing co-occurrence patterns across diverse queries, EDR-MQ estimates joint probabilities between document pairs without requiring labeled training data or predefined taxonomies. Experimental results show that our query marginalization approach successfully identifies meaningful document relationships, revealing topical clusters, evidence chains, and cross-domain connections that are not apparent through traditional similarity-based methods. Our query-driven framework offers a practical approach to document organization that adapts to different user perspectives and information needs.
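
The co-occurrence insight in this abstract can be illustrated with a deliberately simplified sketch: treat every query as equally likely and estimate the joint probability of a document pair as the fraction of queries whose result list contains both documents. The uniform query weighting, the `cooccurrence_joint` helper, and the toy result lists are all assumptions made for illustration; the paper's MC-RAG conditional retrieval is not reproduced here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence_joint(
    results_per_query: Dict[str, List[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(d_i, d_j) as the share of queries retrieving both documents.

    Assumes a uniform distribution over queries (a simplification).
    Pairs are keyed as alphabetically sorted tuples.
    """
    counts = Counter()
    for docs in results_per_query.values():
        # Each unordered pair is counted at most once per query.
        for pair in combinations(sorted(set(docs)), 2):
            counts[pair] += 1
    n_queries = len(results_per_query)
    return {pair: count / n_queries for pair, count in counts.items()}


# Hypothetical retrieval results for four user queries.
results = {
    "q1": ["doc_a", "doc_b"],
    "q2": ["doc_a", "doc_b", "doc_c"],
    "q3": ["doc_c", "doc_d"],
    "q4": ["doc_a", "doc_b"],
}

joint = cooccurrence_joint(results)
```

Here `doc_a` and `doc_b` appear together in three of the four queries, so their estimated joint probability is the highest in the toy corpus, while pairs that never co-occur are simply absent from the result.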

information retrieval, machine learning, natural language, (19 more...)

2507.10726

Country:

Asia > Japan (0.28)
North America > United States (0.28)

Genre: Research Report (0.70)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Tripathi, Vishesh, Odapally, Tanmay, Das, Indraneel, Allu, Uday, Ahmed, Biddwan

arXiv.org Artificial Intelligence, Jul-15-2025

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables spanning multiple pages, embedded visual elements, and procedural content. We evaluate our approach on a curated dataset of PDF documents with manually crafted queries, demonstrating improvements in chunk quality and downstream RAG performance. Our vision-guided approach achieves better accuracy compared to traditional vanilla RAG systems, with qualitative analysis showing superior preservation of document structure and semantic coherence.

information retrieval, large language model, machine learning, (19 more...)

2506.16035

Genre: Workflow (1.00)

Industry:

Law (1.00)
Health & Medicine (1.00)
Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora

Huynh, Tuan-Luc, Vu, Thuy-Trang, Wang, Weiqing, Le, Trung, Gašević, Dragan, Li, Yuan-Fang, Do, Thanh-Toan

arXiv.org Artificial Intelligence, Jul-15-2025

Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when a significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.

information retrieval, machine learning, mixlora-dsi, (14 more...)

2507.09924

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
