AITopics

In the digital era, user interactions with various resources such as databases, data warehouses, websites, and knowledge graphs (KGs) are increasingly mediated through digital platforms. These interactions leave behind digital traces, systematically captured in the form of logs. Logs, when effectively exploited, provide high value across industry and academia, supporting critical services (e.g., recovery and security), user-centric applications (e.g., recommender systems), and quality-of-service improvements (e.g., performance optimization). Despite their importance, research on log usage remains fragmented across domains, and no comprehensive study currently consolidates existing efforts. This paper presents a systematic survey of log usage, focusing on Database (DB), Data Warehouse (DW), Web, and KG logs. More than 300 publications were analyzed to address three central questions: (1) do different types of logs share common structural and functional characteristics? (2) are there standard pipelines for their usage? (3) which constraints and non-functional requirements (NFRs) guide their exploitation? The survey reveals a limited number of end-to-end approaches, the absence of standardization across log usage pipelines, and the existence of shared structural elements among different types of logs. By consolidating existing knowledge, identifying gaps, and highlighting opportunities, this survey provides researchers and practitioners with a comprehensive overview of log usage and sheds light on promising directions for future research, particularly regarding the exploitation and democratization of KG logs.

data mining, information retrieval, machine learning

2508.13949

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Human Computer Interaction (1.00)
Information Technology > Data Science > Data Mining > Big Data (1.00)

InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems

Krastev, Matey, Hamar, Miklos, Toapanta, Danilo, Brouwers, Jesse, Lei, Yibin

This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: https://github.com/danilotpnta/IR2-project.

information retrieval, large language model, machine learning

2508.1393

Country: Europe > Netherlands (0.16)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multimodal Data Storage and Retrieval for Embodied AI: A Survey

Lu, Yihao, Tang, Hao

Embodied AI (EAI) agents continuously interact with the physical world, generating vast, heterogeneous multimodal data streams that traditional management systems are ill-equipped to handle. In this survey, we first systematically evaluate five storage architectures (Graph Databases, Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series Databases), focusing on their suitability for addressing EAI's core requirements, including physical grounding, low-latency access, and dynamic scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based Optimization), revealing a fundamental tension between achieving long-term semantic coherence and maintaining real-time responsiveness. Based on this comprehensive analysis, we identify key bottlenecks, spanning from the foundational Physical Grounding Gap to systemic challenges in cross-modal integration, dynamic adaptation, and open-world generalization. Finally, we outline a forward-looking research agenda encompassing physics-aware data models, adaptive storage-retrieval co-optimization, and standardized benchmarking, to guide future research toward principled data management solutions for EAI. Our survey is based on a comprehensive review of more than 180 related studies, providing a rigorous roadmap for designing the robust, high-performance data management frameworks essential for the next generation of autonomous embodied systems.

large language model, machine learning, real time system

2508.13901

Country: Asia > China (0.28)

Genre: Overview (1.00)

Industry:

Information Technology (0.93)
Education (0.67)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)

The Interpretability Analysis of the Model Can Bring Improvements to the Text-to-SQL Task

Zhang, Cong

Currently, AI technology is profoundly transforming the database landscape. Text-to-SQL, by innovating data provisioning to cater to the information retrieval and data analysis needs of a broader audience of everyday users, is emerging as a catalyst for propelling databases towards greater efficiency, collaboration, and intelligence. In recent years, text-to-SQL solutions leveraging large autoregressive models have continually surpassed existing methods on benchmark datasets for multi-table complex queries (Zhu et al., 2024), such as Spider (Yu et al., 2018c) and BIRD (Li et al., 2023), attributed to their exceptional natural language understanding and generation capabilities. In reality, it is highly prevalent for users of reporting systems to conduct simple queries, statistical analyses, and evaluations on consolidated single-report data derived from multi-table integration and field augmentation within databases. The single-table query dataset exemplified by WikiSQL (Zhong et al., 2017) aligns well with this application scenario. Despite its relatively straightforward syntax and lesser complexity when compared to datasets like Spider and BIRD (Deng et al., 2022), WikiSQL continues to serve as a pivotal benchmark for demonstrating the technical feasibility of converting natural language into simple SQL and validating the fundamental capabilities of models.

information retrieval, machine learning, natural language

2508.13178

Country: Asia (0.29)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

David P. Woodruff, Carnegie Mellon University, dwoodruf@cs.cmu.edu; Fred Zhang, UC Berkeley, z0@berkeley.edu; Qiuyi (Richard) Zhang, Google Brain, qiuyiz@google.com

Optimal Query Complexities for Dynamic Trace Estimation

Neural Information Processing Systems, Aug-19-2025, 13:41:09 GMT

We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly, such as during an optimization process.
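As background for the matrix-vector query model mentioned above, here is a minimal sketch of the classical Hutchinson estimator, which approximates tr(A) using only products of A with random sign vectors. It is a standard baseline, not the algorithm proposed in the paper; the helper names (`matvec`, `hutchinson_trace`) and the test matrix are assumptions made for illustration.

```python
import random


def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def hutchinson_trace(matvec_fn, n, num_queries, rng):
    """Estimate tr(A) from num_queries matrix-vector products.

    Each query draws a random sign (Rademacher) vector g and uses
    g^T (A g) as an unbiased estimate of the trace.
    """
    total = 0.0
    for _ in range(num_queries):
        g = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Ag = matvec_fn(g)  # the only access to A: one query
        total += sum(g_i * y_i for g_i, y_i in zip(g, Ag))
    return total / num_queries


# Hypothetical symmetric test matrix with a known trace.
rng = random.Random(0)
n = 50
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = float(i + 1)
    for j in range(i):
        v = rng.uniform(-0.1, 0.1)
        A[i][j] = v
        A[j][i] = v

exact = sum(A[i][i] for i in range(n))
estimate = hutchinson_trace(lambda x: matvec(A, x), n, 200, rng)
print(exact, estimate)
```

The estimator's error shrinks as the number of queries grows, which is why the query count is the natural cost measure in this setting.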

artificial intelligence, machine learning, natural language

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.52)

David P. Woodruff, Carnegie Mellon University, dwoodruf@cs.cmu.edu; Fred Zhang, UC Berkeley, z0@berkeley.edu; Qiuyi (Richard) Zhang, Google Brain, qiuyiz@google.com

Optimal Query Complexities for Dynamic Trace Estimation

Neural Information Processing Systems, Aug-19-2025, 13:41:06 GMT

artificial intelligence, machine learning, natural language

Neural Information Processing Systems

Country:

North America > United States > Texas > Brazos County > College Station (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.43)

Neural Information Processing Systems, Aug-19-2025, 00:22:00 GMT

Autoregressive Search Engines: Generating Substrings as Document Identifiers

Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Wen-tau Yih

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus.

computational linguistic, large language model, machine learning

Neural Information Processing Systems

Country:

North America > Canada (0.04)
North America > Dominican Republic (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Gregoriadis, Marcel, Kang, Jingwei, Pouwelse, Johan

A Large-Scale Web Search Dataset for Federated Online Learning to Rank

arXiv.org Artificial Intelligence, Aug-19-2025

The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynamics and undermines the realism of experimental results. We present AOL4FOLTR, a large-scale web search dataset with 2.6 million queries from 10,000 users. Our dataset addresses key limitations of existing benchmarks by including user identifiers, real click data, and query timestamps, enabling realistic user partitioning, behavior modeling, and asynchronous federated learning scenarios.

information retrieval, machine learning, natural language

doi: 10.1145/3746252.3761651

2508.12353

Country: Europe > Netherlands (0.29)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.68)
Education > Educational Setting > Online (0.63)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Nazi, Zabir Al, Hristidis, Vagelis, McLean, Aaron Lawson, Meem, Jannat Ara, Chowdhury, Md Taukir Azam

Ontology-Guided Query Expansion for Biomedical Document Retrieval using Large Language Models

arXiv.org Artificial Intelligence, Aug-19-2025

Effective Question Answering (QA) on large biomedical document collections requires effective document retrieval techniques. The latter remains a challenging task due to the domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge (definitions and relationships) from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander has superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks: NFCorpus, TREC-COVID, and SciFact, with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, BMQExpander generalizes robustly under query perturbation settings, in contrast to supervised baselines, achieving up to 15.7% improvement over the strongest baseline. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander has fewer hallucinations compared to other LLM-based query expansion baselines.
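For readers unfamiliar with the NDCG@10 metric reported above, the following is a minimal sketch of how it is commonly computed for a single query. It assumes linear gain and a log2 rank discount (some evaluation tools use an exponential gain instead); the function names and the relevance grades are illustrative and not taken from the paper.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: grades of the retrieved documents, in ranked order.
    judged_relevances: grades of all judged documents for the query,
                       used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical example: the single relevant document is retrieved at rank 3.
score = ndcg_at_k([0, 0, 3], [3], k=10)
print(score)
```

Benchmark scores are typically the mean of this per-query value over all test queries.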

large language model, machine learning, natural language

2508.11784

Country: North America > United States > California > Riverside County > Riverside (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Aug-19-2025

RRRA: Resampling and Reranking through a Retriever Adapter

Kim, Bongsu

Recent methods apply heuristics based on positive document scores to identify hard negatives, improving both performance and interpretability. However, these global, example-agnostic strategies often miss instance-specific false negatives. To address this, we propose a learnable adapter module that monitors Bi-Encoder representations to estimate the likelihood that a hard negative is actually a false negative. This probability is modeled dynamically and contextually, enabling fine-grained, query-specific judgments. The predicted scores are used in two downstream components: (1) resampling, where negatives are reweighted during training, and (2) reranking, where top-k retrieved documents are reordered at inference. Empirical results on standard benchmarks show that our adapter-enhanced framework consistently outperforms strong Bi-Encoder baselines, underscoring the benefit of explicit false negative modeling in dense retrieval.
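One way to picture the resampling component is a contrastive loss in which each hard negative is down-weighted by its predicted false-negative probability. The sketch below is an illustration under that assumption only: the abstract does not specify the weighting form, so the `1 - p` weights, the function name, and the example scores are all hypothetical.

```python
import math


def reweighted_contrastive_loss(pos_score, neg_scores, false_neg_probs):
    """Softmax contrastive loss with per-negative weights.

    Assumption: a negative with predicted false-negative probability p
    contributes with weight (1 - p), so likely false negatives are
    penalized less. With all p = 0 this reduces to the usual
    InfoNCE-style loss.
    """
    weights = [1.0 - p for p in false_neg_probs]
    m = max([pos_score] + list(neg_scores))  # shift for numerical stability
    pos_term = math.exp(pos_score - m)
    neg_term = sum(w * math.exp(s - m) for w, s in zip(weights, neg_scores))
    return -math.log(pos_term / (pos_term + neg_term))


# Hypothetical similarity scores for one query.
pos, negs = 2.0, [1.8, 0.5]
plain = reweighted_contrastive_loss(pos, negs, [0.0, 0.0])
softened = reweighted_contrastive_loss(pos, negs, [0.9, 0.0])
print(plain, softened)
```

In this sketch, flagging the first negative as a probable false negative lowers the loss it induces, so the encoder is pushed away from it less strongly.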

information retrieval, machine learning, natural language

2508.1167

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)