AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Rand-NSG: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Suhas Jayaram Subramanya, Fnu Devvrit, Harsha Vardhan Simhadri, Ravishankar Krishnawamy, Rohan Kadekodi

Neural Information Processing SystemsOct-2-2025, 00:52:49 GMT

This makes them expensive and limits the size of the dataset.

information retrieval, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.65)

Add feedback

Taxonomy of Comprehensive Safety for Clinical Agents

Seo, Jean, Lee, Hyunkyung, Kim, Gibaeg, Han, Wooseok, Yoo, Jaehyo, Lim, Seungseop, Shin, Kihun, Yang, Eunho

arXiv.org Artificial IntelligenceOct-2-2025

Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods--such as guardrails and tool calling--often fall short in addressing the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. TACOS is a taxonomy that can cover a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our taxonomy, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a new taxonomy specialized for clinical agent settings, and reveal useful insights about train data distribution and pretrained knowledge of base models.

information retrieval, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2509.22041

Genre: Research Report > New Finding (0.86)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Law (0.93)
Information Technology > Security & Privacy (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)

Add feedback

Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs

Goworek, Roksana, Macmillan-Scott, Olivia, Özyiğit, Eda B.

arXiv.org Artificial IntelligenceOct-2-2025

Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.00908

Country:

Europe (1.00)
Asia > Middle East (1.00)
North America > United States > California (0.67)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Education (1.00)
Government (0.67)
Media > News (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector

Nguyen, Thong, Lei, Yibin, Ju, Jia-Huei, Yang, Eugene, Yates, Andrew

arXiv.org Artificial IntelligenceOct-2-2025

Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions. Learned Sparse Retrieval (LSR)(MacAvaney et al., 2020; Formal et al., 2021; Nguyen et al., 2023) represents queries and documents as sparse lexical embeddings and retains the scalability benefits of bi-encoders. Unlike dense methods, LSR aligns representation with a natural language vocabulary, yielding transparent representations that facilitate error tracing and bias inspection. LSR naturally supports dynamic post-hoc pruning at inference time (Bruch et al., 2024), providing Matryoshka-like latency control (Kusupati et al., 2022) without requiring auxiliary training objectives. Empirically, LSR (Lassance et al., 2024; Lei et al., 2025) is competitive on benchmarks like BEIR (Thakur et al., 2021) and MTEB (Enevoldsen et al., 2025).

large language model, machine learning, milco, (17 more...)

arXiv.org Artificial Intelligence

2510.00671

Genre: Research Report (0.50)

Industry: Law (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.88)

Add feedback

VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

Wang, Ting-Kang, Peng, Yueh-Po, Su, Li, Cheung, Vincent K. M.

arXiv.org Artificial IntelligenceOct-1-2025

While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords its distinct palette of timbres for maximal emotional impact. Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight, end-to-end model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.

information retrieval, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.23759

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report (0.40)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Add feedback

Combining Knowledge Graphs and NLP to Analyze Instant Messaging Data in Criminal Investigations

Pozzi, Riccardo, Barbera, Valentina, Principe, Renzo Alva, Giardini, Davide, Rubini, Riccardo, Palmonari, Matteo

arXiv.org Artificial IntelligenceOct-1-2025

Criminal investigations often involve the analysis of messages exchanged through instant messaging apps such as WhatsApp, which can be an extremely effort-consuming task. Our approach integrates knowledge graphs and NLP models to support this analysis by semantically enriching data collected from suspects' mobile phones, and help prosecutors and investigators search into the data and get valuable insights. Our semantic enrichment process involves extracting message data and modeling it using a knowledge graph, generating transcriptions of voice messages, and annotating the data using an end-to-end entity extraction approach. We adopt two different solutions to help users get insights into the data, one based on querying and visualizing the graph, and one based on semantic search. The proposed approach ensures that users can verify the information by accessing the original data. While we report about early results and prototypes developed in the context of an ongoing project, our proposal has undergone practical applications with real investigation data. As a consequence, we had the chance to interact closely with prosecutors, collecting positive feedback but also identifying interesting opportunities as well as promising research directions to share with the research community.

information retrieval, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-981-96-0567-5_30

2509.26487

Country:

Europe > Italy (0.28)
Europe > Austria (0.28)

Genre: Research Report (0.82)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
Information Technology (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
(2 more...)

Add feedback

DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation

Esakkiraja, Esakkivel, Akhiyarov, Denis, Shanmugham, Aditya, Ganapathy, Chitra

arXiv.org Artificial IntelligenceOct-1-2025

Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that capture the challenge of unclear API usage intent in the code. Our evaluation metrics show that this method achieves 87.86% top-40 retrieval accuracy, allowing the critical context with APIs needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model while maintaining 2.5x reduced latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.

information retrieval, large language model, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2509.25716

Country: Europe (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (0.82)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Efficient Optimization for Average Precision SVM

Neural Information Processing SystemsSep-30-2025, 10:51:10 GMT

The accuracy of information retrieval systems is often measured using average precision (AP). Given a set of positive (relevant) and negative (non-relevant) samples, the parameters of a retrieval system can be estimated using the AP-SVM framework, which minimizes a regularized convex upper bound on the empirical AP loss. However, the high computational complexity of loss-augmented inference, which is required for learning an AP-SVM, prohibits its use with large training datasets. To alleviate this deficiency, we propose three complementary approaches.

efficient optimization, loss-augmented inference, name change, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.60)

Add feedback

Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets

Bruch, Sebastian, Nardini, Franco Maria, Rulli, Cosimo, Venturini, Rossano

arXiv.org Artificial IntelligenceSep-30-2025

Sparse embeddings of data form an attractive class due to their inherent interpretability: Every dimension is tied to a term in some vocabulary, making it easy to visually decipher the latent space. Sparsity, however, poses unique challenges for Approximate Nearest Neighbor Search (ANNS) which finds, from a collection of vectors, the k vectors closest to a query. To encourage research on this underexplored topic, sparse ANNS featured prominently in a BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on large benchmark datasets by throughput and accuracy. In this work, we introduce a set of novel data structures and algorithmic methods, a combination of which leads to an elegant, effective, and highly efficient solution to sparse ANNS. Our contributions range from a theoretically-grounded sketching algorithm for sparse vectors to reduce their effective dimensionality while preserving inner product-induced ranks; a geometric organization of the inverted index; and the blending of local and global information to improve the efficiency and efficacy of ANNS. Empirically, our final algorithm, dubbed Seismic, reaches sub-millisecond per-query latency with high accuracy on a large-scale benchmark dataset using a single CPU.

information retrieval, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2509.24815

Country:

North America > United States > Florida > Hillsborough County > University (0.40)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)
Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
(17 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Add feedback

Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement

Dao, An, Tran, Vu, Nguyen, Le-Minh, Matsumoto, Yuji

arXiv.org Artificial IntelligenceSep-30-2025

We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.24283

Country: Asia (0.14)

Genre:

Research Report (0.82)
Overview (0.68)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback