AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Introducing the A2AJ's Canadian Legal Data: An open-source alternative to CanLII for the era of computational law

Wallace, Simon, Rehaag, Sean

arXiv.org Artificial Intelligence, Sep-17-2025

The Access to Algorithmic Justice project (A2AJ) is an open-source alternative to the Canadian Legal Information Institute (CanLII). At a moment when technology promises to enable new ways of working with law, CanLII is becoming an impediment to the free access to law and access to justice movements because it restricts bulk and programmatic access to Canadian legal data. This means that Canada is staring down a digital divide: well-resourced actors have the best new technological tools and, because CanLII has disclaimed leadership, the public only gets second-rate tools. This article puts CanLII in its larger historical context and shows how long and deep efforts to democratize access to Canadian legal data are, and how often they are thwarted by private industry. We introduce the A2AJ's Canadian Legal Data project, which provides open access to over 116,000 court decisions and 5,000 statutes through multiple channels including APIs, machine learning datasets, and AI integration protocols. Through concrete examples, we demonstrate how open legal data enables courts to conduct evidence-based assessments and allows developers to create tools for practitioners serving low-income communities.

information retrieval, machine learning, natural language, (19 more...)

arXiv: 2509.13032

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > Canada > Ontario (0.05)
North America > United States > Arkansas (0.04)
(6 more...)

Genre: Research Report (0.40)

Industry:

Law > Statutes (1.00)
Government > Regional Government > North America Government > Canada Government (0.94)
Government > Regional Government > North America Government > United States Government (0.93)
(2 more...)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
(2 more...)

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

Vujanic, Robin, Rueckstiess, Thomas

arXiv.org Artificial Intelligence, Sep-17-2025

We present LEAF ("Lightweight Embedding Alignment Framework"), a knowledge distillation framework for text embedding models. A key distinguishing feature is that our distilled leaf models are aligned to their teacher. In the context of information retrieval, this allows for flexible asymmetric architectures where documents are encoded with the larger teacher model, while queries can be served with the smaller leaf models. We also show that leaf models automatically inherit Matryoshka Representation Learning (MRL) and robustness to output quantization whenever these properties are present in the teacher model, without explicitly training for them. To demonstrate the capability of our framework we publish leaf-ir, a 23M-parameter, information-retrieval-oriented text embedding model trained using LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the public leaderboard for this benchmark and for models of its size. When run in asymmetric mode, its retrieval performance is further increased. Our scheme is, however, not restricted to the information retrieval setting, and we demonstrate its wider applicability by synthesizing the multi-task leaf-mt model. This also sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its size. LEAF is applicable to black-box models and, in contrast to other embedding model training frameworks, it requires neither judgments nor hard negatives, and training can be conducted using small batch sizes. Thus, dataset and training infrastructure requirements for our framework are modest. We make our models publicly available under a permissive Apache 2.0 license.
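The asymmetric setup described in this abstract (documents embedded with the teacher, queries embedded with a smaller model aligned to the same space) can be illustrated with a toy sketch. Nothing below comes from the LEAF release: `teacher_encode`, `student_encode`, `build_index`, `search`, and the six-word vocabulary are hypothetical stand-ins, chosen only so that both encoders emit vectors in one shared space.

```python
import math

# Hypothetical six-word vocabulary; real embedding models learn dense features.
VOCAB = ["court", "statute", "piano", "audio", "query", "graph"]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def teacher_encode(text):
    """Stand-in for the large teacher: term counts over VOCAB."""
    words = text.lower().split()
    return _normalize([float(words.count(w)) for w in VOCAB])


def student_encode(text):
    """Stand-in for the small leaf model: cheaper binary features,
    but emitted in the same vector space as the teacher."""
    words = set(text.lower().split())
    return _normalize([1.0 if w in words else 0.0 for w in VOCAB])


def build_index(docs):
    # Documents are embedded once, offline, with the teacher.
    return [(doc, teacher_encode(doc)) for doc in docs]


def search(index, query, top_k=2):
    # Queries are embedded online with the student.
    q = student_encode(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


docs = [
    "court statute court ruling",
    "piano audio piano recording",
    "graph query planner",
]
index = build_index(docs)
print(search(index, "piano audio", top_k=1))  # ['piano audio piano recording']
```

The point of the sketch is only that dot products between student query vectors and teacher document vectors are meaningful when the two encoders share an output space; how LEAF achieves that alignment is described in the paper, not here.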

information retrieval, large language model, machine learning, (18 more...)

arXiv: 2509.12539

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Query-Focused Extractive Summarization for Sentiment Explanation

Moubtahij, Ahmed, Ratté, Sylvie, Attabi, Yazid, Dumas, Maxime

arXiv.org Artificial Intelligence, Sep-16-2025

Constructive analysis of feedback from clients often requires determining the cause of their sentiment from a substantial amount of text documents. To assist and improve the productivity of such endeavors, we leverage the task of Query-Focused Summarization (QFS). Models of this task are often impeded by the linguistic dissonance between the query and the source documents. We propose and substantiate a multi-bias framework to help bridge this gap at a domain-agnostic, generic level; we then formulate specialized approaches for the problem of sentiment explanation through sentiment-based biases and query expansion. We achieve experimental results outperforming baseline models on a real-world proprietary sentiment-aware QFS dataset.

computational linguistic, large language model, machine learning, (13 more...)

arXiv: 2509.11989

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Powering Job Search at Scale: LLM-Enhanced Query Understanding in Job Matching Systems

Liu, Ping, Shen, Jianqiang, Shen, Qianqi, Yao, Chunnan, Kao, Kevin, Xu, Dan, Arora, Rajat, Zheng, Baofen, Johnson, Caleb, Hong, Liangjie, Wu, Jingwei, Zhang, Wenjing

arXiv.org Artificial Intelligence, Sep-15-2025

Query understanding is essential in modern relevance systems, where user queries are often short, ambiguous, and highly context-dependent. Traditional approaches often rely on multiple task-specific Named Entity Recognition models to extract structured facets as seen in job search applications. However, this fragmented architecture is brittle, expensive to maintain, and slow to adapt to evolving taxonomies and language patterns. In this paper, we introduce a unified query understanding framework powered by a Large Language Model (LLM), designed to address these limitations. Our approach jointly models the user query and contextual signals such as profile attributes to generate structured interpretations that drive more accurate and personalized recommendations. The framework improves relevance quality in online A/B testing while significantly reducing system complexity and operational overhead. The results demonstrate that our solution provides a scalable and adaptable foundation for query understanding in dynamic web applications.

information retrieval, large language model, natural language, (15 more...)

doi: 10.1145/3746252.3760913

arXiv: 2509.0969

Country: North America > United States (0.51)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Services (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs

Li, Mingyang, Schlegel, Viktor, Mu, Tingting, Del-Pinto, Warren, Nenadic, Goran

arXiv.org Artificial Intelligence, Sep-15-2025

Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient's condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline.

information retrieval, machine learning, natural language, (19 more...)

arXiv: 2509.09699

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Providers & Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.85)

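Inteligencia Artificial jurídica y el desafío de la veracidad: análisis de alucinaciones, optimización de RAG y principios para una integración responsable [Legal Artificial Intelligence and the Challenge of Veracity: Analysis of Hallucinations, RAG Optimization, and Principles for Responsible Integration]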

Dantart, Alex

arXiv.org Artificial Intelligence, Sep-12-2025

This technical report analyzes the challenge of "hallucinations" (false information) in LLMs applied to law. It examines their causes, manifestations, and the effectiveness of the RAG mitigation strategy, highlighting its limitations and proposing holistic optimizations. The paper explores the ethical and regulatory implications, emphasizing human oversight as an irreplaceable role. It concludes that the solution lies not in incrementally improving generative models, but in adopting a "consultative" AI paradigm that prioritizes veracity and traceability, acting as a tool to amplify, not replace, professional judgment.

information retrieval, large language model, machine learning, (20 more...)

arXiv: 2509.09467

Country:

Europe (1.00)
North America > United States > California (0.45)

Genre:

Overview (0.92)
Research Report > New Finding (0.45)

Industry:

Law > Statutes (1.00)
Law > Litigation (1.00)
Law > Government & the Courts (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

QCardEst/QCardCorr: Quantum Cardinality Estimation and Correction

Winker, Tobias, Groppe, Jinghua, Groppe, Sven

arXiv.org Artificial Intelligence, Sep-11-2025

Cardinality estimation is an important part of query optimization in DBMS. We develop a Quantum Cardinality Estimation (QCardEst) approach using Quantum Machine Learning with a Hybrid Quantum-Classical Network. We define a compact encoding for turning SQL queries into a quantum state, which requires only as many qubits as there are tables in the query. This allows the processing of a complete query with a single variational quantum circuit (VQC) on current hardware. In addition, we compare multiple classical post-processing layers to turn the probability vector output of the VQC into a cardinality value. We introduce Quantum Cardinality Correction (QCardCorr), which improves classical cardinality estimators by multiplying their output by a factor generated by a VQC. With QCardCorr, we have an improvement over the standard PostgreSQL optimizer of 6.37 times for JOB-light and 8.66 times for STATS. For JOB-light we even outperform MSCN by a factor of 3.47.
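The multiplicative correction step described in this abstract can be sketched classically. This is an illustration under stated assumptions, not the authors' method: `toy_correction_factor` is a hypothetical stand-in for the VQC output, the numbers are made up, and q-error is used here only because it is a common way to score cardinality estimates; the abstract does not say which metric its improvement factors refer to.

```python
def q_error(estimate, truth):
    """Symmetric ratio error, >= 1; 1.0 means a perfect estimate."""
    estimate = max(estimate, 1.0)  # guard against zero estimates
    truth = max(truth, 1.0)
    return max(estimate / truth, truth / estimate)


def corrected_cardinality(classical_estimate, correction_factor):
    """Apply a multiplicative correction to a classical estimate."""
    return classical_estimate * correction_factor


def toy_correction_factor(num_tables):
    # Hypothetical stand-in for the VQC output: assumes the classical
    # optimizer underestimates more as joins are added.
    return 2.0 ** (num_tables - 1)


classical = 500.0  # a classical estimate for a 3-table join (made-up number)
truth = 2000.0     # true result size (made-up number)
fixed = corrected_cardinality(classical, toy_correction_factor(3))

print(q_error(classical, truth))  # 4.0
print(q_error(fixed, truth))      # 1.0
```

Keeping the classical estimator and learning only a correction factor means the learned component has to model the estimator's error rather than the full cardinality.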

artificial intelligence, machine learning, natural language, (16 more...)

arXiv: 2509.08817

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.82)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

PianoVAM: A Multimodal Piano Performance Dataset

Kim, Yonghyun, Park, Junhyung, Bae, Joonhyung, Kim, Kirak, Kwon, Taegyun, Lerch, Alexander, Nam, Juhan

arXiv.org Artificial Intelligence, Sep-11-2025

The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.

information retrieval, machine learning, natural language, (17 more...)

arXiv: 2509.088

Country: Asia > South Korea (0.14)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.68)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.66)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.35)

ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

Chen, Zihan, Shi, Lei, Wu, Weize, Zhou, Qiji, Zhang, Yue

arXiv.org Artificial Intelligence, Sep-10-2025

Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, with the same trend being observed on all-spectrum NLP tasks. The prevailing entity recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs significant cost. To achieve the best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to the method annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.

information retrieval, large language model, machine learning, (19 more...)

arXiv: 2509.07512

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Benchmarking Information Retrieval Models on Complex Retrieval Tasks

Killingback, Julian, Zamani, Hamed

arXiv.org Artificial Intelligence, Sep-10-2025

Large language models (LLMs) are incredible and versatile tools for text-based tasks that have enabled countless, previously unimaginable, applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements in natural language. These tasks represent a natural progression from the simple, single-aspect queries that are used in the vast majority of existing, commonly used evaluation sets. Complex queries naturally arise as people expect search systems to handle more specific and often ambitious information requests, as is demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities in complex retrieval tasks, there exist limited resources to assess the ability of retrieval models on a comprehensive set of diverse complex tasks. The few resources that do exist feature a limited scope and often lack realistic settings making it hard to know the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results with the highest average nDCG@10 of only 0.346 and R@100 of only 0.587 across all tasks. Although LLM augmentation can help weaker models, the strongest model has decreased performance across all metrics with all rewriting techniques.
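For readers unfamiliar with the two metrics quoted in this abstract, a minimal sketch of nDCG@k and Recall@k follows. It assumes the linear-gain form of DCG (gain divided by a log2 position discount); some evaluation toolkits use an exponential gain instead, and the abstract does not say which variant the benchmark uses. The document ids and relevance labels are made up.

```python
import math


def dcg(gains):
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc id -> graded gain; missing ids count as 0."""
    gains = [relevance.get(d, 0.0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevance, k=100):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {d for d, g in relevance.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


relevance = {"d1": 1.0, "d3": 1.0}    # two relevant documents
ranking = ["d1", "d2", "d3", "d4"]    # a system's ranked output

print(round(ndcg_at_k(ranking, relevance, k=10), 4))  # 0.9197
print(recall_at_k(ranking, relevance, k=2))           # 0.5
```

Benchmark scores such as those quoted above are typically the average of these per-query values over all queries in a task.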

large language model, machine learning, natural language, (18 more...)

arXiv: 2509.07253

Country:

Europe (1.00)
North America > United States > Maryland (0.28)
North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Media > Film (0.92)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.65)
