AITopics | bm25

Collaborating Authors

bm25

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situation). In this paper, we address this gap. We propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval), which is a unique corpus that provides a common testbed for developing models for both the tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models and ensemble based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.

large language model, machine learning, precedent, (20 more...)

arXiv.org Artificial Intelligence

2511.00268

Country: Asia > India (0.68)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry:

Law > Statutes (0.68)
Government > Regional Government > Asia Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
(2 more...)

Add feedback

Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

Galimzyanov, Timur, Kolomyttseva, Olga, Bogomolov, Egor

arXiv.org Artificial IntelligenceOct-24-2025

We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.20609

Country:

Asia (0.46)
North America > United States (0.46)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

Add feedback

Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

Wei, Yubai, Han, Jiale, Yang, Yi

arXiv.org Artificial IntelligenceOct-22-2025

Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at https://github.com/BaileyWei/BMEmbed for the research community.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.findings-acl.357

2506.00363

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.93)

Genre: Research Report > New Finding (0.68)

Industry: Retail (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

B.2 Hierarchical k -means for semantic identifier The pseudo code of hierarchical k -means is detailed in in Algorithm 1. Algorithm 1: Hierarchical k -means.

Add feedback

MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

Jing, Huihao, Hu, Wenbin, Luo, Hongyu, Yang, Jianhui, Fan, Wei, Li, Haoran, Song, Yangqiu

arXiv.org Artificial IntelligenceOct-1-2025

Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

agent, artificial intelligence, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.24922

Country: Asia (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

How Do LLM-Generated Texts Impact Term-Based Retrieval Models?

Huang, Wei, Bi, Keping, Cai, Yinqiong, Chen, Wei, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceAug-26-2025

As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.

conferenceacronym, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2508.17715

Country:

Europe (0.46)
Asia > China (0.29)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

) or not using () supervised/unsupervised data. Reporting accuracy @ k. System Sup. Unsup. A@20 A@100 BM25 - - 62.9 78.1 SEAL 76.2 86.3 (LM+FM74.8 85.4 intersective) 61.7 76.3

Neural Information Processing SystemsAug-19-2025, 00:22:04 GMT

Table 10: Retrieval results on the KIL T test set(s). Reporting page-level R-precision (higher is better). Results are taken from the eval.ai Table 11: KIL T scores on the KIL T test set(s). Results are taken from the eval.ai

artificial intelligence, machine learning, supervised unsupervised data, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.33)

Add feedback

Filters

Collaborating Authors

bm25

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

96671501524948bc3937b4b30d0e57b9-Supplemental.pdf

cd88d62a2063fdaf7ce6f9068fb15dcd-Supplemental-Conference.pdf

a46156bd3579c3b268108ea6aca71d13-Supplemental-Conference.pdf

IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval

Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval

A Related work

MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

How Do LLM-Generated Texts Impact Term-Based Retrieval Models?

) or not using () supervised/unsupervised data. Reporting accuracy @ k. System Sup. Unsup. A@20 A@100 BM25 - - 62.9 78.1 SEAL 76.2 86.3 (LM+FM74.8 85.4 intersective) 61.7 76.3