Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation

Allamraju, Aparajitha, Chitale, Maitreya Prafulla, Adibhatla, Hiranmai Sri, Mishra, Rahul, Shrivastava, Manish

arXiv.org Artificial Intelligence

Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements (~24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response-time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out-of-domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
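As a generic illustration of the similarity-driven segmentation this line of work studies (not the paper's PSC or MFC algorithms, whose details are not given here), a minimal sketch of semantic chunking: split a document at sentence boundaries where embedding similarity drops. The toy bag-of-words embedding below stands in for a neural encoder; the `threshold` value is an illustrative assumption.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy bag-of-words embedding; a real system would use a neural sentence encoder."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(sentences, threshold=0.2):
    """Start a new chunk wherever similarity to the previous sentence falls below threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With a real encoder, the same loop applies unchanged; only `embed` and `threshold` differ.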


Passage Segmentation of Documents for Extractive Question Answering

Liu, Zuhong, Simon, Charles-Elie, Caspani, Fabien

arXiv.org Artificial Intelligence

Retrieval-Augmented Generation (RAG) has proven effective in open-domain question answering. However, the chunking process, which is essential to this pipeline, often receives insufficient attention relative to retrieval and synthesis components. This study emphasizes the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline. We then introduce the Logits-Guided Multi-Granular Chunker (LGMGC), a novel framework that splits long documents into contextualized, self-contained chunks of varied granularity. Our experimental results, evaluated on two benchmark datasets, demonstrate that LGMGC not only improves the retrieval step but also outperforms existing chunking methods when integrated into a RAG pipeline.


Is Semantic Chunking Worth the Computational Cost?

Qu, Renyi, Tu, Ruixuan, Bao, Forrest

arXiv.org Artificial Intelligence

Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefits over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remain unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge the previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
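For contrast, the fixed-size baseline the study compares against can be sketched in a few lines: consecutive windows of a fixed token count, optionally overlapping so context is not cut mid-thought. The window and overlap sizes below are illustrative assumptions, not values from the paper.

```python
def fixed_size_chunks(tokens, size=4, overlap=1):
    """Split a token list into consecutive windows of `size`, sliding by size - overlap.
    The final chunk may be shorter than `size`."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

Its appeal is exactly what the abstract highlights: it needs no embedding passes at chunking time, so any gain from semantic chunking must pay for that extra computation.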


Joint Word Segmentation, POS-Tagging and Syntactic Chunking

Lyu, Chen (Wuhan University) | Zhang, Yue (Singapore University of Technology and Design) | Ji, Donghong (Wuhan University)

AAAI Conferences

Chinese chunking has traditionally been solved by assuming gold standard word segmentation. We find that the accuracies drop drastically when automatic segmentation is used. Inspired by the fact that chunking knowledge can potentially improve segmentation, we explore a joint model that performs segmentation, POS-tagging and chunking simultaneously. In addition, to address the sparsity of full chunk features, we employ a semi-supervised method to derive chunk cluster features from large-scale automatically-chunked data. Results show the effectiveness of the joint model with semi-supervised features.


Rule Representations in a Connectionist Chunker

Touretzky, David S., III, Gillette Elvgreen

Neural Information Processing Systems

We present two connectionist architectures for chunking of symbolic rewrite rules. One uses backpropagation learning, the other competitive learning. Although they were developed for chunking the same sorts of rules, the two differ in their representational abilities and learning behaviors.

