SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval

Huang, Xinhao, Ren, Zhibo, Yu, Yipeng, Zhou, Ying, Chen, Zulong, Wen, Zeyi

Sep-3-2025–arXiv.org Artificial Intelligence

In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets that lack explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It combines structure-aware learning, which preserves semantic hierarchies, with masked element alignment, which supports fine-grained semantic discrimination. Furthermore, we release a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
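For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not taken from the paper: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the linear-gain form of DCG is assumed (some evaluations use an exponential gain instead), and the input is assumed to hold the relevance labels of all judged documents in the order the retriever ranked them.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain, assumed)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k for one query.

    Assumes `ranked_relevances` lists the labels of every judged document
    in ranked order, so the ideal ordering is simply the sorted list.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    # One relevant document ranked second out of three candidates.
    print(ndcg_at_k([0, 1, 0]))
```

A corpus-level score such as the percentages quoted above is typically the mean of this per-query value over all test queries, though the exact evaluation protocol used in the paper is not stated in this abstract.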

information retrieval, large language model, machine learning

Country:
- Asia > China > Guangdong Province (0.14)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Information Retrieval (0.68)
    - Text Processing (0.68)
    - Large Language Model (0.47)