Patent Representation Learning via Self-supervision
You Zuo, Kim Gerdes, Eric Villemonte de La Clergerie, Benoît Sagot
arXiv.org Artificial Intelligence
This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification, highlighting the value of patents' inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
Nov-17-2025