The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Jan-19-2025, 19:58:41 GMT–Neural Information Processing Systems

Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Though the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, machine learning offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike other NLP patent datasets, HUPD contains the inventor-submitted versions of patent applications, not the final versions of granted patents, allowing us to study patentability at the time of filing using NLP methods for the first time.

harvard uspto patent dataset, multi-purpose corpus, patent application, (7 more...)

Neural Information Processing Systems

Jan-19-2025, 19:58:41 GMT

Conferences Web Page

Add feedback

Country:
- North America > United States (1.00)

Industry:
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government
  - North America Government > United States Government (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.77)
  - Natural Language (0.56)