Hierarchical Multi-Label Classification of Scientific Documents

Nov-5-2022–arXiv.org Artificial Intelligence

Automatic topic classification has been studied extensively as a way to help manage and index scientific documents in digital collections. As the number of available topics has grown in recent years, it has become necessary to arrange them in a hierarchy, so automatic classification systems need to classify documents hierarchically. In addition, a paper is often assigned to more than one relevant topic; for example, it can be assigned to several topics in a hierarchy tree. In this paper, we introduce SciHTC, a new dataset for hierarchical multi-label text classification (HMLTC) of scientific papers, which contains 186,160 papers and 1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC and propose a multi-task learning approach for topic classification with keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities in hierarchical scientific topic classification. We make our dataset and code available on GitHub.
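To make the reported metric concrete, here is a minimal sketch of Macro-F1 for multi-label predictions: F1 is computed separately for each category and then averaged without weighting, so rare categories count as much as frequent ones. The label names, the toy documents, and the choice to score a category with no gold and no predicted instances as 0.0 are assumptions made for illustration; the abstract does not state the paper's exact evaluation protocol.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over `labels`."""
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that is never gold and never predicted scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical two-level hierarchy: "A" is the parent of "A.1" and "A.2".
labels = ["A", "A.1", "A.2", "B"]
gold = [{"A", "A.1"}, {"B"}, {"A", "A.2"}]
pred = [{"A", "A.1"}, {"A"}, {"A"}]

print(round(macro_f1(gold, pred, labels), 4))
```

In this toy example the parent category "A" is predicted well, but the two categories that are never predicted pull the average down, which is one reason Macro-F1 can stay low on a label set with many sparsely populated categories.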

machine learning, natural language, text classification

Country:
- Oceania > Australia
  - Victoria > Melbourne (0.04)
- North America
  - Canada (0.04)
  - United States
    - New York > New York County
      - New York City (0.04)
    - New Mexico > Santa Fe County
      - Santa Fe (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Illinois > Cook County
      - Chicago (0.04)
    - California
      - San Diego County > San Diego (0.04)
      - Santa Clara County > Palo Alto (0.04)
- Europe > Italy
  - Tuscany > Florence (0.04)
- Asia > China
  - Hong Kong (0.04)

Genre:
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Classification (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
