Hierarchical Multi-Label Classification of Scientific Documents
Sadat, Mobashir, Caragea, Cornelia
arXiv.org Artificial Intelligence
Automatic topic classification has been studied extensively as a way to help manage and index scientific documents in digital collections. As the number of available topics has grown in recent years, it has become necessary to arrange them in a hierarchy, so automatic classification systems must be able to classify documents hierarchically. In addition, a paper is often assigned to more than one relevant topic; for example, it can be assigned to several topics in a hierarchy tree. In this paper, we introduce SciHTC, a new dataset for hierarchical multi-label text classification (HMLTC) of scientific papers, which contains 186,160 papers and 1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC and propose a multi-task learning approach for topic classification with keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities in hierarchical scientific topic classification. We make our dataset and code available on GitHub.
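To make the reported metric concrete, the following is a minimal sketch of how a Macro-F1 score can be computed for hierarchical multi-label predictions. It is not the authors' evaluation code: the toy `PARENT` hierarchy, the label names, and the helpers `with_ancestors` and `macro_f1` are all hypothetical, and the choice to expand each label set with its ancestors before scoring is an assumption made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (child -> parent); NOT taken from the ACM CCS tree.
PARENT: Dict[str, str] = {
    "machine_learning": "computing_methodologies",
    "neural_networks": "machine_learning",
    "security": "security_and_privacy",
}


def with_ancestors(labels: Set[str], parent: Dict[str, str]) -> Set[str]:
    """Return the labels plus every ancestor reachable through `parent`."""
    closed = set(labels)
    for label in labels:
        node = label
        while node in parent:
            node = parent[node]
            closed.add(node)
    return closed


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set().union(*gold, *pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two toy papers: the first prediction stops one level short of the gold leaf.
gold = [with_ancestors({"neural_networks"}, PARENT),
        with_ancestors({"security"}, PARENT)]
pred = [with_ancestors({"machine_learning"}, PARENT),
        with_ancestors({"security"}, PARENT)]
score = macro_f1(gold, pred)
```

Because Macro-F1 weights every category equally, rare categories that are never predicted correctly pull the average down as much as frequent ones, which is one reason scores on a label set with over a thousand categories can be low.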
Nov-5-2022
- Country:
  - Oceania > Australia
  - North America
    - Canada (0.04)
    - United States
      - New York > New York County
        - New York City (0.04)
      - New Mexico > Santa Fe County
        - Santa Fe (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
      - Illinois > Cook County
        - Chicago (0.04)
      - California
        - San Diego County > San Diego (0.04)
        - Santa Clara County > Palo Alto (0.04)
  - Europe > Italy
  - Asia > China
    - Hong Kong (0.04)
- Genre:
  - Research Report (1.00)
- Industry:
  - Information Technology > Security & Privacy (0.68)
- Technology: