Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

Oct-8-2021–arXiv.org Artificial Intelligence

In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA) and the Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification and using the standard precision/recall/F1-measure metrics. The results give insight into how different math representations may influence the performance of the classification and similarity search tasks in STEM repositories. Non-surprisingly, machine learning methods are able to grab distributional semantics from textual tokens. A proper selection of weighted tokens representing math may improve the quality of the results slightly. A structured math representation that imitates successful text-processing techniques with math is shown to yield better results than flat TeX tokens.

classification, mterm, representation, (14 more...)

arXiv.org Artificial Intelligence

Oct-8-2021

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Pennsylvania > Allegheny County
    - Pittsburgh (0.04)
  - California > Santa Clara County
    - Mountain View (0.04)
- Europe
  - Middle East > Malta (0.04)
  - Germany (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
    - West Midlands > Birmingham (0.04)
  - Italy > Tuscany
    - Pisa Province > Pisa (0.04)
  - Czechia > South Moravian Region
    - Brno (0.04)
- Asia > Japan
  - Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Text Processing (1.00)
    - Text Classification (1.00)