An Analysis of Hierarchical Text Classification Using Word Embeddings

Stein, Roger A.; Jaques, Patricia A.; Valiati, Joao F.

Sep-5-2018, arXiv.org Artificial Intelligence

Efficient distributed numerical word representation models (word embeddings), combined with modern machine learning algorithms, have recently yielded considerable improvements in automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an ${}_{LCA}F_1$ of 0.893 on a single-labeled version of the RCV1 dataset. Our analysis indicates that using word embeddings and their variants is a very promising approach for HTC.

classification, machine learning, natural language

Country:
- South America > Brazil
  - Rio Grande do Sul > Porto Alegre (0.04)
- North America > United States
  - District of Columbia > Washington (0.04)
  - New York > New York County
    - New York City (0.04)
  - California
    - San Francisco County > San Francisco (0.14)
    - Santa Clara County > Palo Alto (0.04)
    - San Diego County > San Diego (0.04)
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Greece > Attica
    - Athens (0.04)
  - France > Auvergne-Rhône-Alpes
    - Isère > Grenoble (0.04)

Genre:
- Research Report
  - New Finding (1.00)
  - Promising Solution (0.87)

Industry:
- Law (0.92)
- Health & Medicine (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Classification (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (1.00)
    - Performance Analysis > Accuracy (0.93)
    - Statistical Learning
      - Support Vector Machines (0.68)
      - Regression (0.67)
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.46)
