Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

Ribadas-Pena, Francisco J.; Cao, Shuyuan; Bilbao, Víctor M. Darriba

Feb-2-2024–arXiv.org Artificial Intelligence

In this paper, we introduce a multi-label lazy learning approach to automatic semantic indexing in large document collections whose label vocabularies are complex, structured, and highly correlated. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm that uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
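The idea in the abstract can be illustrated with a minimal sketch, not the authors' implementation. It assumes a tiny autoencoder with a linear encoder and a sigmoid decoder, trained by plain SGD on multi-hot label vectors, and it assumes that the neighbors' labels are combined by averaging their latent codes before decoding; the abstract does not state how the combination is done. All names (`train_label_autoencoder`, `knn_predict`, `DOCS`, `LABELS`), the toy data, and the hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(enc, y):
    # Linear projection of a multi-hot label vector into the latent space.
    return [sum(w * v for w, v in zip(row, y)) for row in enc]


def decode(dec, bias, z):
    # Sigmoid decoder: one probability per label.
    return [sigmoid(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(dec, bias)]


def train_label_autoencoder(label_rows, latent_dim, epochs=300, lr=0.1, seed=0):
    """Fit the toy autoencoder with per-example SGD on binary cross-entropy."""
    rng = random.Random(seed)
    n_labels = len(label_rows[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_labels)]
           for _ in range(latent_dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)]
           for _ in range(n_labels)]
    bias = [0.0] * n_labels
    history = []  # mean reconstruction loss per epoch
    for _ in range(epochs):
        total = 0.0
        for y in label_rows:
            z = encode(enc, y)
            out = decode(dec, bias, z)
            for o, t in zip(out, y):
                o = min(max(o, 1e-12), 1.0 - 1e-12)  # guard against log(0)
                total -= math.log(o) if t else math.log(1.0 - o)
            # Gradient of the loss w.r.t. the decoder pre-activations.
            err = [o - t for o, t in zip(out, y)]
            grad_z = [sum(err[j] * dec[j][i] for j in range(n_labels))
                      for i in range(latent_dim)]
            for j in range(n_labels):
                for i in range(latent_dim):
                    dec[j][i] -= lr * err[j] * z[i]
                bias[j] -= lr * err[j]
            for i in range(latent_dim):
                for j in range(n_labels):
                    enc[i][j] -= lr * grad_z[i] * y[j]
        history.append(total / len(label_rows))
    return enc, dec, bias, history


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(query, doc_vecs, label_rows, model, k=3, threshold=0.5):
    """Average the latent label codes of the k nearest documents, then decode.

    Averaging in latent space is an assumption made for this sketch.
    """
    enc, dec, bias = model
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query, doc_vecs[i]),
                    reverse=True)[:k]
    codes = [encode(enc, label_rows[i]) for i in ranked]
    mean_code = [sum(c[i] for c in codes) / len(codes)
                 for i in range(len(codes[0]))]
    probs = decode(dec, bias, mean_code)
    return [j for j, p in enumerate(probs) if p >= threshold], probs


# Hypothetical toy data: 6 documents as term-weight vectors, 6 correlated labels.
DOCS = [[1, 1, 0, 0], [1, 0.8, 0, 0.1], [0.9, 1, 0.1, 0],
        [0, 0, 1, 1], [0.1, 0, 1, 0.9], [0, 0.2, 0.8, 1]]
LABELS = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0],
          [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 1, 1]]

enc, dec, bias, history = train_label_autoencoder(LABELS, latent_dim=2)
predicted, probs = knn_predict([1, 0.9, 0, 0], DOCS, LABELS, (enc, dec, bias), k=3)
print("predicted label indices:", predicted)
```

In this sketch the latent dimension (2) is smaller than the label space (6), mirroring the "reduced-size latent space" described above at toy scale; a realistic setting such as MeSH would involve tens of thousands of labels and a proper neural network library.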

classification, representation, vector

Country:
- South America > Brazil
  - Rio de Janeiro > Rio de Janeiro (0.04)
- Oceania > New Zealand
  - North Island > Auckland Region > Auckland (0.04)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.28)
  - California > San Francisco County
    - San Francisco (0.14)
- Europe
  - Spain
    - Galicia > Ourense Province
      - Ourense (0.04)
    - Catalonia > Barcelona Province
      - Barcelona (0.04)
  - Romania > București - Ilfov Development Region
    - Municipality of Bucharest > Bucharest (0.04)
  - Germany > Baden-Württemberg
    - Karlsruhe Region > Heidelberg (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia > China
  - Hong Kong (0.04)
- Africa > Ethiopia
  - Addis Ababa > Addis Ababa (0.04)

Genre:
- Research Report > New Finding (0.87)

Industry:
- Health & Medicine (1.00)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning > Nearest Neighbor Methods (1.00)
  - Neural Networks (1.00)