Gzip versus bag-of-words for text classification

Opitz, Juri

arXiv.org Artificial Intelligence 

KNN is a simple classifier that relies on distance measurements between data points: for a given test point, we calculate its distance to every point in a labeled training set and check the labels of the K closest points (i.e., the K-Nearest Neighbors), predicting the label that we observe most frequently. Hence, it is straightforward to build a general text classifier from KNN, provided we can equip it with a sensible distance measure between documents. Interestingly, recent findings [4] suggest that we can exploit compression to assess the distance between two documents, by comparing their individual compression lengths to the length of their compressed concatenation (we call this measurement gzip). With this approach, [4] show strong text classification performance across different data sets, sometimes achieving higher accuracy than trained neural classifiers such as BERT [2], especially in scenarios where only little training data is available. Against this background, it is not surprising that gzip has quickly attracted considerable attention.
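To make this concrete, below is a minimal sketch of such a gzip-based KNN classifier in Python. The function names (clen, ncd, knn_predict) and the space-joined concatenation are illustrative assumptions, not necessarily the exact setup of [4]; the distance itself follows the normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) denotes compressed length.

```python
import gzip
from collections import Counter

def clen(text: str) -> int:
    # C(x): length in bytes of the gzip-compressed UTF-8 encoding of x
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)  # concatenation; the exact joining used in [4] may differ
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(test_doc: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # Compute the distance from the test document to every labeled training
    # document, then take a majority vote over the labels of the k nearest ones.
    dists = sorted((ncd(test_doc, doc), label) for doc, label in train)
    top_k_labels = [label for _, label in dists[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]
```

Note that, unlike a parametric classifier, this approach involves no training step: the full cost is paid at inference time, since every prediction compresses the test document together with each training document.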
