An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

Dec-23-2024–arXiv.org Artificial Intelligence

This study investigates the performance of three popular tokenization tools (MeCab, Sudachi, and SentencePiece) when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
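To make the role of the tokenizer concrete, here is a minimal sketch of TF-IDF weighting with a pluggable tokenizer, using only the Python standard library. It is not the study's implementation. The `char_bigrams` function is a hypothetical stand-in for MeCab, Sudachi, or SentencePiece, the two sentences are toy inputs invented for illustration, and the smoothed IDF formula with L2 normalization is one common variant assumed here, not something the abstract specifies.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Stand-in tokenizer (assumption): overlapping character bigrams.

    A real pipeline would call MeCab, Sudachi, or SentencePiece here.
    """
    text = text.replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def tfidf(docs, tokenize):
    """Return L2-normalized TF-IDF vectors (as dicts) and the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors, idf


# Toy inputs for illustration only ("this movie is the best / the worst").
docs = ["この映画は最高", "この映画は最悪"]
vectors, idf = tfidf(docs, char_bigrams)
```

Because the tokenizer is passed in as a function, swapping in a different segmenter changes the vocabulary, and therefore the feature space the classifier sees, without touching the weighting step. Tokens shared by every document receive the minimum IDF, while tokens unique to one document, such as the bigram that distinguishes the two toy sentences, are weighted more heavily.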

machine learning, natural language, text classification

Country:
- Asia > Japan > Honshū (0.14)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Classification (1.00)
  - Machine Learning
    - Statistical Learning > Regression (0.71)
    - Neural Networks > Deep Learning (0.70)
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.51)