Identifying Semantically Difficult Samples to Improve Text Classification

Mujumdar, Shashank; Mehta, Stuti; Patel, Hima; Mitra, Suman

Feb-13-2023, arXiv.org Artificial Intelligence

In this paper, we investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task. We define difficult samples as non-obvious cases for text classification, identified by analysing them in the semantic embedding space: specifically, (i) semantically similar samples that belong to different classes and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function to measure the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9%, and discuss qualitative results to show the effectiveness of our approach in identifying difficult samples for a text classification model.
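
The two notions of difficulty in the abstract can be illustrated with a small sketch. The abstract does not give the paper's penalty function, so the scoring rule below is an assumption chosen for illustration only: it uses cosine similarity between embeddings, and the function name `difficulty_scores` and the toy vectors are hypothetical. A sample scores higher when (i) it is close to samples of other classes or (ii) it is far from samples of its own class.

```python
import numpy as np

def difficulty_scores(embeddings, labels):
    """Illustrative per-sample difficulty score (an assumed rule, not the paper's)."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Normalise rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T

    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # ignore each sample's pair with itself
    diff = y[:, None] != y[None, :]

    # (i) mean similarity to samples of a different class (negative values clipped to 0).
    cross = np.where(diff, np.clip(sim, 0.0, None), 0.0).sum(axis=1)
    cross = cross / np.maximum(diff.sum(axis=1), 1)

    # (ii) mean dissimilarity (1 - cosine similarity) to samples of the same class.
    within = np.where(same, 1.0 - sim, 0.0).sum(axis=1)
    within = within / np.maximum(same.sum(axis=1), 1)

    return cross + within

# Toy 2-D "embeddings": the last point sits next to class 0 but is labelled 1.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
toy_labels = [0, 0, 1, 1, 1]
scores = difficulty_scores(toy_embeddings, toy_labels)
print(scores.round(3), "most difficult index:", int(np.argmax(scores)))
```

In practice the embeddings would come from a sentence encoder rather than hand-written vectors, and the two terms could be weighted differently; both choices are left open here because the abstract does not specify them.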

difficult sample, machine learning, natural language, (15 more...)

Country:
- Asia > India (0.05)
- South America > Brazil (0.04)
- North America > United States
  - Massachusetts > Suffolk County > Boston (0.04)

Genre:
- Research Report (0.82)

Industry:
- Information Technology (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Classification (1.00)
  - Machine Learning (1.00)
