Attacking interpretable NLP systems

Abdukhamidov, Eldor, Abuhmed, Tamer, Santos, Joanna C. S., Abuhamad, Mohammed

Jul-23-2025–arXiv.org Artificial Intelligence

--Studies have shown that machine learning systems are vulnerable to adversarial examples in theory and practice. Where previous attacks have focused mainly on visual models that exploit the difference between human and machine perception, text-based models have also fallen victim to these attacks. However, these attacks often fail to maintain the semantic meaning of the text and similarity. This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems, designed to mislead the classifier while keeping the interpretation similar to benign inputs, thus exploiting trust in system transparency. AdvChar achieves this by making less noticeable modifications to text input, forcing the deep learning classifier to make incorrect predictions and preserve the original interpretation. We use an interpretation-focused scoring approach to determine the most critical tokens that, when changed, can cause the classifier to misclassify the input. We apply simple character-level modifications to measure the importance of tokens, minimizing the difference between the original and new text while generating adversarial interpretations similar to benign ones. We thoroughly evaluated AdvChar by testing it against seven NLP models and three interpretation models using benchmark datasets for the classification task. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples. Deep learning models, particularly in Natural Language Processing (NLP), have revolutionized how machines understand and interact with human language. These advancements have enabled various applications, from simple spellcheck and keyword search to complex tasks such as sentiment analysis [1], machine translation [2], and chatbot interactions [3]. The integration of NLP into our daily digital interactions, such as through search engines, virtual assistants, and recommendation systems, highlights its importance. However, these models are shown to be susceptible to adversarial attacks [4]. Adversarial attacks in NLP, which involve careful manipulations of input data leading to incorrect model outputs, are a growing concern. These attacks are especially stealthy because of the complex nature of human language, which is filled with idioms, metaphors, and context-dependent meanings [5]. Eldor Abdukhamidov and Tamer Abuhmed are with the Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea.(E-mail:

interpreter, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Jul-23-2025

arXiv.org PDF

Add feedback

Country:
- Asia > South Korea > Gyeonggi-do > Suwon (0.24)

Genre:
- Overview (1.00)
- Research Report > New Finding (0.93)

Industry:
- Information Technology > Security & Privacy (1.00)
- Government > Military (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found