Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

Yang, Han, Lan, Jian, Liu, Yihong, Schütze, Hinrich, Seidl, Thomas

Sep-1-2025–arXiv.org Artificial Intelligence

Autoregressive language models are vulnerable to orthographic attacks, in which input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while also extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, the WMT24 dataset, and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
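
To make the out-of-vocabulary problem concrete, here is a minimal sketch of an orthographic perturbation. It is not the paper's attack procedure. The `HOMOGLYPHS` table, the `perturb` and `oov_rate` helpers, and the word-level `TOY_VOCAB` are all hypothetical, chosen only for illustration. A real subword tokenizer would split unknown strings into fragments or bytes rather than reject them outright.

```python
# Hypothetical homoglyph table: Latin letters mapped to visually similar
# Cyrillic letters. Illustrative only; not the paper's perturbation set.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
}


def perturb(text: str) -> str:
    """Replace every character that has a look-alike in HOMOGLYPHS."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


# Toy word-level vocabulary standing in for a tokenizer's vocabulary
# (an assumption made to keep the example small).
TOY_VOCAB = {"the", "cat", "sat", "on", "mat"}


def oov_rate(text: str, vocab=TOY_VOCAB) -> float:
    """Fraction of whitespace-separated tokens missing from the vocabulary."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


clean = "the cat sat on the mat"
noisy = perturb(clean)  # looks almost identical when displayed

print(oov_rate(clean))  # every word is in the toy vocabulary
print(oov_rate(noisy))  # every word contains at least one swapped letter
```

The perturbed string renders almost the same as the original, which is why a model that reads rendered pixels instead of vocabulary indices could plausibly be less affected by such substitutions.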

artificial intelligence, machine learning, natural language

Country:
- Europe (0.95)
- North America > United States (0.94)
- Asia > Middle East
  - UAE (0.28)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (1.00)
  - Natural Language > Machine Translation (0.69)
