Knowledge of Pretrained Language Models on Surface Information of Tokens
Tatsuya Hiraoka, Naoaki Okazaki
Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of the models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. Additionally, the results imply that there is a bottleneck on the decoder side in terms of utilizing acquired knowledge.
Figure 1: Input and output examples when asking GPT-3.5 Turbo about the surface information of words (as of Jan. 1st, 2024). The Japanese example has the same meaning as the English text, asking for the length of, and the third character in, 人類学者 (anthropologist).
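As a concrete illustration of the paper's embedding experiments, the sketch below trains a linear probe to recover token length from a model's static subword embedding table. It is a minimal sketch under stated assumptions, not the authors' exact setup: the model name (bert-base-uncased), the logistic-regression probe, and the train/test split are all illustrative choices.

```python
# Hedged sketch: probe whether static subword embeddings encode token length.
# Assumptions (not from the paper): model choice, probe type, data split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: stand-in for the 12 studied PLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Static input-embedding table: one vector per vocabulary token.
emb = model.get_input_embeddings().weight.detach().numpy()

X, y = [], []
for token, idx in tokenizer.get_vocab().items():
    if token in tokenizer.all_special_tokens:
        continue  # skip [CLS], [SEP], etc.
    surface = token[2:] if token.startswith("##") else token  # drop subword marker
    if surface:
        X.append(emb[idx])
        y.append(len(surface))  # label = surface length in characters

X, y = np.array(X), np.array(y)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# If a linear probe beats the majority-class baseline, the embeddings carry
# recoverable information about token length.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:   ", probe.score(X_te, y_te))
print("majority baseline:", np.bincount(y_te).max() / len(y_te))
```

The Figure 1 setup can likewise be approximated by querying GPT-3.5 Turbo directly. The sketch below assumes the openai Python client; the prompt wording and the helper name ask_surface_info are illustrative, not the authors' prompts.

```python
# Hedged sketch of a Figure 1 style query: a word's length and its character
# at a given position. Prompt wording and helper name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_surface_info(word: str, position: int) -> str:
    prompt = (
        f"How many characters are in the word '{word}'? "
        f"What is character number {position} in '{word}'?"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# English analogue of the Japanese example 人類学者 (anthropologist).
print(ask_surface_info("anthropologist", 3))
```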
arXiv.org Artificial Intelligence
Feb-22-2024
- Country:
- Asia > Japan
- Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Europe
- Czechia > South Moravian Region
- Brno (0.04)
- Germany > Berlin (0.04)
- North America
- Canada > Ontario
- Toronto (0.04)
- United States
- Louisiana > Orleans Parish
- New Orleans (0.04)
- Minnesota > Hennepin County
- Minneapolis (0.14)
- Oceania > Australia
- Genre:
- Research Report > New Finding (0.67)