Knowledge of Pretrained Language Models on Surface Information of Tokens

Tatsuya Hiraoka, Naoaki Okazaki

arXiv.org Artificial Intelligence 

Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. Additionally, the results imply that there is a bottleneck on the decoder side in terms of […]

[Figure 1: Input and output examples when asking GPT-3.5 Turbo about the surface information of words (as of 1st Jan. 2024). The Japanese example has the same meaning as the English text, asking the length of and third character in 人類学者 (anthropologist).]
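The kind of evaluation shown in Figure 1 can be illustrated with a minimal sketch: gold answers for surface-information questions (token length, k-th character, substring containment) are just string operations, and a model's free-form replies are scored against them. The `model_answer` function below is a hypothetical stub standing in for a real LM API call; its canned replies, the question wording, and all function names are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of scoring a model on token-surface questions.
# Gold answers come from plain string operations; `model_answer` is a
# stub standing in for a real pretrained-LM query (e.g., a chat API).

def gold_length(token: str) -> int:
    """Character length of the token (e.g., 人類学者 -> 4)."""
    return len(token)

def gold_kth_char(token: str, k: int) -> str:
    """1-indexed k-th character of the token."""
    return token[k - 1]

def gold_contains(token: str, sub: str) -> bool:
    """Whether `sub` is a substring of `token`."""
    return sub in token

def model_answer(question: str) -> str:
    # Stub: a real experiment would send `question` to a pretrained LM.
    # These hard-coded replies are assumptions for illustration only.
    canned = {
        "How many characters are in '人類学者'?": "4",
        "What is the 3rd character of '人類学者'?": "学",
    }
    return canned.get(question, "unknown")

def score(token: str, k: int) -> dict:
    """Compare the model's answers against gold string operations."""
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(k, "th")
    q_len = f"How many characters are in '{token}'?"
    q_chr = f"What is the {k}{suffix} character of '{token}'?"
    return {
        "length_correct": model_answer(q_len) == str(gold_length(token)),
        "char_correct": model_answer(q_chr) == gold_kth_char(token, k),
    }

print(score("人類学者", 3))
```

Averaging such per-question correctness over a vocabulary would yield the kind of accuracy comparison the paper reports across its 12 English and Japanese models.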
