The Download: GPT-4o's polluted Chinese training data, and astronomy's AI challenge
Soon after OpenAI released GPT-4o last Monday, some Chinese speakers noticed that something seemed off about the newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

Humans read in words, but LLMs read in tokens: distinct units of text that carry consistent, significant meanings. GPT-4o is supposed to be better than its predecessors at multilingual tasks, and many of those advances came from a new tokenization tool that does a better job of compressing text in non-English languages. But, at least for Chinese, the new tokenizer has introduced a disproportionate number of meaningless phrases, and experts say that is likely due to insufficient cleaning and filtering of the data before the tokenizer was trained. If left unresolved, the problem could lead to hallucinations, poor performance, and misuse.
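To see why polluted training data ends up inside the tokenizer itself, here is a minimal sketch of how a learned vocabulary can absorb a frequent spam phrase as a single token. The vocabulary, phrase, and token IDs below are hypothetical toys for illustration; GPT-4o's actual tokenizer learns a vocabulary of roughly 200,000 tokens from its corpus, so any string that appears often enough, spam included, can be merged into one token.

```python
# Hypothetical vocabulary: strings frequent in the training corpus get
# their own token IDs. A spam phrase that floods the corpus is merged
# into a single token, just like a common word would be.
VOCAB = {
    "claim your free prize now": 1001,  # hypothetical spam phrase
    "hello": 17,
    "world": 42,
    " ": 3,
}

def tokenize(text: str, vocab: dict) -> list[int]:
    """Greedy longest-match tokenization over the vocabulary.
    Characters with no vocabulary entry fall back to one token each
    (here, their Unicode code points)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            tokens.append(ord(text[i]))  # fallback: one token per character
            i += 1
    return tokens

# An ordinary sentence splits into word tokens...
print(tokenize("hello world", VOCAB))              # [17, 3, 42]
# ...while the entire spam phrase collapses into a single token.
print(tokenize("claim your free prize now", VOCAB))  # [1001]
```

The point of the sketch: the tokenizer's vocabulary is a mirror of its training data, so if that data is not cleaned first, spam phrases become first-class tokens that the model then has to make sense of.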
May 20, 2024, 12:10:00 GMT