The Download: GPT-4o's polluted Chinese training data, and astronomy's AI challenge
Soon after OpenAI released GPT-4o last Monday, some Chinese speakers noticed that something seemed off about the newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

Humans read in words, but LLMs read in tokens: distinct units of text that carry consistent, significant meanings. GPT-4o is supposed to be better than its predecessors at multilingual tasks, and many of those advances came from a new tokenization tool that does a better job of compressing text in non-English languages. But, at least for Chinese, the new tokenizer has introduced a disproportionate number of meaningless phrases, and experts say that is likely due to insufficient cleaning and filtering of the data before the tokenizer was trained. If left unresolved, the problem could lead to hallucinations, poor performance, and misuse.
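To see why polluted training data ends up inside the tokenizer itself, here is a minimal sketch of how a learned vocabulary can absorb a frequent spam phrase as a single token. The vocabulary, phrase, and token IDs below are hypothetical toys for illustration; GPT-4o's actual tokenizer learns a vocabulary of roughly 200,000 tokens from its corpus, so any string that appears often enough, spam included, can be merged into one token.

```python
# Hypothetical vocabulary: strings frequent in the training corpus get
# their own token IDs. A spam phrase that floods the corpus is merged
# into a single token, just like a common word would be.
VOCAB = {
    "claim your free prize now": 1001,  # hypothetical spam phrase
    "hello": 17,
    "world": 42,
    " ": 3,
}

def tokenize(text: str, vocab: dict) -> list[int]:
    """Greedy longest-match tokenization over the vocabulary.
    Characters with no vocabulary entry fall back to one token each
    (here, their Unicode code points)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            tokens.append(ord(text[i]))  # fallback: one token per character
            i += 1
    return tokens

# An ordinary sentence splits into word tokens...
print(tokenize("hello world", VOCAB))              # [17, 3, 42]
# ...while the entire spam phrase collapses into a single token.
print(tokenize("claim your free prize now", VOCAB))  # [1001]
```

The point of the sketch: the tokenizer's vocabulary is a mirror of its training data, so if that data is not cleaned first, spam phrases become first-class tokens that the model then has to make sense of.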
May 20, 2024, 12:10:00 GMT