OpenAI's latest blunder shows the challenges facing Chinese AI models
Add to that another thing OpenAI fumbled with GPT-4o: the data it used to train its tokenizer, a tool that helps the model parse and process text more efficiently, is polluted by Chinese spam websites. As a result, the model's Chinese token library is full of phrases related to pornography and gambling. This could worsen some problems that are common with AI models: hallucinations, poor performance, and misuse. I wrote about it on Friday after several researchers and AI industry insiders flagged the problem. They took a look at GPT-4o's public token library, which has been significantly updated with the new model to improve support for non-English languages, and saw that more than 90 of the 100 longest Chinese tokens in the model are from spam websites.
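For readers who want to repeat that kind of check, here is a rough sketch of how one might pull the longest Chinese tokens out of a vocabulary. It is not the researchers' own method. The function names are made up for illustration, the test for "Chinese" is a simplification that only covers the main CJK ideograph block, and the loader assumes the open-source `tiktoken` package is installed and can download the `o200k_base` vocabulary that GPT-4o uses.

```python
from typing import Iterable, List


def is_cjk(ch: str) -> bool:
    """True if ch is in the main CJK Unified Ideographs block (a simplification)."""
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab: Iterable[bytes], n: int = 100) -> List[str]:
    """Return the n longest tokens made up entirely of CJK ideographs."""
    chinese = set()
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # token is a partial byte sequence, not a whole string
        stripped = text.strip()  # many tokens carry a leading space
        if stripped and all(is_cjk(c) for c in stripped):
            chinese.add(stripped)
    # Longest first; ties are broken by code point order so the result is stable
    return sorted(chinese, key=lambda t: (-len(t), t))[:n]


def load_gpt4o_vocab() -> List[bytes]:
    """Hypothetical helper: assumes tiktoken is installed and has network access."""
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    return enc.token_byte_values()
```

Calling `longest_chinese_tokens(load_gpt4o_vocab())` would produce a list to read through by hand. Deciding whether a given token comes from a spam site still takes a human who reads Chinese; the code only surfaces the candidates.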