 hokkien


Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems

Lu, Bo-Han, Lin, Yi-Hsuan, Lee, En-Shiun Annie, Tsai, Richard Tzong-Han

arXiv.org Artificial Intelligence

Machine translation focuses mainly on high-resource languages (HRLs), while low-resource languages (LRLs) like Taiwanese Hokkien are relatively under-explored. The study aims to address this gap by developing a dual translation model between Taiwanese Hokkien and both Traditional Mandarin Chinese and English. We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Taiwanese Hokkien Han and Traditional Mandarin Chinese. Our comprehensive experiments involve translation tasks across various writing systems of Taiwanese Hokkien as well as between Taiwanese Hokkien and other HRLs. We find that the use of a limited monolingual corpus still further improves the model's Taiwanese Hokkien capabilities. We then utilize our translation model to standardize all Taiwanese Hokkien writing systems into Hokkien Han, resulting in further performance improvements. Additionally, we introduce an evaluation method incorporating back-translation and GPT-4 to ensure reliable translation quality assessment even for LRLs. The study contributes to narrowing the resource gap for Taiwanese Hokkien and empirically investigates the advantages and limitations of pre-training and fine-tuning based on LLaMA 2. Keywords: low-resource language, large language model, neural machine translation, Taiwanese Hokkien


Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

Lu, Sin-En, Lu, Bo-Han, Lu, Chao-Yi, Tsai, Richard Tzong-Han

arXiv.org Artificial Intelligence

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, limiting the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset to mitigate the limitation, overcome the morphological issue under the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train the XLM (cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.


Speech-to-Speech Translation For A Real-world Unwritten Language

Chen, Peng-Jen, Tran, Kevin, Yang, Yilin, Du, Jingfei, Kao, Justine, Chung, Yu-An, Tomasello, Paden, Duquenne, Paul-Ambroise, Schwenk, Holger, Gong, Hongyu, Inaguma, Hirofumi, Popuri, Sravya, Wang, Changhan, Pino, Juan, Hsu, Wei-Ning, Lee, Ann

arXiv.org Artificial Intelligence

We study speech-to-speech translation (S2ST) that translates speech from one language into another language and focuses on building systems to support languages without standard text writing systems. We use English-Taiwanese Hokkien as a case study, and present an end-to-end solution from training data collection, modeling choices to benchmark dataset release. First, we present efforts on creating human annotated data, automatically mining data from large unlabeled speech datasets, and adopting pseudo-labeling to produce weakly supervised data. On the modeling, we take advantage of recent advances in applying self-supervised discrete representations as target for prediction in S2ST and show the effectiveness of leveraging additional text supervision from Mandarin, a language similar to Hokkien, in model training. Finally, we release an S2ST benchmark set to facilitate future research in this field. The demo can be found at https://huggingface.co/spaces/facebook/Hokkien_Translation .


Why Meta developed an AI translation system? - FutureTech

#artificialintelligence

In an effort to break down language barriers, Meta has created a new AI translator that can convert spoken languages such as Hokkien into spoken English. Hokkien, a dialect of southern Min Chinese, is primarily spoken and lacks a standard writing system, making it difficult to develop translation tools for it. The open-source translation system, which is part of Meta's Universal Speech Translator (UST) project, has made significant progress in this challenge. The company, formerly known as Facebook, hopes that this, along with other AI methods in development, will eventually allow for real-time speech-to-speech translation across hundreds of languages, including spoken languages. Languages such as Hokkien are difficult to translate because machine translation tools need a large amount of written text to train on, and such languages lack a widely used writing system.


Meta AI powers spoken-only language translation

#artificialintelligence

After plans to break physical barriers with his metaverse initiative, Meta CEO Mark Zuckerberg revealed plans for another globe-spanning artificial intelligence (AI) project earlier this year, this time a universal translation tool unlike any other. At the same time, the company that made itself famous (and notorious) for its social media networks also introduced another AI-powered tool, a virtual assistant. Both of these intelligent applications were intended to have practical use cases in Zuckerberg's metaverse, but they will also have wider business applications that Meta is all too aware of. AI virtual assistants, of course, are already in wider use by organizations as chatbots to handle basic customer requests and interactions across a variety of digital services, including Meta's own popular platforms like Facebook Messenger, Instagram, and WhatsApp Business. The other, less well-known AI use case is language translation, which offers an alternative to relying on human translators for accurate, expert-quality translations in real time.


Perceptron: AI saving whales, steadying gaits and banishing traffic

#artificialintelligence

Research in the field of machine learning and AI, now a key technology in practically every industry and company, is far too voluminous for anyone to read it all. This column, Perceptron, aims to collect some of the most relevant recent discoveries and papers -- particularly in, but not limited to, artificial intelligence -- and explain why they matter. Over the past few weeks, researchers at MIT have detailed their work on a system to track the progression of Parkinson's patients by continuously monitoring their gait speed. Elsewhere, Whale Safe, a project spearheaded by the Benioff Ocean Science Laboratory and partners, launched buoys equipped with AI-powered sensors in an experiment to prevent ships from striking whales. Other aspects of ecology and academics also saw advances powered by machine learning.


The Morning After: The Silent Hill universe is expanding, with help from J.J. Abrams

Engadget

Konami today dropped a ton of news about the future of its iconic horror franchise. Aside from confirming the remake of Silent Hill 2, the studio revealed three new games. Townfall comes from Annapurna Interactive and No Code, a Glasgow studio known for strong narrative titles like Observation and Stories Untold. The short teaser for Townfall looks to be the most traditional Silent Hill game of the trio. Ascension, due out in 2023, is the least game-like installment, but it will feature the influence of J.J. Abrams.


Meta Has Developed AI for Real-Time Translation of Hokkien

#artificialintelligence

Meta is chugging along on its Universal Speech Translator, which aims to train an artificial intelligence to translate hundreds of languages in real time. Today, the tech giant claims to have built the first artificial intelligence to translate Hokkien, a language that is primarily spoken and not written. Hokkien is spoken by approximately 49 million people in countries like China, Taiwan, Singapore, Malaysia, and the Philippines. Typically, to train an AI to understand human speech (and, in Meta's case, to translate it), researchers feed the computer a large dataset of written transcripts. But Meta says that Hokkien is one of nearly 3,500 languages that are primarily spoken, meaning Hokkien does not have a large enough dataset to train the artificial intelligence, since the language does not have a unified writing system.

  Industry: Information Technology (0.39)

Meta AI announces first AI-powered speech translation system for an unwritten language

#artificialintelligence

Artificial speech translation is a rapidly emerging artificial intelligence (AI) technology. Initially created to aid communication among people who speak different languages, this speech-to-speech translation technology (S2ST) has found its way into several domains. For example, global tech conglomerates are now using S2ST for directly translating shared documents and audio conversations in the metaverse.


Meta's AI translator can interpret unwritten languages

Engadget

Nearly half of the world's roughly 7,000 known languages (four in ten of them) exist without an accompanying written component. These unwritten languages pose a unique problem for modern machine learning translation systems, which typically need to convert speech to written words before translating to the new language and converting the text back to speech. It is a problem that Meta has reportedly addressed with its latest open-source language AI advancement. The work is part of Meta's Universal Speech Translator (UST) program, which is working to develop real-time speech-to-speech translation so that Metaverse denizens can more easily interact (read: sexually harass one another). As part of this project, Meta researchers looked at Hokkien, an unwritten language spoken throughout Asia's diaspora and one of Taiwan's official languages. Machine learning translation systems typically require extensive labelable examples of the language, both written and spoken, to train on, which is precisely what unwritten languages like Hokkien don't have.