Na-Thalang, Adisai
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Cahyawijaya, Samuel, Lovenia, Holy, Moniz, Joel Ruben Antony, Wong, Tack Hwa, Farhansyah, Mohammad Rifqi, Maung, Thant Thiri, Hudi, Frederikus, Anugraha, David, Habibi, Muhammad Ravi Shulthan, Qorib, Muhammad Reza, Agarwal, Amit, Imperial, Joseph Marvin, Patel, Hitesh Laxmichand, Feliren, Vicky, Nasution, Bahrul Ilmi, Rufino, Manuel Antonio, Winata, Genta Indra, Rajagede, Rian Adam, Catalan, Carlos Rafael, Imam, Mohamed Fazli, Pattnayak, Priyaranjan, Pranida, Salsabila Zahirah, Pratama, Kevin, Bangera, Yeshil, Na-Thalang, Adisai, Monderin, Patricia Nicole, Song, Yueqi, Simon, Christian, Ng, Lynnette Hui Xian, Sapan, Richardy Lobo', Rafi, Taki Hasan, Wang, Bin, Supryadi, Veerakanjana, Kanyakorn, Ittichaiwong, Piyalitt, Roque, Matthew Theodore, Vincentio, Karissa, Kreangphet, Takdanai, Artkaew, Phakphum, Palgunadi, Kadek Hendrawan, Yu, Yanzhi, Hastuti, Rochana Prih, Nixon, William, Bangera, Mithil, Lim, Adrian Xuan Wei, Khine, Aye Hninn, Zhafran, Hanif Muhammad, Ferdinan, Teddy, Izzani, Audra Aurora, Singh, Ayushman, Evan, Krito, Jauza Akbar, Anugraha, Michael, Ilasariya, Fenal Ashokbhai, Li, Haochen, Daniswara, John Amadeo, Tjiaranata, Filbert Aurelian, Yulianrifat, Eryawan Presma, Udomcharoenchaikit, Can, Ansori, Fadil Risdian, Ihsani, Mahardika Krisna, Nguyen, Giang, Barik, Anab Maulana, Velasco, Dan John, Genadi, Rifo Ahmad, Saha, Saptarshi, Wei, Chengwei, Flores, Isaiah, Chen, Kenneth Ko Han, Santos, Anjela Gail, Lim, Wan Shen, Phyo, Kaung Si, Santos, Tim, Dwiastuti, Meisyarah, Luo, Jiayun, Cruz, Jan Christian Blaise, Hee, Ming Shan, Hanif, Ikhlasul Akmal, Hakim, M. Alif Al, Sya'ban, Muhammad Rizky, Kerdthaisong, Kun, Miranda, Lester James V., Koto, Fajri, Fatyanosa, Tirana Noor, Aji, Alham Fikri, Rosal, Jostin Jerico, Kevin, Jun, Wijaya, Robert, Kampman, Onno P., Zhang, Ruochen, Karlsson, Börje F., Limkonchotiwat, Peerat
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. In total, we gather 1.28M culturally relevant SEA images, a collection more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
Pipatanakul, Kunat, Manakul, Potsawee, Nitarach, Natapong, Sirichotedumrong, Warit, Nonesung, Surapon, Jaknamon, Teetouch, Pengpun, Parinthapat, Taveekitworachai, Pittawat, Na-Thalang, Adisai, Sripaisarnmongkol, Sittipong, Jirayoot, Krisanapong, Tharnpipitchai, Kasima
This paper introduces Typhoon 2, a series of text and multimodal large language models optimized for the Thai language. The series includes models for text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models, such as Llama 3 and Qwen2, and we perform continual pre-training on a mixture of English and Thai data. We employ post-training techniques to enhance Thai language performance while preserving the base models' original capabilities. We release text models across a range of sizes, from 1 to 70 billion parameters, available in both base and instruction-tuned variants. To guardrail text generation, we release Typhoon2-Safety, a classifier enhanced for Thai cultures and language. Typhoon2-Vision improves Thai document understanding while retaining general visual capabilities, such as image captioning. Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs.