PyThaiNLP: Thai Natural Language Processing in Python
Phatthiyaphaibun, Wannaphong, Chaovavanich, Korakot, Polpanumas, Charin, Suriyawongkul, Arthit, Lowphansirikul, Lalita, Chormai, Pattarawat, Limkonchotiwat, Peerat, Suntorntip, Thanathip, Udomcharoenchaikit, Can
We present PyThaiNLP, a free and open-source natural language processing (NLP) library for the Thai language, implemented in Python. It provides a wide range of software, models, and datasets for Thai. We first provide a brief historical context of Thai-language NLP tools prior to the development of PyThaiNLP. We then outline the functionalities it provides, as well as its datasets and pre-trained language models. We further summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.
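A minimal usage sketch (assuming a standard `pip install pythainlp`): `word_tokenize` and `romanize` are part of PyThaiNLP's documented API, while the example sentence and printed output are purely illustrative.

```python
# Minimal PyThaiNLP usage sketch; assumes `pip install pythainlp`.
from pythainlp import word_tokenize
from pythainlp.transliterate import romanize

text = "ภาษาไทยเป็นภาษาที่สวยงาม"  # "Thai is a beautiful language"

# Word segmentation with the default dictionary-based "newmm" tokenizer.
tokens = word_tokenize(text, engine="newmm")
print(tokens)

# Romanization of a Thai word.
print(romanize("ภาษาไทย"))
```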
An Efficient Self-Supervised Cross-View Training For Sentence Embedding
Limkonchotiwat, Peerat, Ponwitayarat, Wuttikorn, Lowphansirikul, Lalita, Udomcharoenchaikit, Can, Chuangsuwanich, Ekapol, Nutanong, Sarana
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to fine-tune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to five baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using five PLMs with parameter counts ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
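For illustration, below is a minimal sketch of the contrastive fine-tuning baseline mentioned in the abstract (a SimCSE-style setup that uses dropout noise to create two views of each sentence), not the SCT framework itself; the model name, pooling choice, and hyperparameters are assumptions.

```python
# SimCSE-style contrastive fine-tuning sketch (illustrative baseline, not SCT).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder PLM; any small encoder works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.train()  # keep dropout active so two passes yield two distinct "views"

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling

def contrastive_loss(sentences, temperature=0.05):
    # Two forward passes over the same batch; dropout produces the positive pairs.
    z1, z2 = embed(sentences), embed(sentences)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))  # each sentence matches its own second view
    return F.cross_entropy(sim, labels)  # in-batch negatives

optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-5)
loss = contrastive_loss(["a sentence", "another sentence", "yet another sentence"])
loss.backward()
optimizer.step()
```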
Typo-Robust Representation Learning for Dense Retrieval
Tasawong, Panuthep, Ponwitayarat, Wuttikorn, Limkonchotiwat, Peerat, Udomcharoenchaikit, Can, Chuangsuwanich, Ekapol, Nutanong, Sarana
Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is handling queries that contain misspelled words. A popular approach to handling misspelled queries is to minimize the representation discrepancy between misspelled queries and their pristine counterparts. Unlike existing approaches, which focus only on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.
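As an illustration of the idea described in the abstract, the sketch below combines an alignment term (pulling each misspelled query toward its pristine version) with an in-batch contrastive term (keeping it distinct from surrounding queries). This is a generic sketch under an assumed encoder, loss weight, and example queries, not the released DST implementation.

```python
# Typo-robust query encoding sketch: alignment + in-batch contrast (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder dense-retrieval query encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.train()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

def typo_robust_loss(pristine, misspelled, temperature=0.05, align_weight=1.0):
    zp, zm = embed(pristine), embed(misspelled)
    # Alignment: pull each misspelled query toward its pristine counterpart.
    align = (1 - F.cosine_similarity(zp, zm, dim=-1)).mean()
    # Contrast: each misspelled query should match its own pristine query
    # more closely than the other queries in the batch.
    sim = F.cosine_similarity(zm.unsqueeze(1), zp.unsqueeze(0), dim=-1) / temperature
    contrast = F.cross_entropy(sim, torch.arange(sim.size(0)))
    return contrast + align_weight * align

loss = typo_robust_loss(
    ["cheap flights to bangkok", "python list comprehension"],
    ["chep flights to bangkok", "pyhton list comprehention"],
)
loss.backward()
```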