ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings
Jeon, Jangyeong, Cho, Sangyeon, Ma, Minuk, Kim, Junyoung
arXiv.org Artificial Intelligence
This paper examines the Code-Switching (CS) phenomenon, in which two languages intertwine within a single utterance. There is a clear need for research on CS between English and Korean. We highlight that the Equivalence Constraint (EC) theory, developed for CS in other language pairs, may only partially capture the complexities of English-Korean CS because of the intrinsic grammatical differences between the two languages. To mitigate these challenges, we introduce a novel Koglish dataset tailored to English-Korean CS scenarios. First, we construct the Koglish-GLUE dataset to demonstrate the importance of, and need for, CS datasets across various tasks. We find that multilingual foundation language models yield different outcomes when trained on a monolingual dataset versus a CS dataset. Motivated by this, we hypothesize that SimCSE, which has shown strengths in monolingual sentence embedding, has limitations in CS scenarios. To verify this, we construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach. Building on the CS-augmented Koglish-NLI dataset, we propose ConCSE, a unified contrastive learning and augmentation method for code-switched embeddings that highlights the semantics of CS sentences. Experimental results validate the proposed ConCSE, with an average performance improvement of 1.77% on the Koglish-STS (Semantic Textual Similarity) tasks.
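The abstract does not spell out the ConCSE objective, so the following is only a minimal sketch of the generic in-batch contrastive (InfoNCE) loss that SimCSE-style training builds on, not the authors' method. The function names, the toy vectors, and the idea of treating a code-switched sentence as the positive for its English counterpart are illustrative assumptions; the temperature of 0.05 is the value commonly used with SimCSE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch InfoNCE loss (hypothetical helper, not from the paper).

    anchors[i] and positives[i] are embeddings of a matched pair, e.g. an
    English sentence and an assumed code-switched version of it. Every
    positives[j] with j != i acts as an in-batch negative for anchors[i].
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matched pair as the correct class.
        total += log_denominator - logits[i]
    return total / n


# Toy 2-d embeddings, made up purely for illustration.
english = [[1.0, 0.0], [0.0, 1.0]]
code_switched = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(english, code_switched)
mismatched_loss = info_nce_loss(english, code_switched[::-1])
```

On these toy vectors the loss is lower when each sentence is paired with its own counterpart than when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.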
Aug-28-2024
- Genre:
- Research Report > New Finding (0.68)