Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation

Chen, Zihao, Handa, Hisashi, Ohsaki, Miho, Shirahama, Kimiaki

Mar-12-2025–arXiv.org Artificial Intelligence

Sentence embeddings can be further enhanced by domain adaptation, which adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which uses a data generator to produce sentences that have the same syntactic structure as a sentence in an unlabeled domain-specific corpus but convey a different meaning. The generated sentences are then used to boost contrastive learning that adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC, such as the backbone model and the method used to adapt it, need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset.
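To make the role of generated sentences in contrastive learning concrete, here is a minimal sketch of a contrastive objective in which each anchor sentence is paired with one positive and one generated negative. The abstract does not specify the loss, so everything here is an assumption for illustration: the InfoNCE-style form with in-batch negatives, the temperature value, and the names `l2_normalize` and `contrastive_loss_with_hard_negatives`. The embeddings are plain NumPy arrays standing in for the output of a backbone model.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def contrastive_loss_with_hard_negatives(anchors, positives, hard_negatives,
                                         temperature=0.05):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    All inputs have shape (batch, dim). Row i of `positives` is the positive
    for row i of `anchors`; row i of `hard_negatives` is a generated sentence
    with similar syntax but different meaning. Every other row in the batch
    also acts as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(hard_negatives)

    sim_pos = a @ p.T / temperature  # (batch, batch), diagonal = true pairs
    sim_neg = a @ n.T / temperature  # (batch, batch), all act as negatives
    logits = np.concatenate([sim_pos, sim_neg], axis=1)  # (batch, 2 * batch)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct "class" for anchor i is column i (its own positive).
    idx = np.arange(a.shape[0])
    return float(-log_prob[idx, idx].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(4, 16))
    positives = anchors + 0.01 * rng.normal(size=(4, 16))  # near-duplicates
    negatives = rng.normal(size=(4, 16))                   # unrelated vectors
    print(contrastive_loss_with_hard_negatives(anchors, positives, negatives))
```

Under this formulation, a generated negative whose embedding lies close to its anchor raises the loss, which pushes the model to separate sentences that look alike syntactically but differ in meaning.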
