LCP-dropout: Compression-based Multiple Subword Segmentation for Neural Machine Translation
Nonaka, Keita; Yamanouchi, Kazutaka; I, Tomohiro; Okita, Tsuyoshi; Shimada, Kazutaka; Sakamoto, Hiroshi
arXiv.org Artificial Intelligence
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in Neural Machine Translation. Among such methods, BPE/BPE-dropout is one of the fastest and most effective compared to conventional approaches. However, the compression-based approach has a drawback: because it is deterministic, generating multiple segmentations of the same input is difficult. To overcome this difficulty, we focus on a probabilistic string algorithm called locally-consistent parsing (LCP), which has been applied to achieve optimum compression. Employing the probabilistic mechanism of LCP, we propose LCP-dropout for multiple subword segmentation, which improves on BPE/BPE-dropout, and show that it outperforms various baselines, especially when learning from small training data.
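To make the determinism issue concrete, the following is a minimal sketch of how dropout turns a deterministic BPE segmenter into one that can yield several segmentations of the same word. It illustrates a simplified BPE-dropout baseline only, not the LCP-dropout method proposed in the paper. The merge table `TOY_MERGES`, the function name `bpe_dropout_segment`, and the rule of stopping as soon as no candidate merge survives a round are assumptions made for illustration.

```python
import random

# Hypothetical toy merge table, ordered from highest to lowest priority.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def bpe_dropout_segment(word, merges, p=0.1, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability p.

    Simplified variant (an assumption of this sketch): segmentation stops
    as soon as no candidate merge survives dropout in a round.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    tokens = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout this round.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
```

With `p=0.0` the function reduces to plain deterministic BPE, and with `p=1.0` it returns the character-level split. Intermediate values of `p` give different segmentations across calls, which is the property that multiple-segmentation methods exploit.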
Mar-19-2022