Tang, Kenan
Creative and Context-Aware Translation of East Asian Idioms with GPT-4
Tang, Kenan, Song, Peiyang, Qin, Yao, Yan, Xifeng
As a type of figurative language, an East Asian idiom condenses rich cultural background into only a few characters. Translating such idioms is challenging for human translators, who often resort to choosing a context-aware translation from an existing list of candidates. However, compiling a dictionary of candidate translations demands much time and creativity even for expert translators. To alleviate such burden, we evaluate if GPT-4 can help generate high-quality translations. Based on automatic evaluations of faithfulness and creativity, we first identify Pareto-optimal prompting strategies that can outperform translation engines from Google and DeepL. Then, at a low cost, our context-aware translations can achieve far more high-quality translations per idiom than the human baseline. We open-source all code and data to facilitate further research.
Gender Biases Unexpectedly Fluctuate in the Pre-training Stage of Masked Language Models
Tang, Kenan, Jiang, Hanchun
Masked language models pick up gender biases during pre-training. Such biases are usually attributed to a certain model architecture and its pre-training corpora, with the implicit assumption that other variations in the pre-training process, such as the choices of the random seed or the stopping point, have no effect on the biases measured. However, we show that severe fluctuations exist at the fundamental level of individual templates, invalidating the assumption. Further against the intuition of how humans acquire biases, these fluctuations are not correlated with the certainty of the predicted pronouns or the profession frequencies in pre-training corpora. We release our code and data to benefit future research.