AITopics

2305.04203

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

arXiv.org Artificial IntelligenceMay-6-2023

Label-Free Multi-Domain Machine Translation with Stage-wise Training

Zhang, Fan, Tu, Mei, Kim, Sangha, Liu, Song, Yan, Jinyao

Most multi-domain machine translation models rely on domain-annotated data. Unfortunately, domain labels are usually unavailable in both training processes and real translation scenarios. In this work, we propose a label-free multi-domain machine translation model which requires only a few or no domain-annotated data in training and no domain labels in inference. Our model is composed of three parts: a backbone model, a domain discriminator taking responsibility to discriminate data from different domains, and a set of experts that transfer the decoded features from generic to specific. We design a stage-wise training strategy and train the three parts sequentially. To leverage the extra domain knowledge and improve the training stability, in the discriminator training stage, domain differences are modeled explicitly with clustering and distilled into the discriminator through a multi-classification task. Meanwhile, the Gumbel-Max sampling is adopted as the routing scheme in the expert training stage to achieve the balance of each expert in specialization and generalization. Experimental results on the German-to-English translation task show that our model significantly improves BLEU scores on six different domains and even outperforms most of the models trained with domain-annotated data.

machine learning, natural language, translation, (18 more...)

2305.03949

Country:

North America > United States (0.14)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

arXiv.org Artificial IntelligenceMay-6-2023

Vega-MT: The JD Explore Academy Translation System for WMT22

Zan, Changtong, Peng, Keqin, Ding, Liang, Qiu, Baopu, Liu, Boan, He, Shwai, Lu, Qingyu, Zhang, Zheng, Liu, Chuang, Liu, Weifeng, Zhan, Yibing, Tao, Dacheng

We describe the JD Explore Academy's submission of the WMT 2022 shared general translation task. We participated in all high-resource tracks and one medium-resource track, including Chinese-English, German-English, Czech-English, Russian-English, and Japanese-English. We push the limit of our previous work -- bidirectional training for translation by scaling up two main factors, i.e. language pairs and model sizes, namely the \textbf{Vega-MT} system. As for language pairs, we scale the "bidirectional" up to the "multidirectional" settings, covering all participating languages, to exploit the common knowledge across languages, and transfer them to the downstream bilingual tasks. As for model sizes, we scale the Transformer-Big up to the extremely large model that owns nearly 4.7 Billion parameters, to fully enhance the model capacity for our Vega-MT. Also, we adopt the data augmentation strategies, e.g. cycle translation for monolingual data, and bidirectional self-training for bilingual and monolingual data, to comprehensively exploit the bilingual and monolingual data. To adapt our Vega-MT to the general domain test set, generalization tuning is designed. Based on the official automatic scores of constrained systems, in terms of the sacreBLEU shown in Figure-1, we got the 1st place on {Zh-En (33.5), En-Zh (49.7), De-En (33.7), En-De (37.8), Cs-En (54.9), En-Cs (41.4) and En-Ru (32.7)}, 2nd place on {Ru-En (45.1) and Ja-En (25.6)}, and 3rd place on {En-Ja(41.5)}, respectively; W.R.T the COMET, we got the 1st place on {Zh-En (45.1), En-Zh (61.7), De-En (58.0), En-De (63.2), Cs-En (74.7), Ru-En (64.9), En-Ru (69.6) and En-Ja (65.1)}, 2nd place on {En-Cs (95.3) and Ja-En (40.6)}, respectively.

machine learning, natural language, translation, (17 more...)

2209.09444

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Hubei Province > Wuhan (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages

Zhou, Zhong, Niehues, Jan, Waibel, Alex

In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, endangered languages may be possible and reduce human translation effort. We attempt to leverage translation resources from many rich resource languages to efficiently produce best possible translation quality for a well known text, which is available in multiple languages, in a new, severely low resource language. We examine two approaches: 1. best selection of seed sentences to jump start translations in a new language in view of best generalization to the remainder of a larger targeted text(s), and 2. we adapt large general multilingual translation engines from many other languages to focus on a specific text in a new, unknown language. We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best. If we also select a best set of seed sentences, we can improve average chrF performance on new test languages from a baseline of 21.9 to 50.7, while reducing the number of seed sentences to only around 1,000 in the new, unknown language.

machine translation, proceedings, translation, (11 more...)

2305.03873

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
Oceania (0.04)
(11 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine > Health Care Providers & Services (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

In-context Learning as Maintaining Coherency: A Study of On-the-fly Machine Translation Using Large Language Models

Sia, Suzanna, Duh, Kevin

The phenomena of in-context learning has typically been thought of as "learning from examples". In this work which focuses on Machine Translation, we present a perspective of in-context learning as the desired generation task maintaining coherency with its context, i.e., the prompt examples. We first investigate randomly sampled prompts across 4 domains, and find that translation performance improves when shown in-domain prompts. Next, we investigate coherency for the in-domain setting, which uses prompt examples from a moving window. We study this with respect to other factors that have previously been identified in the literature such as length, surface similarity and sentence embedding similarity. Our results across 3 models (GPTNeo2.7B, Bloom3B, XGLM2.9B), and three translation directions (\texttt{en}$\rightarrow$\{\texttt{pt, de, fr}\}) suggest that the long-term coherency of the prompts and the test sentence is a good indicator of downstream translation performance. In doing so, we demonstrate the efficacy of In-context Machine Translation for on-the-fly adaptation.

coherence, large language model, natural language, (17 more...)

2305.03573

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Joshi, Nisheeth, Katyayan, Pragya

Implications of Multi-Word Expressions on English to Bharti Braille Machine Translation

In this paper, we have shown the improvement of English to Bharti Braille machine translation system. We have shown how we can improve a baseline NMT model by adding some linguistic knowledge to it. This was done for five language pairs where English sentences were translated into five Indian languages and then subsequently to corresponding Bharti Braille. This has been demonstrated by adding a sub-module for translating multi-word expressions. The approach shows promising results as across language pairs, we could see improvement in the quality of NMT outputs. The least improvement was observed in English-Nepali language pair with 22.08% and the most improvement was observed in the English-Hindi language pair with 23.30%.

artificial intelligence, braille, natural language, (15 more...)

doi: 10.1109/ISCON57294.2023.10112137

2305.06157

Country:

Asia > India > Rajasthan (0.05)
Europe > France > Île-de-France > Paris > Paris (0.04)
Asia > Singapore (0.04)
Asia > China (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (0.49)
Education (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Samuel, David, Kutuzov, Andrey, Øvrelid, Lilja, Velldal, Erik

Trained on 100 million words and still in shape: BERT meets British National Corpus

While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.

large language model, machine learning, natural language, (21 more...)

2303.09859

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(13 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
(2 more...)

NorBench -- A Benchmark for Norwegian Language Models

Samuel, David, Kutuzov, Andrey, Touileb, Samia, Velldal, Erik, Øvrelid, Lilja, Rønningstad, Egil, Sigdel, Elina, Palatkina, Anna

We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

artificial intelligence, machine learning, natural language, (19 more...)

2305.0388

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Europe > Iceland > Capital Region > Reykjavik (0.04)
(16 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Ahmad, Wasi Uddin, Tushar, Md Golam Rahman, Chakraborty, Saikat, Chang, Kai-Wei

AVATAR: A Parallel Corpus for Java-Python Program Translation

arXiv.org Artificial IntelligenceMay-4-2023

Program translation refers to migrating source code from one programming language to another. It has tremendous practical value in software development, as porting software across languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and recently researchers explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small number of labeled examples. Therefore, we present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python. AVATAR is collected from competitive programming sites, online platforms, and open-source repositories. Furthermore, AVATAR includes unit tests for 250 examples to facilitate functional correctness evaluation. We benchmark several pre-trained language models fine-tuned on AVATAR. Experiment results show that the models lack in generating functionally accurate code.

large language model, machine learning, translation, (18 more...)

2108.1159

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Oceania > Australia (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > Dominican Republic (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceMay-4-2023

Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence

Li, Haoran, Xu, Mingshi, Song, Yangqiu

Sentence-level representations are beneficial for various natural language processing tasks. It is commonly believed that vector representations can capture rich linguistic properties. Currently, large language models (LMs) achieve state-of-the-art performance on sentence embedding. However, some recent works suggest that vector representations from LMs can cause information leakage. In this work, we further investigate the information leakage issue and propose a generative embedding inversion attack (GEIA) that aims to reconstruct input sequences based only on their sentence embeddings. Given the black-box access to a language model, we treat sentence embeddings as initial tokens' representations and train or fine-tune a powerful decoder model to decode the whole sequences directly. We conduct extensive experiments to demonstrate that our generative inversion attack outperforms previous embedding inversion attacks in classification metrics and generates coherent and contextually similar sentences as the original inputs.

inversion attack, machine learning, natural language, (21 more...)

2305.0301

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Hong Kong (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(10 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Law (0.93)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)