AITopics

2212.01641

Country:

Asia > British Indian Ocean Territory > Diego Garcia (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

#artificialintelligence Dec-2-2022, 20:39:51 GMT

The State of AI Language Translation & What The Future Holds - Big Data Analytics News

Artificial intelligence (AI) continuously wows or terrifies us, but there's no denying that AI will play an essential role in human development over the next decade. Machine translation, which has been around since the 1950s, will soon make great strides thanks to AI technologies. AI language translation is rooted in machine translation, a specialized technology that translates text without human assistance. While machine translation came first, AI translation and its supporting technology were developed side by side, each aiding the other's progress. That means that speech-to-text and the software that supports it have a symbiotic relationship.

ai language translation, big data analytic news, translation, (12 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Edman, Lukas, Toral, Antonio, van Noord, Gertjan

Subword-Delimited Downsampling for Better Character-Level Translation

arXiv.org Artificial Intelligence Dec-2-2022

Subword-level models have been the dominant paradigm in NLP. However, character-level models have the benefit of seeing each character individually, providing the model with more detailed information that ultimately could lead to better models. Recent works have shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this, but at the cost of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method which is informed by subwords. This new downsampling method not only outperforms existing downsampling methods, showing that downsampling characters can be done without sacrificing quality, but also leads to promising performance compared to subword models for translation.
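As an illustration of downsampling informed by subwords, the following is a minimal sketch that pools character vectors within each subword span. The function name, the mean-pooling operator, and the toy segmentation are assumptions made for this example; the abstract does not specify how the paper's method pools characters.

```python
from typing import List


def downsample_by_subwords(char_vectors: List[List[float]],
                           subword_lengths: List[int]) -> List[List[float]]:
    """Mean-pool character vectors inside each subword span (assumed pooling)."""
    if any(length <= 0 for length in subword_lengths):
        raise ValueError("every subword must contain at least one character")
    if sum(subword_lengths) != len(char_vectors):
        raise ValueError("subword lengths must cover every character exactly once")
    pooled, start = [], 0
    for length in subword_lengths:
        span = char_vectors[start:start + length]
        dim = len(span[0])
        # One output vector per subword: the average of its character vectors.
        pooled.append([sum(vec[d] for vec in span) / length for d in range(dim)])
        start += length
    return pooled


# Hypothetical segmentation of "lowering" (8 characters) into "low" + "er" + "ing".
chars = [[float(i), 1.0] for i in range(8)]
pooled = downsample_by_subwords(chars, [3, 2, 3])
print(len(chars), "character vectors ->", len(pooled), "pooled vectors")
```

The sequence length drops from the number of characters to the number of subwords, which is the source of the time and computation savings the abstract refers to.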

artificial intelligence, machine learning, natural language, (18 more...)

2212.01304

Country: North America > United States > Pennsylvania (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.51)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

Tarrés, Laia, Gàllego, Gerard I., Giró-i-Nieto, Xavier, Torres, Jordi

Tackling Low-Resourced Sign Language Translation: UPC at WMT-SLT 22

arXiv.org Artificial Intelligence Dec-2-2022

This paper describes the system developed at the Universitat Politècnica de Catalunya for the Workshop on Machine Translation 2022 Sign Language Translation Task, in particular for the sign-to-text direction. We use a Transformer model implemented with the Fairseq modeling toolkit. We have experimented with the vocabulary size, data augmentation techniques, and pretraining the model with the PHOENIX-14T dataset. Our system obtains a BLEU score of 0.50 on the test set, improving the organizers' baseline by 0.38 BLEU. We note the poor results for both the baseline and our system and, thus, the unreliability of our findings.

artificial intelligence, machine translation, natural language, (16 more...)

2212.0114

Country:

Europe > Spain (0.14)
Europe > Switzerland > Aargau > Aarau (0.04)
North America > United States > Pennsylvania (0.04)
(5 more...)

Genre: Research Report > New Finding (0.34)

Industry: Education > Curriculum > Subject-Specific Education (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial Intelligence Dec-2-2022

Improving Simultaneous Machine Translation with Monolingual Data

Deng, Hexuan, Ding, Liang, Liu, Xuebo, Zhang, Meishan, Tao, Dacheng, Zhang, Min

Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT by training a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT that considers both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional NMT monolingual sampling strategies) by avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and code can be found at https://github.com/hexuandeng/Mono4SiMT.
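A rough sketch of how a Seq-KD training corpus with monolingual data might be assembled is shown below. The function name and the toy `teacher_translate` callable are hypothetical stand-ins for a full-sentence NMT teacher, and distilling the bilingual sources as well as the monolingual ones is an assumption of this sketch rather than a detail stated in the abstract. The paper's sampling strategy is not modeled here.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def build_student_corpus(bilingual: List[Pair],
                         monolingual_sources: List[str],
                         teacher_translate: Callable[[str], str]) -> List[Pair]:
    """Pair every source sentence with the teacher's output (sequence-level KD)."""
    # Seq-KD trains the student on teacher translations instead of references.
    distilled_bilingual = [(src, teacher_translate(src)) for src, _ in bilingual]
    # Monolingual source sentences have no reference, so the teacher supplies one.
    distilled_mono = [(src, teacher_translate(src)) for src in monolingual_sources]
    return distilled_bilingual + distilled_mono


def toy_teacher(sentence: str) -> str:
    """Hypothetical teacher: upper-casing stands in for real translation."""
    return sentence.upper()


corpus = build_student_corpus(
    bilingual=[("hello world", "reference translation")],
    monolingual_sources=["good morning", "see you soon"],
    teacher_translate=toy_teacher,
)
for source, target in corpus:
    print(source, "->", target)
```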

artificial intelligence, monolingual data, natural language, (14 more...)

2212.01188

Country:

South America (0.05)
North America > Belize (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Popel, Martin, Libovický, Jindřich, Helcl, Jindřich

CUNI Systems for the WMT22 Czech-Ukrainian Translation Task

We present Charles University submissions to the WMT22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that romanization has only a minor effect on translation quality. Further, we describe Charles Translator, a system that was developed in March 2022 as a response to the migration from Ukraine to the Czech Republic. Unlike our constrained systems, it did not use romanization and used some proprietary data sources.
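Tagged back-translation, as generally described in the literature, marks synthetic source sentences with a reserved token so the model can tell them apart from authentic data. The sketch below shows that idea only; the `<BT>` tag string, the function names, and the toy backward model are assumptions for illustration and are not taken from the CUNI systems.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def make_training_pairs(authentic: List[Pair],
                        monolingual_target: List[str],
                        backward_translate: Callable[[str], str],
                        tag: str = "<BT>") -> List[Pair]:
    """Combine authentic pairs with tagged back-translated pairs."""
    synthetic = [
        # The source side is machine-generated, so it is prefixed with the tag;
        # the target side is the original human-written sentence.
        (f"{tag} {backward_translate(target)}", target)
        for target in monolingual_target
    ]
    return list(authentic) + synthetic


def toy_backward_model(sentence: str) -> str:
    """Hypothetical target-to-source model: reverses word order as a placeholder."""
    return " ".join(reversed(sentence.split()))


pairs = make_training_pairs(
    authentic=[("source sentence", "target sentence")],
    monolingual_target=["one two three"],
    backward_translate=toy_backward_model,
)
print(pairs)
```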

artificial intelligence, machine learning, natural language, (15 more...)

2212.00486

Country:

Europe > Ukraine (0.26)
North America > United States > California > Los Angeles County > Long Beach (0.14)
Europe > Germany > Saxony > Leipzig (0.05)
(8 more...)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Swati, Swati, Grobelnik, Adrian Mladenić, Mladenić, Dunja, Grobelnik, Marko

A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines

Predicting the political polarity of news headlines is a challenging task that becomes even more challenging in a multilingual setting with low-resource languages. To deal with this, we propose a learning framework that utilises Inferential Commonsense Knowledge via a Translate-Retrieve-Translate strategy. To begin with, we use translation and retrieval to acquire the inferential knowledge in the target language. We then employ an attention mechanism to emphasise important inferences. We finally integrate the attended inferences into a multilingual pre-trained language model for the task of bias prediction. To evaluate the effectiveness of our framework, we present a dataset of over 62.6K multilingual news headlines in five European languages annotated with their respective political polarities. We evaluate several state-of-the-art multilingual pre-trained language models since their performance tends to vary across languages (low/high resource). Evaluation results demonstrate that our proposed framework is effective regardless of the models employed. Overall, the best performing model trained with only headlines shows 0.90 accuracy and F1, and a 0.83 Jaccard score. With attended knowledge in our framework, the same model shows an increase of 2.2% in accuracy and F1, and 3.6% in Jaccard score. Extending our experiments to individual languages reveals that the models we analyze perform significantly worse for Slovenian than for other languages in our dataset. To investigate this, we assess the effect of translation quality on prediction performance. The analysis indicates that the disparity in performance is most likely due to poor translation quality. We release our dataset and scripts at: https://github.com/Swati17293/KG-Multi-Bias for future research. Our framework has the potential to benefit journalists, social scientists, news producers, and consumers.
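To make the attention step concrete, here is a minimal sketch of attention pooling over inference vectors. The dot-product scoring function, the function name, and the toy vectors are assumptions chosen for illustration; the abstract does not state which attention formulation the framework uses.

```python
import math
from typing import List, Tuple


def attend(query: List[float],
           inferences: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return softmax attention weights and the weighted sum of inference vectors."""
    # Assumed scoring function: dot product between headline and inference vectors.
    scores = [sum(q * x for q, x in zip(query, inference)) for inference in inferences]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    dim = len(inferences[0])
    pooled = [sum(w * inference[d] for w, inference in zip(weights, inferences))
              for d in range(dim)]
    return weights, pooled


# Toy headline vector and three toy inference vectors (hypothetical values).
weights, pooled = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]])
print("weights:", [round(w, 3) for w in weights])
```

Inferences that align more closely with the headline vector receive larger weights, so they contribute more to the pooled representation passed to the downstream classifier.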

knowledge management, machine learning, natural language, (19 more...)

2212.00298

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
(5 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.87)

Industry:

Media > News (1.00)
Health & Medicine (1.00)
Information Technology > Security & Privacy (0.92)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(2 more...)

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task

Helcl, Jindřich

We present a non-autoregressive system submission to the WMT 22 Efficient Translation Shared Task. Our system was used by Helcl et al. (2022) in an attempt to provide a fair comparison between non-autoregressive and autoregressive models. This submission is an effort to establish solid baselines along with sound evaluation methodology, particularly in terms of measuring the decoding speed. The model itself is a 12-layer Transformer model trained with connectionist temporal classification on a dataset knowledge-distilled by a strong autoregressive teacher model.
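For readers unfamiliar with connectionist temporal classification (CTC), the sketch below shows the standard collapsing rule applied to per-position predictions: consecutive repeats are merged, then blank symbols are removed. The function name and the `<blank>` token string are assumptions for this example, and the sketch covers greedy output collapsing only, not the training loss or the submitted system's decoder.

```python
from itertools import groupby
from typing import List


def ctc_collapse(frame_labels: List[str], blank: str = "<blank>") -> List[str]:
    """Merge consecutive duplicate labels, then drop blank symbols."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# A blank between two identical labels keeps both of them in the output.
frames = ["a", "a", "<blank>", "a", "b", "b", "<blank>"]
print(ctc_collapse(frames))
```

Because every output position is predicted in parallel and then collapsed, the model does not need to generate tokens one at a time, which is what makes this setup non-autoregressive.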

artificial intelligence, natural language, translation, (14 more...)

2212.00477

Country:

Oceania > Australia > Victoria > Melbourne (0.05)
Europe > Belgium > Brussels-Capital Region > Brussels (0.05)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.05)
(6 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.59)

Long-Document Cross-Lingual Summarization

Zheng, Shaohui, Li, Zhixu, Wang, Jiaan, Qu, Jianfeng, Liu, An, Zhao, Lei, Chen, Zhigang

Cross-Lingual Summarization (CLS) aims at generating summaries in one language for given documents in another language. CLS has attracted wide research attention due to its practical significance in the multi-lingual world. Though great contributions have been made, existing CLS works typically focus on short documents, such as news articles, short dialogues and guides. Unlike these short texts, long documents such as academic articles and business reports usually discuss complicated subjects and consist of thousands of words, making them non-trivial to process and summarize. To promote CLS research on long documents, we construct Perseus, the first long-document CLS dataset, which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens. As a preliminary study on long-document CLS, we build and evaluate various CLS baselines, including pipeline and end-to-end methods. Experimental results on Perseus show the superiority of the end-to-end baseline, which outperforms the strong pipeline models equipped with sophisticated machine translation systems. Furthermore, to provide a deeper understanding, we manually analyze the model outputs and discuss specific challenges faced by current approaches. We hope that our work could benchmark long-document CLS and benefit future studies.

artificial intelligence, machine learning, natural language, (19 more...)

2212.00586

Country:

Asia > China > Hubei Province > Wuhan (0.05)
Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial Intelligence Nov-30-2022

Word Alignment in the Era of Deep Learning: A Tutorial

Li, Bryan

The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
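As a small illustration of unsupervised statistical word alignment, the sketch below trains IBM Model 1 with expectation-maximization on a toy corpus. IBM Model 1 is among the models that GIZA++ implements, but this sketch is a simplification: it omits the NULL source word and the higher-order models, and the function names and three-sentence corpus are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def train_ibm_model1(pairs: List[SentencePair],
                     iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(target_word, source_word)] ~ P(target_word | source_word) with EM."""
    target_vocab = {word for _, target in pairs for word in target}
    t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts under the current parameters.
        for source, target in pairs:
            for tw in target:
                norm = sum(t[(tw, sw)] for sw in source)
                for sw in source:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction
        # M-step: renormalize the expected counts per source word.
        for (tw, sw), value in count.items():
            t[(tw, sw)] = value / total[sw]
    return t


def align(source: List[str], target: List[str],
          t: Dict[Tuple[str, str], float]) -> List[int]:
    """For each target word, return the index of its most probable source word."""
    return [max(range(len(source)), key=lambda i: t[(tw, source[i])]) for tw in target]


corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
print(align(["das", "Haus"], ["the", "house"], table))
```

Although no word-level supervision is given, co-occurrence statistics across the sentence pairs are enough for the model to separate "das"/"the" from "Haus"/"house" on this toy data.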

artificial intelligence, machine learning, natural language, (14 more...)

2212.00138

Country:

Africa > Middle East > Egypt > Giza Governorate > Giza (0.46)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(34 more...)

Genre:

Overview (0.93)
Workflow (0.93)
Instructional Material > Course Syllabus & Notes (0.84)
Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)