AITopics

2204.0761

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Norway > Central Norway > Trøndelag > Trondheim (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(4 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)

Salemi, Alireza, Abaskohi, Amirhossein, Tavakoli, Sara, Yaghoobzadeh, Yadollah, Shakery, Azadeh

PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

arXiv.org Artificial IntelligenceApr-14-2023

Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.

objective, peach, translation, (15 more...)

2304.01282

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Liguori, Pietro, Improta, Cristina, Natella, Roberto, Cukic, Bojan, Cotroneo, Domenico

Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators

arXiv.org Artificial IntelligenceApr-13-2023

AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. However, it is not clear what metric to use, and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics on two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their descriptions in the English language. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.

machine learning, natural language, programming language, (19 more...)

doi: 10.1016/j.eswa.2023.120073

2212.06008

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
(16 more...)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceApr-13-2023

PePe: Personalized Post-editing Model utilizing User-generated Post-edits

Lee, Jihyeon, Kim, Taehee, Tae, Yunwon, Park, Cheonbok, Choo, Jaegul

Incorporating personal preference is crucial in advanced machine translation tasks. Despite the recent advancement of machine translation, it remains a demanding task to properly reflect personal style. In this paper, we introduce a personalized automatic post-editing framework to address this challenge, which effectively generates sentences considering distinct Figure 1: Example of a personal post-editing triplet personal behaviors. To build this framework, (i.e., source (src), machine translation (mt), and postedit we first collect post-editing data that connotes (pe)) given the source text in English and the translated the user preference from a live machine translation text in Korean. A post-edited sentence does not system. Specifically, real-world users enter only contain error correction of an initial machine translation source sentences for translation and edit result but also reflects individual preference. For the machine-translated outputs according to instance, a human post-editor modifies the word "primarily" the user's preferred style. We then propose to "primary," but also change " 공헌 " to its synonym a model that combines a discriminator module " 기여 " while keeping the rest as it is (e.g., "research").

artificial intelligence, natural language, translation, (18 more...)

2209.10139

Country:

North America > United States (0.14)
Asia > Singapore (0.04)
Asia > Malaysia (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Learning Homographic Disambiguation Representation for Neural Machine Translation

Wang, Weixuan, Peng, Wei, Liu, Qun

Homographs, words with the same spelling but different meanings, remain challenging in Neural Machine Translation (NMT). While recent works leverage various word embedding approaches to differentiate word sense in NMT, they do not focus on the pivotal components in resolving ambiguities of homographs in NMT: the hidden states of an encoder. In this paper, we propose a novel approach to tackle homographic issues of NMT in the latent space. We first train an encoder (aka "HDR-encoder") to learn universal sentence representations in a natural language inference (NLI) task. We further fine-tune the encoder using homograph-based synset sentences from WordNet, enabling it to learn word-level homographic disambiguation representations (HDR). The pre-trained HDR-encoder is subsequently integrated with a transformer-based NMT in various schemes to improve translation accuracy. Experiments on four translation directions demonstrate the effectiveness of the proposed method in enhancing the performance of NMT systems in the BLEU scores (up to +2.3 compared to a solid baseline). The effects can be verified by other metrics (F1, precision, and recall) of translation accuracy in an additional disambiguation task. Visualization methods like heatmaps, T-SNE and translation examples are also utilized to demonstrate the effects of the proposed method.

artificial intelligence, machine translation, natural language, (15 more...)

2304.0586

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
Europe > Czechia > Prague (0.04)
Asia > North Korea (0.04)
(11 more...)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Robertson, Samantha, Wang, Zijie J., Moritz, Dominik, Kery, Mary Beth, Hohman, Fred

Angler: Helping Machine Translation Practitioners Prioritize Model Improvements

Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error's nature, scope, and impact on users. We built on this insight in a case study with machine translation models, and developed Angler, an interactive visual analytics tool to help practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite, and obtaining reliable signals of model quality is expensive. Our study revealed that participants could form more interesting and user-focused hypotheses for prioritization by analyzing quantitative summary statistics and qualitatively assessing data by reading sentences.

data mining, machine learning, natural language, (18 more...)

doi: 10.1145/3544548.3580790

2304.05967

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Europe > Germany > Hamburg (0.05)
Europe > Czechia > Prague (0.04)
(11 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.67)
Research Report > Promising Solution (0.46)

Industry:

Health & Medicine (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Winata, Genta Indra, Aji, Alham Fikri, Cahyawijaya, Samuel, Mahendra, Rahmad, Koto, Fajri, Romadhony, Ade, Kurniawan, Kemal, Moeljadi, David, Prasojo, Radityo Eko, Fung, Pascale, Baldwin, Timothy, Lau, Jey Han, Sennrich, Rico, Ruder, Sebastian

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

computational linguistic, machine learning, natural language, (19 more...)

2205.1596

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(30 more...)

Genre: Research Report (0.82)

Industry: Education > Educational Setting (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Sharma, Mandar, Gogineni, Ajay, Ramakrishnan, Naren

Innovations in Neural Data-to-text Generation: A Survey

The neural boom that has sparked natural language processing (NLP) research through the last decade has similarly led to significant innovations in data-to-text generation (DTG). This survey offers a consolidated view into the neural DTG paradigm with a structured examination of the approaches, benchmark datasets, and evaluation protocols. This survey draws boundaries separating DTG from the rest of the natural language generation (NLG) landscape, encompassing an up-to-date synthesis of the literature, and highlighting the stages of technological adoption from within and outside the greater NLG umbrella. With this holistic view, we highlight promising avenues for DTG research that not only focus on the design of linguistically capable systems but also systems that exhibit fairness and accountability.

large language model, machine learning, natural language, (18 more...)

2207.12571

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Kansas (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(7 more...)

#artificialintelligenceApr-11-2023, 09:50:26 GMT

A Closer Look At Translation Hub: Enterprise Translation Made Easy - Liwaiwai

In this article, we'll take a closer look at Translation Hub's powerful features, and the ways it is helping customers do more with their content. Translation Hub is a fully-managed, self-serve translation offering, powered by Google AI and built for the enterprise. With Translation Hub, businesses can instantaneously translate content into 135 languages with a single click, via an intuitive interface that integrates human reviews (i.e., a "human in the loop") where required. Organizations need to be able to share the output of AI-powered translation with localization teams or agencies, for review. They need to save time by leveraging glossaries or customer machine learning (ML) models.

enterprise translation, liwaiwai, translation hub, (5 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.59)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.56)

Neural Information Processing SystemsApr-10-2023, 11:52:39 GMT

Attention-Based Models for Speech Recognition

Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1, 2] and image caption generation [3]. We extend the attention-mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in [2] reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to 17.6% level.

attention mechanism, mechanism, utterance, (15 more...)

Neural Information Processing Systems

Country:

North America > Canada > Quebec > Montreal (0.05)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
Europe > Germany > Bremen > Bremen (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.87)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)