AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

An A.I. Translation Tool Can Help Save Dying Languages. But at What Cost?

SlateJan-16-2023, 10:50:00 GMT

Sanjib Chaudhary chanced upon StoryWeaver, a multilingual children's storytelling platform, while searching for books he could read to his 7-year-old daughter. Chaudhary's mother tongue is Kochila Tharu, a language with about 250,000 speakers in eastern Nepal. Languages with a relatively small number of speakers, like Kochila Tharu, do not have enough digitized material for linguistic communities to thrive--no Google Translate, no film or television subtitles, no online newspapers. In industry parlance, these languages are "underserved" and "underresourced." This is where StoryWeaver comes in.

artificial intelligence, natural language, storyweaver, (18 more...)

Slate

AI-Alerts: 2023 > 2023-01 > AAAI AI-Alert for Jan 17, 2023 (1.00)

Country: Asia > Nepal (0.35)

Industry: Media > News (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU)

Upadhyay, Ankit Kumar, Upadhya, Harsit Kumar

arXiv.org Artificial IntelligenceJan-16-2023

Natural Language Processing systems are heavily dependent on the availability of annotated data to train practical models. Primarily, models are trained on English datasets. In recent times, significant advances have been made in multilingual understanding due to the steeply increasing necessity of working in different languages. One of the points that stands out is that since there are now so many pre-trained multilingual models, we can utilize them for cross-lingual understanding tasks. Using cross-lingual understanding and Natural Language Inference, it is possible to train models whose applications extend beyond the training language. We can leverage the power of machine translation to skip the tiresome part of translating datasets from one language to another. In this work, we focus on improving the original XNLI dataset by re-translating the MNLI dataset in all of the 14 different languages present in XNLI, including the test and dev sets of XNLI using Google Translate. We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference. We then expand our boundary to investigate if we could improve performance in low-resource languages such as Swahili and Urdu by training models in languages other than English.

artificial intelligence, machine translation, natural language, (15 more...)

arXiv.org Artificial Intelligence

2301.06527

Country:

Asia > India > Karnataka > Bengaluru (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

NLP Startup Funding in 2022. It's no secret that the commercial…

#artificialintelligenceJan-13-2023, 08:45:27 GMT

It's no secret that the commercial application of NLP technologies has exploded in recent years. From chatbots and virtual assistants to machine translation and sentiment analysis, NLP technologies are now being used in a wide variety of applications across a range of industries. With the increasing demand for technologies that can process human language, investors have been eager to get a piece of the action. In this article, we look at NLP start-up funding over the past year, identifying the applications and domains that have received investment. A version of this article will appear in the Journal of Natural Language Engineering in early 2023.

machine learning, natural language, seed round, (23 more...)

#artificialintelligence

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Banking & Finance (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Music Playlist Title Generation Using Artist Information

Kim, Haven, Doh, SeungHeon, Lee, Junwon, Nam, Juhan

arXiv.org Artificial IntelligenceJan-13-2023

Automatically generating or captioning music playlist titles given a set of tracks is of significant interest in music streaming services as customized playlists are widely used in personalized music recommendation, and well-composed text titles attract users and help their music discovery. We present an encoder-decoder model that generates a playlist title from a sequence of music tracks. While previous work takes track IDs as tokenized input for playlist title generation, we use artist IDs corresponding to the tracks to mitigate the issue from the long-tail distribution of tracks included in the playlist dataset. Also, we introduce a chronological data split method to deal with newly-released tracks in real-world scenarios. Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2301.08145

Country: North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

Add feedback

Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

Al-Kaswan, Ali, Ahmed, Toufique, Izadi, Maliheh, Sawant, Anand Ashok, Devanbu, Premkumar, van Deursen, Arie

arXiv.org Artificial IntelligenceJan-13-2023

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help Reverse Engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components; the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend future research to further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2301.01701

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Yolo County > Davis (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(7 more...)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

DeepL targets AI translation for enterprises with fresh $100 million

#artificialintelligenceJan-12-2023, 10:15:14 GMT

Check out all the on-demand sessions from the Intelligent Security Summit here. Seeking to target enterprise customers with AI language translation, Cologne, Germany-based DeepL announced a new funding raise that public reports estimate at well over $100 million. Language translation is an increasingly critical function for enterprises working across geographies and different demographics. Basic language translation capabilities have been available on for decades -- for example, services such as Google Translate. But the challenge has been enabling more advanced translation for business use cases that capture not just the literal meaning but the right tone and context.

artificial intelligence, machine translation, translation, (3 more...)

#artificialintelligence

Country: Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.29)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities

Salar, Sazan, Hassani, Hossein

arXiv.org Artificial IntelligenceJan-12-2023

Named Entity Recognition (NER) is one of the essential applications of Natural Language Processing (NLP). It is also an instrument that plays a significant role in many other NLP applications, such as Machine Translation (MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is an under-resourced language from the NLP perspective. Particularly, in all the categories, the lack of NER resources hinders other aspects of Kurdish processing. In this work, we present a data set that covers several categories of NEs in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit). It covers 11 categories and 33261 entries in total. The dataset is publicly available for non-commercial use under CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/.

artificial intelligence, natural language, text processing, (18 more...)

arXiv.org Artificial Intelligence

2301.04962

Country:

Asia > Middle East > Iraq > Kurdistan Region > Duhok Governorate > Duhok (0.07)
Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.07)

Genre: Research Report (0.40)

Industry: Government > Regional Government (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.92)

Add feedback

User-Centered Security in Natural Language Processing

Emmery, Chris

arXiv.org Artificial IntelligenceJan-10-2023

This dissertation proposes a framework of user-centered security in Natural Language Processing (NLP), and demonstrates how it can improve the accessibility of related research. Accordingly, it focuses on two security domains within NLP with great public interest. First, that of author profiling, which can be employed to compromise online privacy through invasive inferences. Without access and detailed insight into these models' predictions, there is no reasonable heuristic by which Internet users might defend themselves from such inferences. Secondly, that of cyberbullying detection, which by default presupposes a centralized implementation; i.e., content moderation across social platforms. As access to appropriate data is restricted, and the nature of the task rapidly evolves (both through lexical variation, and cultural shifts), the effectiveness of its classifiers is greatly diminished and thereby often misrepresented. Under the proposed framework, we predominantly investigate the use of adversarial attacks on language; i.e., changing a given input (generating adversarial samples) such that a given model does not function as intended. These attacks form a common thread between our user-centered security problems; they are highly relevant for privacy-preserving obfuscation methods against author profiling, and adversarial samples might also prove useful to assess the influence of lexical variation and augmentation on cyberbullying detection.

large language model, machine learning, text classification, (21 more...)

arXiv.org Artificial Intelligence

2301.0423

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Maryland > Baltimore (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(105 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.93)
Instructional Material > Course Syllabus & Notes (0.67)
(2 more...)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(7 more...)

Add feedback

Unsupervised Mandarin-Cantonese Machine Translation

Dare, Megan, Diaz, Valentina Fajardo, So, Averie Ho Zoen, Wang, Yifan, Zhang, Shibingfeng

arXiv.org Artificial IntelligenceJan-10-2023

Advancements in unsupervised machine translation have enabled the development of machine translation systems that can translate between languages for which there is not an abundance of parallel data available. We explored unsupervised machine translation between Mandarin Chinese and Cantonese. Despite the vast number of native speakers of Cantonese, there is still no large-scale corpus for the language, due to the fact that Cantonese is primarily used for oral communication. The key contributions of our project include: 1. The creation of a new corpus containing approximately 1 million Cantonese sentences, and 2. A large-scale comparison across different model architectures, tokenization schemes, and embedding structures. Our best model trained with character-based tokenization and a Transformer architecture achieved a character-level BLEU of 25.1 when translating from Mandarin to Cantonese and of 24.4 when translating from Cantonese to Mandarin. In this paper we discuss our research process, experiments, and results.

cantonese, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2301.03971

Country:

Asia > China > Guangdong Province (0.14)
Asia > China > Hong Kong (0.07)
Europe > Germany > Saarland (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation

Korakakis, Michalis, Vlachos, Andreas

arXiv.org Artificial IntelligenceJan-10-2023

Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT'14 and WMT'14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.

computational linguistic, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2109.06308

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(20 more...)

Genre:

Research Report (0.64)
Instructional Material (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback