AITopics | Machine Translation

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

An Open-Source American Sign Language Fingerspell Recognition and Semantic Pose Retrieval Interface

Thomas, Kevin Jose

arXiv.org Artificial Intelligence, Aug-17-2024

This paper introduces an open-source interface for American Sign Language fingerspell recognition and semantic pose retrieval, aimed to serve as a stepping stone towards more advanced sign language translation systems. Utilizing a combination of convolutional neural networks and pose estimation models, the interface provides two modular components: a recognition module for translating ASL fingerspelling into spoken English and a production module for converting spoken English into ASL pose sequences. The system is designed to be highly accessible, user-friendly, and capable of functioning in real-time under varying environmental conditions like backgrounds, lighting, skin tones, and hand sizes. We discuss the technical details of the model architecture, application in the wild, as well as potential future enhancements for real-world consumer applications.

application, interface, sign language, (16 more...)

arXiv: 2408.09311

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Burnaby (0.04)
Europe > Finland > Pirkanmaa > Tampere (0.04)

Genre: Research Report (0.50)

Industry: Education > Curriculum > Subject-Specific Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into LaTeX Formulas for Improved Readability

Jung, Kyudan, Hyeon, Sieun, Kwon, Jeong Youn, Kim, Nam-Joon, Ryu, Hyun Gon, Lee, Hyuk-Jae, Do, Jaeyoung

arXiv.org Artificial Intelligence, Aug-16-2024

Improving the readability of mathematical expressions in text-based documents, such as subtitles of mathematical videos, is a significant task. To achieve this, mathematical expressions should be converted to compiled formulas. For instance, the spoken expression "x equals minus b plus or minus the square root of b squared minus four a c, all over two a" from automatic speech recognition is more readily comprehensible when displayed as a compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. To convert mathematical spoken sentences to compiled formulas, two processes are required: spoken sentences are converted into LaTeX formulas, and LaTeX formulas are converted into compiled formulas. The latter can be managed by using LaTeX engines. However, there is no way to do the former effectively. Even if we try to solve this using language models, there is no paired data between spoken sentences and LaTeX formulas to train them. In this paper, we introduce MathBridge, the first extensive dataset for translating mathematical spoken sentences into LaTeX formulas. MathBridge comprises approximately 23 million LaTeX formulas paired with the corresponding mathematical spoken sentences. Through comprehensive evaluations, including fine-tuning with proposed data, we discovered that MathBridge significantly enhances the capabilities of pretrained language models for converting to LaTeX formulas from mathematical spoken sentences. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating substantial enhancement.

dataset, formula, latex formula, (13 more...)

arXiv: 2408.07081

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Middle East > Malta (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.86)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

PyMarian: Fast Neural Machine Translation and Evaluation in Python

Gowda, Thamme, Grundkiewicz, Roman, Rippeth, Elijah, Post, Matt, Junczys-Dowmunt, Marcin

arXiv.org Artificial Intelligence, Aug-14-2024

The deep learning language of choice these days is Python; measured by factors such as available libraries and technical support, it is hard to beat. At the same time, software written in lower-level programming languages like C++ retains advantages in speed. We describe a Python interface to Marian NMT, a C++-based training and inference toolkit for sequence-to-sequence models, focusing on machine translation. This interface enables models trained with Marian to be connected to the rich, wide range of tools available in Python. A highlight of the interface is the ability to compute state-of-the-art COMET metrics from Python but using Marian's inference engine, with a speedup factor of up to 7.8× over the existing implementations. We also briefly spotlight a number of other integrations, including Jupyter notebooks, connection with prebuilt models, and a web app interface provided with the package. PyMarian is available on PyPI via pip install pymarian.

implementation, machine translation, translation, (10 more...)

arXiv: 2408.11853

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
(5 more...)

Genre: Research Report (0.40)

Industry: Information Technology > Software (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

CMU's IWSLT 2024 Simultaneous Speech Translation System

Xu, Xi, Ouyang, Siqi, Yan, Brian, Fernandes, Patrick, Chen, William, Li, Lei, Neubig, Graham, Watanabe, Shinji

arXiv.org Artificial Intelligence, Aug-14-2024

This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C v2 tst-COMMON set.
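The abstract does not spell out the hold-n policy. As commonly described, hold-n re-decodes after each incoming speech chunk and withholds the last n tokens of the current hypothesis, emitting only the earlier ones. The following is a minimal sketch under that assumption, not the authors' implementation; `hold_n_step`, `simulate`, and the toy hypotheses are hypothetical, and the sketch assumes each new hypothesis extends the already committed prefix.

```python
from typing import List


def hold_n_step(hypothesis: List[str], committed: List[str],
                n: int, source_finished: bool) -> List[str]:
    """Return the tokens that become newly committed at this step.

    Assumption: `hypothesis` was decoded with `committed` forced as its prefix.
    """
    if source_finished:
        # Once the input has ended, nothing needs to be held back.
        stable = hypothesis
    else:
        # Hold back the last n tokens, which may still change with more audio.
        stable = hypothesis[:max(len(hypothesis) - n, 0)]
    # Emit only what lies beyond the already committed prefix.
    return stable[len(committed):]


def simulate(hypotheses: List[List[str]], n: int) -> List[str]:
    """Feed one partial hypothesis per chunk; the last one ends the source."""
    committed: List[str] = []
    for i, hyp in enumerate(hypotheses):
        finished = i == len(hypotheses) - 1
        committed.extend(hold_n_step(hyp, committed, n, finished))
    return committed


if __name__ == "__main__":
    # Toy partial hypotheses, invented for illustration only.
    steps = [
        ["Ich"],
        ["Ich", "habe", "einen"],
        ["Ich", "habe", "einen", "Hund", "gesehen"],
    ]
    print(simulate(steps, n=2))
```

A larger n delays output but lowers the risk of committing tokens that a later chunk would have revised.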

iwslt 2024, speech, translation, (12 more...)

arXiv: 2408.07452

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
Europe > Italy (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Guan, Shuhao, Greene, Derek

arXiv.org Artificial Intelligence, Aug-13-2024

This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
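For readers unfamiliar with the metric, Character Error Rate is usually computed as the character-level Levenshtein distance between the reference and the OCR output, divided by the reference length. The sketch below illustrates that standard definition; the function names and the sample strings are illustrative and not taken from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # A typical OCR confusion: 'm' read as 'rn' (one substitution, one insertion).
    print(cer("modern", "rnodern"))
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.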

correction, dataset, experiment, (17 more...)

arXiv: 2408.02253

Country:

South America > Argentina (0.04)
Europe > Norway > Eastern Norway > Oslo (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Utilize Transformers for translating Wikipedia category names

Ta, Hoang-Thang, La, Quoc Thang

arXiv.org Artificial Intelligence, Aug-12-2024

On Wikipedia, articles are categorized to aid readers in navigating content efficiently. The manual creation of new categories can be laborious and time-intensive. To tackle this issue, we built language models to translate Wikipedia categories from English to Vietnamese with a dataset containing 15,000 English-Vietnamese category pairs. Subsequently, small to medium-scale Transformer pre-trained models with a sequence-to-sequence architecture were fine-tuned for category translation. The experiments revealed that OPUS-MT-en-vi surpassed other models, attaining the highest performance with a BLEU score of 0.73, despite its smaller model storage. We expect our paper to be an alternative solution for translation tasks with limited computer resources.

machine translation, transformer, translation, (15 more...)

arXiv: 2408.06124

Country:

Europe > Norway > Eastern Norway > Oslo (0.05)
Europe > Russia (0.05)
Asia > Vietnam > Lâm Đồng Province > Đà Lạt (0.05)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Language-Informed Beam Search Decoding for Multilingual Machine Translation

Yang, Yilin, Lee, Stefan, Tadepalli, Prasad

arXiv.org Artificial Intelligence, Aug-11-2024

Beam search decoding is the de facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces "off-target" translations, yielding translation outputs not in the intended language. In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm on average improves +1.1 BLEU and +0.9 BLEU on WMT and OPUS datasets, and reduces off-target rates from 22.9% to 7.7% and 65.8% to 25.3% respectively.
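The abstract does not give the exact scoring rule used by LiBS. One plausible way to fold a language-identification signal into beam pruning is to add a weighted LiD log-probability for the target language to each candidate's NMT score before keeping the top candidates. The sketch below illustrates only that general idea and should not be read as the authors' algorithm; `toy_lid_logprob`, its word lists, and `lid_weight` are hypothetical stand-ins for a real LiD model and a tuned hyperparameter.

```python
import math
from typing import Callable, List, Tuple

Candidate = Tuple[List[str], float]  # (tokens, NMT log-probability)


def toy_lid_logprob(tokens: List[str], target_lang: str) -> float:
    """Hypothetical stand-in for an off-the-shelf LiD model."""
    vocab = {
        "de": {"das", "ist", "ein", "haus"},
        "en": {"this", "is", "a", "house"},
    }
    if not tokens:
        return 0.0
    hits = sum(t.lower() in vocab[target_lang] for t in tokens)
    # Add-one smoothing keeps the probability strictly between 0 and 1.
    return math.log((hits + 1) / (len(tokens) + 2))


def prune_beam(candidates: List[Candidate], target_lang: str, beam_size: int,
               lid_weight: float = 1.0,
               lid: Callable[[List[str], str], float] = toy_lid_logprob
               ) -> List[Candidate]:
    """Keep the beam_size candidates with the best combined score."""
    scored = [(tokens, nmt + lid_weight * lid(tokens, target_lang))
              for tokens, nmt in candidates]
    scored.sort(key=lambda c: c[1], reverse=True)
    return scored[:beam_size]


if __name__ == "__main__":
    beam = [(["this", "is", "a"], -1.0), (["das", "ist", "ein"], -1.2)]
    # The English candidate has the better NMT score but is off-target for "de".
    print(prune_beam(beam, target_lang="de", beam_size=1))
```

With `lid_weight=0.0` the function reduces to ordinary beam pruning, which makes the effect of the LiD term easy to compare.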

algorithm, arxiv preprint arxiv, translation, (12 more...)

arXiv: 2408.05738

Country:

Europe > France (0.04)
North America > United States > Oregon (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Paradigm Completion for Derivational Morphology

Cotterell, Ryan, Vylomova, Ekaterina, Khayrallah, Huda, Kirov, Christo, Yarowsky, David

arXiv.org Artificial Intelligence, Aug-9-2024

The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, adapted from the inflection task, are able to learn a range of derivation patterns, and outperform a non-neural baseline by 16.4%. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.

computational linguistic, morphology, proceedings, (10 more...)

arXiv: 1708.09151

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
North America > United States > California > San Diego County > San Diego (0.04)
(12 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

SCOI: Syntax-augmented Coverage-based In-context Example Selection for Machine Translation

Tang, Chenming, Wang, Zhixiang, Wu, Yunfang

arXiv.org Artificial Intelligence, Aug-9-2024

In-context learning (ICL) greatly improves the performance of large language models (LLMs) on various down-stream tasks, where the improvement highly depends on the quality of demonstrations. In this work, we introduce syntactic knowledge to select better in-context examples for machine translation (MT). We propose a new strategy, namely Syntax-augmented COverage-based In-context example selection (SCOI), leveraging the deep syntactic structure beyond conventional word matching. Specifically, we measure the set-level syntactic coverage by computing the coverage of polynomial terms with the help of a simplified tree-to-polynomial algorithm, and lexical coverage using word overlap. Furthermore, we devise an alternate selection approach to combine both coverage measures, taking advantage of syntactic and lexical information. We conduct experiments with two multi-lingual LLMs on six translation directions. Empirical results show that our proposed SCOI obtains the highest average COMET score among all learning-free methods, indicating that combining syntactic and lexical coverage successfully helps to select better in-context examples for MT.
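The lexical half of the coverage idea can be illustrated with plain word overlap: a set of examples covers a test sentence to the extent that the union of their words contains the test sentence's words, and examples can be picked greedily to raise that coverage. The sketch below covers only this lexical part, under simplifying assumptions (whitespace tokenization, greedy selection). It omits the syntactic tree-to-polynomial coverage and the alternating selection described in the abstract, and the function names are hypothetical.

```python
from typing import List


def lexical_coverage(test_sentence: str, selected: List[str]) -> float:
    """Fraction of the test sentence's distinct words found in the selected set."""
    test_words = set(test_sentence.lower().split())
    if not test_words:
        return 0.0
    covered = set()
    for example in selected:
        covered |= set(example.lower().split()) & test_words
    return len(covered) / len(test_words)


def greedy_select(test_sentence: str, pool: List[str], k: int) -> List[str]:
    """Pick up to k examples, each time maximizing set-level coverage."""
    selected: List[str] = []
    remaining = list(pool)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda ex: lexical_coverage(test_sentence, selected + [ex]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pool = ["the cat sleeps", "a dog sat on the mat", "birds fly"]
    chosen = greedy_select("the cat sat on the mat", pool, k=2)
    print(chosen, lexical_coverage("the cat sat on the mat", chosen))
```

Measuring coverage at the set level, as opposed to scoring each example independently, rewards examples that add words the earlier picks missed.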

computational linguistic, proceedings, translation, (16 more...)

arXiv: 2408.04872

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(6 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Government > Voting & Elections (0.46)
Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs

Oshika, Masashi, Morishita, Makoto, Hirao, Tsutomu, Sasano, Ryohei, Takeda, Koichi

arXiv.org Artificial Intelligence, Aug-8-2024

In recent years, neural machine translation (NMT) has been widely used in everyday life. However, current NMT lacks a mechanism to adjust the difficulty level of translations to match the user's language level. Additionally, due to the bias in the training data for NMT, translations of simple source sentences are often produced with complex words. In particular, this could pose a problem for children, who may not be able to understand the meaning of the translations correctly. In this study, we propose a method that replaces words with a high Age of Acquisition (AoA) in translations with simpler words to match the translations to the user's level. We achieve this by using large language models (LLMs), providing a triple of a source sentence, a translation, and a target word to be replaced. We create a benchmark dataset using back-translation on Simple English Wikipedia. The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words and, moreover, can iteratively replace most of the high-AoA words while still maintaining high BLEU and COMET scores.
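The first step of such a pipeline, finding which words in a translation exceed the reader's level, can be sketched as a lookup against an AoA lexicon. The example below is a minimal illustration, not the authors' code: the AoA values are invented, `find_high_aoa_words` is a hypothetical helper, and the LLM-based replacement step is left out.

```python
from typing import Dict, List, Tuple


def find_high_aoa_words(translation: str, aoa: Dict[str, float],
                        max_age: float) -> List[Tuple[str, float]]:
    """Return (word, AoA) pairs above max_age, hardest first.

    Words missing from the lexicon are skipped, not guessed.
    """
    flagged = []
    for token in translation.split():
        word = token.strip(".,;:!?\"'").lower()
        age = aoa.get(word)
        if age is not None and age > max_age:
            flagged.append((word, age))
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged


if __name__ == "__main__":
    # Invented AoA values for illustration only; a real lexicon would be loaded here.
    toy_aoa = {"purchase": 9.5, "buy": 4.0, "dog": 3.0, "decided": 6.5}
    print(find_high_aoa_words("She decided to purchase a dog.", toy_aoa, max_age=8.0))
```

In an iterative setup, each flagged word would be handed to the model as the target word, and the check would be repeated on the revised translation until no word exceeds the threshold.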

artificial intelligence, large language model, natural language, (14 more...)

arXiv: 2408.04217

Country: North America > United States (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
