AITopics | Machine Translation

Collaborating Authors

Machine Translation

"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains."
– Definition from the European Association for Machine Translation (EAMT).

You can translate text of your choice by using free translators such as: CAPITA, Google Translate, SDL International, SYSTRAN.

News Overviews Instructional Materials AI-Alerts Classics

From Priest to Doctor: Domain Adaptaion for Low-Resource Neural Machine Translation

Marashian, Ali, Rice, Enora, Gessler, Luke, Palmer, Alexis, von der Wense, Katharina

arXiv.org Artificial IntelligenceDec-1-2024

Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.

artificial intelligence, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2412.00966

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)
Europe > Iceland (0.04)
(18 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Government (0.68)
Law (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data

Hahn, Udo

arXiv.org Artificial IntelligenceNov-29-2024

We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany these resources, with only few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for authentic clinical documents. Common instances of close proxies are medical journal publications, clinical therapy guidelines, drug labels, etc., more distant proxies include online encyclopedic medical articles or medical contents from social media channels. After PRISM-conformant screening of 359 hits from four bibliographic systems, 75 relevant documents were finally selected for this review and 59 distinct corpora were determined. We identified 24 real clinical corpora (from 40 publications) out of which only 5 are publicly distributable. 2 translations of real corpora and 3 synthetic ones complement the set of clinical corpora. 14 corpora were categorized as close domain proxies, 16 as distant ones. There is a clear divide between the large number of non-accessible authentic clinical German-language corpora and their publicly accessible substitutes: translated or synthetic, close or more distant proxies. So on first sight, the data bottleneck seems broken. Intuitively yet, differences in genre-specific writing style, wording and medical domain expertise in this typological space are also obvious. This raises the question how valid alternative corpus designs really are.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.0023

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
(35 more...)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.88)

Industry:

Media (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
(3 more...)

Add feedback

Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation

Billah, Syed Mohammed Mostaque, Subarna, Ateya Ahmed, Sarna, Sudipta Nandi, Wasit, Ahmad Shawkat, Fariha, Anika, Sushmit, Asif, Sadeque, Arig Yousuf

arXiv.org Artificial IntelligenceNov-29-2024

Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, positioning it as nearly the third most commonly used Austroasiatic language. Despite its prominence among the Austroasiatic language family's Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to include Santali to the NPL spectrum. We aim to examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addressed the low-resource problem and, with promising results, examined the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that Santali-English parallel corpus performs better when in transformers like mt5 as opposed to untrained transformers, proving that transfer learning can be a viable technique that works with Santali language. Besides the mT5 transformer, Santali-English performs better than Santali-Bangla parallel corpus as the mT5 has been trained in way more English data than Bangla data. Lastly, our study shows that with data augmentation, our model performs better.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.19726

Country:

Asia > Bangladesh (0.25)
Asia > Nepal (0.24)
Asia > India (0.24)
(6 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

The brain versus AI: World-model-based versatile circuit computation underlying diverse functions in the neocortex and cerebellum

Ohmae, Shogo, Ohmae, Keiko

arXiv.org Artificial IntelligenceNov-29-2024

AI's significant recent advances using general-purpose circuit computations offer a potential window into how the neocortex and cerebellum of the brain are able to achieve a diverse range of functions across sensory, cognitive, and motor domains, despite their uniform circuit structures. However, comparing the brain and AI is challenging unless clear similarities exist, and past reviews have been limited to comparison of brain-inspired vision AI and the visual neocortex. Here, to enable comparisons across diverse functional domains, we subdivide circuit computation into three elements -- circuit structure, input/outputs, and the learning algorithm -- and evaluate the similarities for each element. With this novel approach, we identify wide-ranging similarities and convergent evolution in the brain and AI, providing new insights into key concepts in neuroscience. Furthermore, inspired by processing mechanisms of AI, we propose a new theory that integrates established neuroscience theories, particularly the theories of internal models and the mirror neuron system. Both the neocortex and cerebellum predict future world events from past information and learn from prediction errors, thereby acquiring models of the world. These models enable three core processes: (1) Prediction -- generating future information, (2) Understanding -- interpreting the external world via compressed and abstracted sensory information, and (3) Generation -- repurposing the future-information generation mechanism to produce other types of outputs. The universal application of these processes underlies the ability of the neocortex and cerebellum to accomplish diverse functions with uniform circuits. Our systematic approach, insights, and theory promise groundbreaking advances in understanding the brain.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2411.16075

Country:

Asia > Middle East > Jordan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China > Beijing > Beijing (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Add feedback

Pralekha: An Indic Document Alignment Evaluation Benchmark

Suryanarayanan, Sanjay, Song, Haiyue, Khan, Mohammed Safi Ur Rahman, Kunchukuttan, Anoop, Khapra, Mitesh M., Dabre, Raj

arXiv.org Artificial IntelligenceNov-28-2024

Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.19096

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

Winata, Genta Indra, Anugraha, David, Susanto, Lucky, Kuwanto, Garry, Wijaya, Derry Tanti

arXiv.org Artificial IntelligenceNov-28-2024

Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.02381

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
(4 more...)

Add feedback

Aligning Pre-trained Models for Spoken Language Translation

Sedláček, Šimon, Kesiraju, Santosh, Polok, Alexander, Černocký, Jan

arXiv.org Artificial IntelligenceNov-27-2024

This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and small in comparison ( < 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this approach represents a viable and scalable approach to training end-to-end ST systems.

connector, mt model, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2411.18294

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(14 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Zuo, Ronglai, Potamias, Rolandos Alexandros, Ververas, Evangelos, Deng, Jiankang, Zafeiriou, Stefanos

arXiv.org Artificial IntelligenceNov-26-2024

Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at https://2000zrl.github.io/soke/.

dataset, sign language, tokenizer, (14 more...)

arXiv.org Artificial Intelligence

2411.17799

Country:

North America (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China (0.04)

Genre: Research Report (1.00)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.35)

Add feedback

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Suyunu, Burak, Taylan, Enes, Özgür, Arzucan

arXiv.org Artificial IntelligenceNov-26-2024

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.

protein sequence, sentencepiece, vocabulary size, (10 more...)

arXiv.org Artificial Intelligence

2411.17669

Country:

North America > United States (0.14)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Learning Monotonic Attention in Transducer for Streaming Generation

Ma, Zhengrui, Feng, Yang, Zhang, Min

arXiv.org Artificial IntelligenceNov-26-2024

Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer's decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forwardbackward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks. Unlike modern turn-based large language models, streaming models need to start generating the output before the input is completely read. Popular streaming generation methods can be broadly divided into two categories: Attentionbased Encoder-Decoder (AED; Bahdanau et al., 2015) and Transducer (Graves, 2012). Streaming AED models adapt the conventional sequence-to-sequence framework (Bahdanau, 2014) to support streaming generation. They often rely on an external policy module to determine the READ/WRITE actions in inference and to direct the scope of cross-attention in training. Examples include Wait-k policy (Ma et al., 2019) and monotonic attention-based methods (Raffel et al., 2017; Arivazhagan et al., 2019; Ma et al., 2020d; 2023a).

computational linguistic, monoattn-transducer, translation, (12 more...)

arXiv.org Artificial Intelligence

2411.1717

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Quebec > Montreal (0.04)
(17 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.54)
(2 more...)

Add feedback