Machine Translation
From Priest to Doctor: Domain Adaptaion for Low-Resource Neural Machine Translation
Marashian, Ali, Rice, Enora, Gessler, Luke, Palmer, Alexis, von der Wense, Katharina
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.
Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data
We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany these resources, with only few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for authentic clinical documents. Common instances of close proxies are medical journal publications, clinical therapy guidelines, drug labels, etc., more distant proxies include online encyclopedic medical articles or medical contents from social media channels. After PRISM-conformant screening of 359 hits from four bibliographic systems, 75 relevant documents were finally selected for this review and 59 distinct corpora were determined. We identified 24 real clinical corpora (from 40 publications) out of which only 5 are publicly distributable. 2 translations of real corpora and 3 synthetic ones complement the set of clinical corpora. 14 corpora were categorized as close domain proxies, 16 as distant ones. There is a clear divide between the large number of non-accessible authentic clinical German-language corpora and their publicly accessible substitutes: translated or synthetic, close or more distant proxies. So on first sight, the data bottleneck seems broken. Intuitively yet, differences in genre-specific writing style, wording and medical domain expertise in this typological space are also obvious. This raises the question how valid alternative corpus designs really are.
Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation
Billah, Syed Mohammed Mostaque, Subarna, Ateya Ahmed, Sarna, Sudipta Nandi, Wasit, Ahmad Shawkat, Fariha, Anika, Sushmit, Asif, Sadeque, Arig Yousuf
Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, positioning it as nearly the third most commonly used Austroasiatic language. Despite its prominence among the Austroasiatic language family's Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to include Santali to the NPL spectrum. We aim to examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addressed the low-resource problem and, with promising results, examined the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that Santali-English parallel corpus performs better when in transformers like mt5 as opposed to untrained transformers, proving that transfer learning can be a viable technique that works with Santali language. Besides the mT5 transformer, Santali-English performs better than Santali-Bangla parallel corpus as the mT5 has been trained in way more English data than Bangla data. Lastly, our study shows that with data augmentation, our model performs better.
The brain versus AI: World-model-based versatile circuit computation underlying diverse functions in the neocortex and cerebellum
AI's significant recent advances using general-purpose circuit computations offer a potential window into how the neocortex and cerebellum of the brain are able to achieve a diverse range of functions across sensory, cognitive, and motor domains, despite their uniform circuit structures. However, comparing the brain and AI is challenging unless clear similarities exist, and past reviews have been limited to comparison of brain-inspired vision AI and the visual neocortex. Here, to enable comparisons across diverse functional domains, we subdivide circuit computation into three elements -- circuit structure, input/outputs, and the learning algorithm -- and evaluate the similarities for each element. With this novel approach, we identify wide-ranging similarities and convergent evolution in the brain and AI, providing new insights into key concepts in neuroscience. Furthermore, inspired by processing mechanisms of AI, we propose a new theory that integrates established neuroscience theories, particularly the theories of internal models and the mirror neuron system. Both the neocortex and cerebellum predict future world events from past information and learn from prediction errors, thereby acquiring models of the world. These models enable three core processes: (1) Prediction -- generating future information, (2) Understanding -- interpreting the external world via compressed and abstracted sensory information, and (3) Generation -- repurposing the future-information generation mechanism to produce other types of outputs. The universal application of these processes underlies the ability of the neocortex and cerebellum to accomplish diverse functions with uniform circuits. Our systematic approach, insights, and theory promise groundbreaking advances in understanding the brain.
Pralekha: An Indic Document Alignment Evaluation Benchmark
Suryanarayanan, Sanjay, Song, Haiyue, Khan, Mohammed Safi Ur Rahman, Kunchukuttan, Anoop, Khapra, Mitesh M., Dabre, Raj
Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.
MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences
Winata, Genta Indra, Anugraha, David, Susanto, Lucky, Kuwanto, Garry, Wijaya, Derry Tanti
Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.
Aligning Pre-trained Models for Spoken Language Translation
Sedláček, Šimon, Kesiraju, Santosh, Polok, Alexander, Černocký, Jan
This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and small in comparison ( < 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this approach represents a viable and scalable approach to training end-to-end ST systems.
Signs as Tokens: An Autoregressive Multilingual Sign Language Generator
Zuo, Ronglai, Potamias, Rolandos Alexandros, Ververas, Evangelos, Deng, Jiankang, Zafeiriou, Stefanos
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at https://2000zrl.github.io/soke/.
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
Suyunu, Burak, Taylan, Enes, Özgür, Arzucan
Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties. However, existing subword tokenization methods, developed primarily for human language, may be inadequate for protein sequences, which have unique patterns and constraints. This study evaluates three prominent tokenization approaches, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, across varying vocabulary sizes (400-6400), analyzing their effectiveness in protein sequence representation, domain boundary preservation, and adherence to established linguistic laws. Our comprehensive analysis reveals distinct behavioral patterns among these tokenizers, with vocabulary size significantly influencing their performance. BPE demonstrates better contextual specialization and marginally better domain boundary preservation at smaller vocabularies, while SentencePiece achieves better encoding efficiency, leading to lower fertility scores. WordPiece offers a balanced compromise between these characteristics. However, all tokenizers show limitations in maintaining protein domain integrity, particularly as vocabulary size increases. Analysis of linguistic law adherence shows partial compliance with Zipf's and Brevity laws but notable deviations from Menzerath's law, suggesting that protein sequences may follow distinct organizational principles from natural languages. These findings highlight the limitations of applying traditional NLP tokenization methods to protein sequences and emphasize the need for developing specialized tokenization strategies that better account for the unique characteristics of proteins.
Learning Monotonic Attention in Transducer for Streaming Generation
Ma, Zhengrui, Feng, Yang, Zhang, Min
Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer's decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forwardbackward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks. Unlike modern turn-based large language models, streaming models need to start generating the output before the input is completely read. Popular streaming generation methods can be broadly divided into two categories: Attentionbased Encoder-Decoder (AED; Bahdanau et al., 2015) and Transducer (Graves, 2012). Streaming AED models adapt the conventional sequence-to-sequence framework (Bahdanau, 2014) to support streaming generation. They often rely on an external policy module to determine the READ/WRITE actions in inference and to direct the scope of cross-attention in training. Examples include Wait-k policy (Ma et al., 2019) and monotonic attention-based methods (Raffel et al., 2017; Arivazhagan et al., 2019; Ma et al., 2020d; 2023a).