 marianmt



Translation Entropy: A Statistical Framework for Evaluating Translation Systems

arXiv.org Artificial Intelligence

The translation of written language has been known since the 3rd century BC; however, the need for it has grown steadily in the information age. Today, many translators exist, based on deep encoder-decoder architectures. Nevertheless, no quantitative, objective methods are available to assess their performance, likely because the entropy of even a single language remains unknown. This study presents a quantitative method for estimating translation entropy, built on the following key finding: for a given translator, several sentences that differ from a given pivot sentence by only one selected token yield identical translations. Analyzing the statistics of this phenomenon across an ensemble of such sentences, each containing the selected pivot token, yields the probabilities of replacing this specific token with others while preserving the translation. These probabilities determine the entropy of the selected token, and the average across all selected pivot tokens provides an estimate of the translator's overall translation entropy, which is enhanced along the decoder blocks. This entropic measure allows several publicly available translators to be ranked quantitatively and reveals whether mutual translation entropy is symmetric. Extending the proposed method to the replacement of two tokens in a given pivot sentence demonstrates a multiplicative effect, where translation degeneracy is proportional to the product of the degeneracies of the two tokens. These findings establish translation entropy as a measurable property and as an objective benchmark for artificial translators. Results are based on the MarianMT, T5-Base and NLLB-200 translators.
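The abstract does not give the exact estimator, so the following is only a minimal sketch of the idea under stated assumptions: the token entropy is taken to be the Shannon entropy of the normalized counts of replacements that leave the translation unchanged, and the overall value is the plain mean over pivot tokens. The function names and the toy translator are hypothetical and stand in for a real model such as MarianMT.

```python
import math
from collections import Counter


def preserving_replacements(translate, pivot, position, candidates):
    """Count candidate tokens that, placed at `position`, keep the translation identical."""
    reference = translate(" ".join(pivot))
    kept = Counter()
    for token in candidates:
        variant = pivot[:position] + [token] + pivot[position + 1:]
        if translate(" ".join(variant)) == reference:
            kept[token] += 1
    return kept


def token_entropy(preserving_counts):
    """Shannon entropy in bits of the normalized preserving-replacement counts (assumed definition)."""
    total = sum(preserving_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in preserving_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def translation_entropy(counts_per_pivot_token):
    """Average token entropy over all selected pivot tokens."""
    entropies = [token_entropy(c) for c in counts_per_pivot_token.values()]
    return sum(entropies) / len(entropies) if entropies else 0.0


def toy_translate(sentence):
    """Hypothetical stand-in translator: maps a few synonyms to one output word."""
    canonical = {"large": "big", "huge": "big"}
    return " ".join(canonical.get(word, word) for word in sentence.split())


pivot = ["a", "big", "dog"]
counts = {
    "big": preserving_replacements(toy_translate, pivot, 1, ["big", "large", "huge", "small"]),
    "dog": preserving_replacements(toy_translate, pivot, 2, ["dog", "cat"]),
}
# "big" has three interchangeable tokens (log2(3) bits); "dog" has one (0 bits).
print(translation_entropy(counts))
```

In this toy setting each candidate is counted once, so the token entropy reduces to the base-2 logarithm of the token's degeneracy. If two positions behave independently, their degeneracies multiply and their entropies add, which is one way to read the multiplicative effect reported for two-token replacement.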


Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti

arXiv.org Artificial Intelligence

Although the findings highlight the effectiveness of fine-tuned transformer models for Bengali-Sylheti translation, several limitations remain. The dataset size (5,002 parallel sentences) restricts the models' capacity to generalize across diverse syntactic structures, stylistic variations, and domain-specific expressions. In addition, orthographic inconsistencies in Sylheti introduce noise, leading to training instability, particularly in models like mBART-50. Another limitation is the reliance on automatic evaluation metrics such as BLEU and chrF, which may not fully capture the linguistic richness or cultural nuance of Sylheti. Future research should therefore focus on expanding the dataset through community-driven contributions and data augmentation strategies. Incorporating orthographic normalization could improve consistency and reduce variability during training. Hybrid approaches that combine the strengths of pre-trained LLMs with fine-tuned NMT models may also enhance translation robustness in low-resource settings. Finally, incorporating human evaluation will provide a more comprehensive assessment of translation adequacy, fluency, and cultural alignment.
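To make concrete what a surface metric such as chrF measures, and why it can miss cultural nuance, here is a simplified sketch of a character n-gram F-score. It is an illustration under assumptions, not the study's evaluation code: whitespace is dropped, precision and recall are averaged over n = 1..6, and beta = 2 weights recall more heavily. Reference implementations such as sacreBLEU differ in details, so scores will not match them exactly.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplifying assumption)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Hypothetical strings for illustration only.
print(chrf("the cat sat on the mat", "the cat is on the mat"))
```

The score depends only on overlapping character sequences, so two renderings with the same meaning but different spelling conventions are penalized, which is relevant given the orthographic inconsistencies noted above.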


A Closed form Token level Decomposition

Neural Information Processing Systems

The typos do not affect related conclusions. For unsupervised LCG experiments, we use Yelp Reviews (Cho et al., 2018) and WMT News section. Please refer to the official website of the WMT dataset (Bojar et al., 2017) for more information about ... For MT experiments, we load the MarianMT from the es-en checkpoint provided by huggingface. All the hyperparameters are tuned on the development set. We simply report the results after the maximum number of training epochs (usually 20). For more implementation details and tricks, please refer to our code.


Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT

arXiv.org Artificial Intelligence

ORCID: 0000-0003-2376-4591

Abstract. Text remains a relevant form of representation for information. Text documents are created either on digital-native platforms or through conversion of other media such as images and speech. While digital-native text is invariably entered through physical or virtual keyboards, technologies such as OCR and speech recognition are used to transform images and speech signals into text. All of these mechanisms of text generation also introduce errors into the captured text. This project aims to analyze the different kinds of errors that occur in text documents. The work employs two advanced deep neural network based language models, namely BART and MarianMT, to rectify the anomalies present in text. Transfer learning of these models with an available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models are able to bring down the number of erroneous sentences by 20% or more, BART handles spelling errors far better (24.6%) than grammatical errors (8.8%).

I. Introduction. Text is a natural representation of all the existing languages in the world. Texts help people express themselves and communicate with others. Handwritten texts have been part of history for ages, while digital texts have evolved to keep up with rapidly growing technology in day-to-day life. Text allows people to extend their knowledge and memory beyond their body into the surrounding environment [1]. Text is available in various forms, from handwritten manuscripts to ... This is a pre-print version of the paper. Texts can be used for personal purposes such as diary entries and blogs, as well as for professional purposes like advertising and surveying.

From the newspaper one reads in the morning to the social media scrolling before going to bed, people are surrounded by text. It is human nature to categorize any kind of data one receives. Because so much text is available, people tend to inspect and review the text they require. This inspection is the process of scanning textual data in order to derive meaning and store information. Most businesses rely on text analysis to extract valuable insights from various raw sources. The feedback received from sources such as emails, chat messages, social media posts, comments and statements, and survey responses helps them in their decision-making strategies.