AITopics

2311.04192

Genre: Research Report > New Finding (0.75)

Technology:

Information Technology > Artificial Intelligence > Vision (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

arXiv.org Artificial IntelligenceNov-7-2023

Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition

Jeon, Taehee, Yang, Bongseok, Kim, Changhwan, Lim, Yoonseob

We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system. Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs). Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA. This suggests that integrating morpheme type information can enhance language models' syntactic and semantic capabilities, indicating that adopting more linguistic insights can further improve performance beyond standard morphological analysis.

decomposition, morpheme, tokenization, (10 more...)

2311.03928

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ukraine (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

arXiv.org Artificial IntelligenceNov-7-2023

Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation

Singh, Pushpdeep

Neural Machine Translation (NMT) models are state-of-the-art for machine translation. However, these models are known to have various social biases, especially gender bias. Most of the work on evaluating gender bias in NMT has focused primarily on English as the source language. For source languages different from English, most of the studies use gender-neutral sentences to evaluate gender bias. However, practically, many sentences that we encounter do have gender information. Therefore, it makes more sense to evaluate for bias using such sentences. This allows us to determine if NMT models can identify the correct gender based on the grammatical gender cues in the source sentence rather than relying on biased correlations with, say, occupation terms. To demonstrate our point, in this work, we use Hindi as the source language and construct two sets of gender-specific sentences: OTSC-Hindi and WinoMT-Hindi that we use to evaluate different Hindi-English (HI-EN) NMT systems automatically for gender bias. Our work highlights the importance of considering the nature of language when designing such extrinsic bias evaluation datasets.

computational linguistic, gender, translation, (12 more...)

2311.03767

Country:

Europe > Italy > Tuscany > Florence (0.05)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

CBSiMT: Mitigating Hallucination in Simultaneous Machine Translation with Weighted Prefix-to-Prefix Training

Liu, Mengge, Zhang, Wen, Li, Xiang, Tian, Yanzhi, Guo, Yuhang, Luan, Jian, Wang, Bin, Chen, Shuoying

Simultaneous machine translation (SiMT) is a challenging task that requires starting translation before the full source sentence is available. Prefix-to-prefix framework is often applied to SiMT, which learns to predict target tokens using only a partial source prefix. However, due to the word order difference between languages, misaligned prefix pairs would make SiMT models suffer from serious hallucination problems, i.e. target outputs that are unfaithful to source inputs. Such problems can not only produce target tokens that are not supported by the source prefix, but also hinder generating the correct translation by receiving more source words. In this work, we propose a Confidence-Based Simultaneous Machine Translation (CBSiMT) framework, which uses model confidence to perceive hallucination tokens and mitigates their negative impact with weighted prefix-to-prefix training. Specifically, token-level and sentence-level weights are calculated based on model confidence and acted on the loss function. We explicitly quantify the faithfulness of the generated target tokens using the token-level weight, and employ the sentence-level weight to alleviate the disturbance of sentence pairs with serious word order differences on the model. Experimental results on MuST-C English-to-Chinese and WMT15 German-to-English SiMT tasks demonstrate that our method can consistently improve translation quality at most latency regimes, with up to 2 BLEU scores improvement at low latency.

machine translation, proc, translation, (13 more...)

2311.03672

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

Song, Haiyue, Dabre, Raj, Chu, Chenhui, Fujita, Atsushi, Kurohashi, Sadao

Lecture transcript translation helps learners understand online courses, however, building a high-quality lecture machine translation system lacks publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we propose a dynamic programming based sentence alignment algorithm which leverages the cosine similarity of machine-translated sentences. The sentence alignment F1 score reaches 96%, which is higher than using the BERTScore, LASER, or sentBERT methods. For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets through manual filtering for benchmarking translation performance. Through machine translation experiments, we show that the mined corpora enhance the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora via multistage fine-tuning. Furthermore, this study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we have released the corpora as well as the code to create them. The dataset is available at https://github.com/shyyhs/CourseraParallelCorpusMining.

computational linguistic, proceedings, translation, (15 more...)

2311.03696

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(17 more...)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.48)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs

Wang, Longyue, Tu, Zhaopeng, Gu, Yan, Liu, Siyou, Yu, Dian, Ma, Qingsong, Lyu, Chenyang, Zhou, Liting, Liu, Chao-Hong, Ma, Yufeng, Chen, Weiyu, Graham, Yvette, Webber, Bonnie, Koehn, Philipp, Way, Andy, Yuan, Yulin, Shi, Shuming

Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2023, the first edition of the Discourse-Level Literary Translation. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted and document-level Chinese-English web novel corpus. Furthermore, we put forth an industry-endorsed criteria to guide human evaluation process. This year, we totally received 14 submissions from 7 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and leaderboard at http://www2.statmt.org/wmt23/literary-translation-task.html.

evaluation, machine translation, translation, (12 more...)

2311.03127

Country:

North America > United States > California (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.14)
Asia > China > Heilongjiang Province > Harbin (0.04)
(7 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Cotroneo, Domenico, Improta, Cristina, Liguori, Pietro, Natella, Roberto

Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks

AI-based code generators have become pivotal in assisting developers in writing software starting from natural language (NL). However, they are trained on large amounts of data, often collected from unsanitized online sources (e.g., GitHub, HuggingFace). As a consequence, AI models become an easy target for data poisoning, i.e., an attack that injects malicious samples into the training data to generate vulnerable code. To address this threat, we investigate the security of AI code generators by devising a targeted data poisoning strategy. We poison the training data by injecting increasing amounts of code containing security vulnerabilities and assess the attack's success on different state-of-the-art models for code generation. Our study shows that AI code generators are vulnerable to even a small amount of poison. Notably, the attack success strongly depends on the model architecture and poisoning rate, whereas it is not influenced by the type of vulnerabilities. Moreover, since the attack does not impact the correctness of code generated by pre-trained models, it is hard to detect. Lastly, our work offers practical insights into understanding and potentially mitigating this threat.

data poisoning, poisoning, snippet, (13 more...)

2308.04451

Country:

North America > United States > District of Columbia > Washington (0.05)
Europe > Italy > Campania > Naples (0.04)
North America > United States > New York > New York County > New York City (0.04)
(11 more...)

Genre: Research Report > Experimental Study (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Agarwal, Milind, Alam, Md Mahfuz Ibn, Anastasopoulos, Antonios

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIt, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.

computational linguistic, language identification, proceedings, (14 more...)

2305.14263

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ukraine (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(28 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.92)

arXiv.org Artificial IntelligenceNov-5-2023

Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding

Zeng, Jiali, Meng, Fandong, Yin, Yongjing, Zhou, Jie

Contemporary translation engines built upon the encoder-decoder framework have reached a high level of development, while the emergence of Large Language Models (LLMs) has disrupted their position by offering the potential for achieving superior translation quality. Therefore, it is crucial to understand in which scenarios LLMs outperform traditional NMT systems and how to leverage their strengths. In this paper, we first conduct a comprehensive analysis to assess the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT nor MT-oriented LLMs alone can effectively address all the translation issues, but MT-oriented LLMs can serve as a promising complement to the NMT systems. Building upon these insights, we explore hybrid methods and propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution to handle complex scenarios beyond the capability of NMT alone. The results on the WMT22 test sets and a newly collected test set WebCrawl demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in machine translation.

mt-oriented llm, nmt system, translation, (13 more...)

2311.02851

Country:

Asia > China > Fujian Province > Xiamen (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
(7 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology (0.93)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Æsøy, Kristoffer, Ozaki, Ana

Rule Learning as Machine Translation using the Atomic Knowledge Bank

arXiv.org Artificial IntelligenceNov-5-2023

Machine learning models, and in particular language models, are being applied to various tasks that require reasoning. While such models are good at capturing patterns their ability to reason in a trustable and controlled manner is frequently questioned. On the other hand, logic-based rule systems allow for controlled inspection and already established verification methods. However it is well-known that creating such systems manually is time-consuming and prone to errors. We explore the capability of transformers to translate sentences expressing rules in natural language into logical rules. We see reasoners as the most reliable tools for performing logical reasoning and focus on translating language into the format expected by such tools. We perform experiments using the DKET dataset from the literature and create a dataset for language to logic translation based on the Atomic knowledge bank.

atomic knowledge bank, machine translation, rule learning

2311.02765

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Association Learning (0.40)