AITopics

2304.09919

Country:

Oceania > Papua New Guinea (0.25)
North America > United States > California (0.14)
Africa > Niger (0.06)
(19 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Artemova, Ekaterina, Plank, Barbara

Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

arXiv.org Artificial IntelligenceApr-19-2023

Bilingual word lexicons are crucial tools for multilingual natural language understanding and machine translation tasks, as they facilitate the mapping of words in one language to their synonyms in another language. To achieve this, numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, using a typical pipeline consisting of two unsupervised steps: bitext mining and word alignment, both of which rely on pre-trained large language models~(LLMs). In this paper, we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses several unique challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects. To evaluate the BLI outputs, we analyze them with respect to word frequency and pairwise edit distance. Additionally, we release two evaluation datasets comprising 1,500 bilingual sentence pairs and 1,000 bilingual word pairs. They were manually judged for their semantic similarity for each Bavarian-German and Alemannic-German language pair.

computational linguistic, large language model, natural language, (18 more...)

2304.09957

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(19 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)

Jha, Akshita, Reddy, Chandan K.

CodeAttack: Code-Based Adversarial Attacks for Pre-trained Programming Language Models

Pre-trained programming language (PL) models (such as CodeT5, CodeBERT, GraphCodeBERT, etc.,) have the potential to automate software engineering tasks involving code understanding and code generation. However, these models operate in the natural channel of code, i.e., they are primarily concerned with the human understanding of the code. They are not robust to changes in the input and thus, are potentially susceptible to adversarial attacks in the natural channel. We propose, CodeAttack, a simple yet effective black-box attack model that uses code structure to generate effective, efficient, and imperceptible adversarial code samples and demonstrates the vulnerabilities of the state-of-the-art PL models to code-specific adversarial attacks. We evaluate the transferability of CodeAttack on several code-code (translation and repair) and code-NL (summarization) tasks across different programming languages. CodeAttack outperforms state-of-the-art adversarial NLP attack models to achieve the best overall drop in performance while being more efficient, imperceptible, consistent, and fluent. The code can be found at https://github.com/reddy-lab-code-research/CodeAttack.

artificial intelligence, machine learning, natural language, (21 more...)

2206.00052

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (0.91)
Government > Military (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Gumma, Varun, Dabre, Raj, Kumar, Pratyush

An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models

Knowledge distillation (KD) is a well-known method for compressing neural models. However, works focusing on distilling knowledge from large multilingual neural machine translation (MNMT) models into smaller ones are practically nonexistent, despite the popularity and superiority of MNMT. This paper bridges this gap by presenting an empirical investigation of knowledge distillation for compressing MNMT models. We take Indic to English translation as a case study and demonstrate that commonly used language-agnostic and language-aware KD approaches yield models that are 4-5x smaller but also suffer from performance drops of up to 3.5 BLEU. To mitigate this, we then experiment with design considerations such as shallower versus deeper models, heavy parameter sharing, multi-stage training, and adapters. We observe that deeper compact models tend to be as good as shallower non-compact ones, and that fine-tuning a distilled model on a High-Quality subset slightly boosts translation quality. Overall, we conclude that compressing MNMT models via KD is challenging, indicating immense scope for further research.

artificial intelligence, computational linguistic, natural language, (15 more...)

2304.09388

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(9 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

Chung, Hyung Won, Constant, Noah, Garcia, Xavier, Roberts, Adam, Tay, Yi, Narang, Sharan, Firat, Orhan

Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.

computational linguistic, large language model, machine learning, (20 more...)

2304.09151

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(9 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
(2 more...)

Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models

Liu, Jiawei, Kang, Yangyang, Tang, Di, Song, Kaisong, Sun, Changlong, Wang, Xiaofeng, Lu, Wei, Liu, Xiaozhong

Neural text ranking models have witnessed significant advancement and are increasingly being deployed in practice. Unfortunately, they also inherit adversarial vulnerabilities of general neural models, which have been detected but remain underexplored by prior studies. Moreover, the inherit adversarial vulnerabilities might be leveraged by blackhat SEO to defeat better-protected search engines. In this study, we propose an imitation adversarial attack on black-box neural passage ranking models. We first show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates and then train a ranking imitation model. Leveraging the ranking imitation model, we can elaborately manipulate the ranking results and transfer the manipulation attack to the target ranking model. For this purpose, we propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers, which causes premeditated disorderliness with very few tokens. To equip the trigger camouflages, we add the next sentence prediction loss and the language model fluency constraint to the objective function. Experimental results on passage ranking demonstrate the effectiveness of the ranking imitation attack model and adversarial triggers against various SOTA neural ranking models. Furthermore, various mitigation analyses and human evaluation show the effectiveness of camouflages when facing potential mitigation approaches. To motivate other scholars to further investigate this novel and important problem, we make the experiment data and code publicly available.

information retrieval, large language model, machine learning, (22 more...)

2209.06506

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(24 more...)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(4 more...)

Shalumov, Vitaly, Haskey, Harel

HeRo: RoBERTa and Longformer Hebrew Language Models

In this paper, we fill in an existing gap in resources available to the Hebrew NLP community by providing it with the largest so far pre-train dataset HeDC4, a state-of-the-art pre-trained language model HeRo for standard length inputs and an efficient transformer LongHeRo for long input sequences. The HeRo model was evaluated on the sentiment analysis, the named entity recognition, and the question answering tasks while the LongHeRo model was evaluated on the document classification task with a dataset composed of long documents. Both HeRo and LongHeRo presented state-of-the-art performance. The dataset and model checkpoints used in this work are publicly available.

artificial intelligence, natural language, text processing, (19 more...)

2304.11077

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)

arXiv.org Artificial IntelligenceApr-17-2023

VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning

Zhang, Zhen-Ru, Tan, Chuanqi, Huang, Songfang, Huang, Fei

Recent studies have demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. In addition to involving the masked language model objective, existing cross-lingual pre-training works leverage sentence-level contrastive learning or plugs in extra cross-attention module to complement the insufficient capabilities of cross-lingual alignment. Nonetheless, synonym pairs residing in bilingual corpus are not exploited and aligned, which is more crucial than sentence interdependence establishment for token-level tasks. In this work, we propose a cross-lingual pre-trained model VECO~2.0 based on contrastive learning with multi-granularity alignments. Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs. Then, token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance. Experiments show the effectiveness of the proposed strategy for cross-lingual model pre-training on the XTREME benchmark.

computational linguistic, machine learning, natural language, (19 more...)

2304.08205

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Dominican Republic (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Goyle, Vakul, Krishnaswamy, Parvathy, Ravikumar, Kannan Girija, Chattopadhyay, Utsa, Goyle, Kartikay

Neural Machine Translation For Low Resource Languages

arXiv.org Artificial IntelligenceApr-17-2023

Neural Machine translation is a challenging task due to the inherent complex nature and the fluidity that natural languages bring. Nonetheless, in recent years, it has achieved state-of-the-art performance in several language pairs. Although, a lot of traction can be seen in the areas of multilingual neural machine translation (MNMT) in the recent years, there are no comprehensive survey done to identify what approaches work well. The goal of this paper is to investigate the realm of low resource languages and build a Neural Machine Translation model to achieve state-of-the-art results. The paper looks to build upon the mBART language model and explore strategies to augment it with various NLP and Deep Learning techniques like back translation and transfer learning. This implementation tries to unpack the architecture of the NMT application and determine the different components which offers us opportunities to amend the said application within the purview of the low resource languages problem space.

machine learning, natural language, translation, (20 more...)

2304.07869

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(7 more...)

Genre:

Research Report (0.50)
Overview (0.34)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial IntelligenceApr-16-2023

SikuGPT: A Generative Pre-trained Model for Intelligent Information Processing of Ancient Texts from the Perspective of Digital Humanities

Chang, Liu, Dongbo, Wang, Zhixiao, Zhao, Die, Hu, Mengcheng, Wu, Litao, Lin, Si, Shen, Bin, Li, Jiangfeng, Liu, Hai, Zhang, Lianzheng, Zhao

The rapid advance in artificial intelligence technology has facilitated the prosperity of digital humanities research. Against such backdrop, research methods need to be transformed in the intelligent processing of ancient texts, which is a crucial component of digital humanities research, so as to adapt to new development trends in the wave of AIGC. In this study, we propose a GPT model called SikuGPT based on the corpus of Siku Quanshu. The model's performance in tasks such as intralingual translation and text classification exceeds that of other GPT-type models aimed at processing ancient texts. SikuGPT's ability to process traditional Chinese ancient texts can help promote the organization of ancient information and knowledge services, as well as the international dissemination of Chinese ancient culture.

large language model, machine learning, natural language, (21 more...)

2304.07778

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report > New Finding (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)