AITopics

2301.01701

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Yolo County > Davis (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(7 more...)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

#artificialintelligence Jan-12-2023, 10:15:14 GMT

DeepL targets AI translation for enterprises with fresh $100 million

Seeking to target enterprise customers with AI language translation, Cologne, Germany-based DeepL announced a new funding raise that public reports estimate at well over $100 million. Language translation is an increasingly critical function for enterprises working across geographies and different demographics. Basic language translation capabilities have been available for decades, for example through services such as Google Translate. But the challenge has been enabling more advanced translation for business use cases that captures not just the literal meaning but the right tone and context.

artificial intelligence, machine translation, translation, (3 more...)

#artificialintelligence

Country: Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.29)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Salar, Sazan, Hassani, Hossein

A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities

arXiv.org Artificial Intelligence Jan-12-2023

Named Entity Recognition (NER) is one of the essential applications of Natural Language Processing (NLP). It is also an instrument that plays a significant role in many other NLP applications, such as Machine Translation (MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is an under-resourced language from the NLP perspective. In particular, the lack of NER resources in all categories hinders other aspects of Kurdish processing. In this work, we present a dataset that covers several categories of NEs in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit). It covers 11 categories and 33261 entries in total. The dataset is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/.

artificial intelligence, natural language, text processing, (18 more...)

2301.04962

Country:

Asia > Middle East > Iraq > Kurdistan Region > Duhok Governorate > Duhok (0.07)
Asia > Middle East > Iraq > Erbil Governorate > Erbil (0.07)

Genre: Research Report (0.40)

Industry: Government > Regional Government (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.92)

arXiv.org Artificial Intelligence Jan-10-2023

User-Centered Security in Natural Language Processing

Emmery, Chris

This dissertation proposes a framework of user-centered security in Natural Language Processing (NLP), and demonstrates how it can improve the accessibility of related research. Accordingly, it focuses on two security domains within NLP with great public interest. First, that of author profiling, which can be employed to compromise online privacy through invasive inferences. Without access and detailed insight into these models' predictions, there is no reasonable heuristic by which Internet users might defend themselves from such inferences. Secondly, that of cyberbullying detection, which by default presupposes a centralized implementation; i.e., content moderation across social platforms. As access to appropriate data is restricted, and the nature of the task rapidly evolves (both through lexical variation, and cultural shifts), the effectiveness of its classifiers is greatly diminished and thereby often misrepresented. Under the proposed framework, we predominantly investigate the use of adversarial attacks on language; i.e., changing a given input (generating adversarial samples) such that a given model does not function as intended. These attacks form a common thread between our user-centered security problems; they are highly relevant for privacy-preserving obfuscation methods against author profiling, and adversarial samples might also prove useful to assess the influence of lexical variation and augmentation on cyberbullying detection.

large language model, machine learning, text classification, (21 more...)

2301.0423

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Maryland > Baltimore (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(105 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.93)
Instructional Material > Course Syllabus & Notes (0.67)
(2 more...)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
(7 more...)

Dare, Megan, Diaz, Valentina Fajardo, So, Averie Ho Zoen, Wang, Yifan, Zhang, Shibingfeng

Unsupervised Mandarin-Cantonese Machine Translation

arXiv.org Artificial Intelligence Jan-10-2023

Advancements in unsupervised machine translation have enabled the development of machine translation systems that can translate between languages for which there is not an abundance of parallel data available. We explored unsupervised machine translation between Mandarin Chinese and Cantonese. Despite the vast number of native speakers of Cantonese, there is still no large-scale corpus for the language, because Cantonese is primarily used for oral communication. The key contributions of our project include (1) the creation of a new corpus containing approximately 1 million Cantonese sentences, and (2) a large-scale comparison across different model architectures, tokenization schemes, and embedding structures. Our best model, trained with character-based tokenization and a Transformer architecture, achieved a character-level BLEU of 25.1 when translating from Mandarin to Cantonese and of 24.4 when translating from Cantonese to Mandarin. In this paper we discuss our research process, experiments, and results.
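As a rough illustration of what a character-level BLEU score measures, the sketch below computes corpus-level BLEU over character n-grams instead of word n-grams. It is a minimal sketch under stated assumptions, not the evaluation setup of the paper: the abstract does not say which BLEU implementation, n-gram order, or smoothing was used, so the function names `char_ngrams` and `char_bleu`, the single-reference setting, the 4-gram limit, and the lack of smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return a Counter of character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level character BLEU (0-100), one reference per hypothesis.

    Assumptions: no smoothing, uniform n-gram weights, whitespace dropped.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # many times as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(totals) == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    print(char_bleu(["我今日去咗香港"], ["我今日去咗澳門"]))
```

Scoring on characters avoids having to choose a word segmenter for Chinese text, which is one common reason to report BLEU at this level. For reported results, an established evaluation tool is preferable to a hand-written function like this one.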

cantonese, machine learning, natural language, (18 more...)

2301.03971

Country:

Asia > China > Guangdong Province (0.14)
Asia > China > Hong Kong (0.07)
Europe > Germany > Saarland (0.04)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Korakakis, Michalis, Vlachos, Andreas

Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation

arXiv.org Artificial Intelligence Jan-10-2023

Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT'14 and WMT'14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.
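To make the role of Elastic Weight Consolidation more concrete, the sketch below shows the standard EWC penalty: a quadratic term that pulls each parameter back towards a previously learned value, weighted by an estimate of that parameter's importance (the diagonal Fisher information). It is a minimal sketch, assuming the penalty is added to the scheduled-sampling loss with the maximum-likelihood model as the anchor. The abstract does not spell out the exact objective, so the names `ewc_penalty` and `total_loss`, the flat parameter lists, and the weight `lam` are illustrative assumptions.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values (flat list of floats)
    anchor_params -- parameter values to stay close to (assumed here to be
                     the model trained with maximum likelihood estimation)
    fisher        -- diagonal Fisher information, one importance per parameter
    lam           -- strength of the penalty (hypothetical hyperparameter)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """Task loss (assumed: the scheduled-sampling loss) plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)


if __name__ == "__main__":
    # Moving an important parameter (Fisher 4.0) costs more than moving
    # an unimportant one (Fisher 0.1) by the same amount.
    print(ewc_penalty([2.0, 0.0], [1.0, 0.0], [4.0, 0.1]))
    print(ewc_penalty([1.0, 1.0], [1.0, 0.0], [4.0, 0.1]))
```

The penalty is zero when the parameters have not moved and grows fastest along directions the anchor model relied on, which is how it can limit forgetting while still allowing the new objective to change less important weights.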

computational linguistic, machine learning, natural language, (15 more...)

2109.06308

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(20 more...)

Genre:

Research Report (0.64)
Instructional Material (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Automatic Standardization of Arabic Dialects for Machine Translation

Alnassan, Abidrabbo

Based on an annotated multimedia corpus, the television series Marāyā 2013, we examine the question of "automatic standardization" of Arabic dialects for machine translation. Here we distinguish between rule-based machine translation and statistical machine translation. Machine translation from Arabic usually takes standard or modern Arabic as the source language and produces quite satisfactory translations, thanks to the availability of the translation memories necessary for training the models. The case is different for Arabic dialects, where the output is of much lower quality. In our research we try to apply machine translation methods to a dialect/standard (or modern) Arabic pair to automatically produce a standard Arabic text from a dialect input, a process we call "automatic standardization". We opt here for "statistical models" because rule-based "automatic standardization" is harder, given the lack of "diglossic" dictionaries on the one hand and the difficulty of creating linguistic rules for each dialect on the other. This research could then lead to combining "automatic standardization" software with machine translation software, so that the output of the first is fed as input to the second to obtain a quality machine translation. This approach may also have educational applications, such as tools that help users understand different Arabic dialects by transforming dialectal texts into standard Arabic.

artificial intelligence, machine translation, natural language, (15 more...)

2301.03447

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Asia > Malaysia (0.05)
Asia > Middle East > Syria > Damascus Governorate > Damascus (0.05)
(11 more...)

Genre: Research Report (0.70)

Industry:

Government (0.46)
Education (0.46)
Media (0.35)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Universal Multimodal Representation for Language Understanding

Zhang, Zhuosheng, Chen, Kehai, Wang, Rui, Utiyama, Masao, Sumita, Eiichiro, Li, Zuchao, Zhao, Hai

Representation learning is the foundation of natural language processing (NLP). This work presents new methods to employ visual information as auxiliary signals for general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs or a shared cross-modal embedding space that is pre-trained on off-the-shelf text-image pairs. Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively. The two sequences of representations are further fused by an attention layer for the interaction of the two modalities. In this study, the retrieval process is controllable and flexible. The universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs. Our method can be easily applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that our method is generally effective for different tasks and languages. Analysis indicates that the visual signals enrich textual representations of content words, provide fine-grained grounding information about the relationship between concepts and events, and may aid disambiguation.

artificial intelligence, machine learning, natural language, (17 more...)

doi: 10.1109/TPAMI.2023.3234170

2301.03344

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Heilongjiang Province > Harbin (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(20 more...)

Genre: Research Report > New Finding (0.86)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Vandeghinste, Vincent, Guhr, Oliver

FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers

When applying automated speech recognition (ASR) for Belgian Dutch (Van Dyck et al. 2021), the output consists of an unsegmented stream of words, without any punctuation. A next step is to perform segmentation and insert punctuation, making the ASR output more readable and easier to correct manually. As far as we know, there is no publicly available punctuation insertion system for Dutch that functions at a usable level. The model we present here is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available. We trained a sequence classification model, based on the Dutch language model RobBERT (Delobelle et al. 2020). For every word in the input sequence, the model predicts a punctuation marker that follows the word. We have also extended a multilingual model, for cases where the language is unknown or where code switching applies. When performing segmentation with the best models on out-of-domain test data, a sliding window of 200 words of the ASR output stream is sent to the classifier, and segmentation is applied when the system predicts a segmenting punctuation sign with a ratio above threshold. The results are much better than those of a machine translation baseline approach.
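The sliding-window segmentation step can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: each word is assumed to be covered by several overlapping windows, and a segment boundary is placed after a word when the share of windows predicting a sentence-ending mark after it exceeds a threshold. The function `segment_stream`, the `predict` callback, the stride, the threshold value, and the label set are all hypothetical; only the 200-word window comes from the abstract.

```python
from typing import Callable, List, Sequence

SEGMENTING = {".", "?", "!"}  # assumed set of sentence-ending labels


def segment_stream(
    words: Sequence[str],
    predict: Callable[[Sequence[str]], List[str]],
    window: int = 200,    # window size stated in the abstract
    stride: int = 50,     # assumption; must not exceed `window`
    threshold: float = 0.5,  # assumption
) -> List[List[str]]:
    """Split an unpunctuated word stream into segments by window voting.

    `predict` is a stand-in for the punctuation classifier: it takes a list
    of words and returns one predicted punctuation label per word.
    """
    votes = [0] * len(words)  # windows predicting a segmenting mark
    seen = [0] * len(words)   # windows that covered the word
    start = 0
    while True:
        chunk = words[start:start + window]
        for offset, label in enumerate(predict(chunk)):
            seen[start + offset] += 1
            if label in SEGMENTING:
                votes[start + offset] += 1
        if start + window >= len(words):
            break
        start += stride

    segments, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        # Cut after this word when enough windows agree on a boundary.
        if seen[i] and votes[i] / seen[i] > threshold:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def toy_predict(chunk):
    """Toy classifier for the demo: marks two fixed words as sentence-final."""
    return ["." if w in {"thuis", "morgen"} else "0" for w in chunk]


if __name__ == "__main__":
    stream = "ik ben thuis ik kom morgen".split()
    print(segment_stream(stream, toy_predict, window=4, stride=2))
```

Voting across overlapping windows gives words near a window edge a second look with more context on both sides, which is one reason to prefer a ratio over a single prediction per word.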

corpus, machine learning, natural language, (20 more...)

2301.03319

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
Europe > Slovenia (0.04)
(12 more...)

Genre:

Research Report > New Finding (0.48)
Instructional Material > Course Syllabus & Notes (0.46)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

State-of-the-art generalisation research in NLP: A taxonomy and review

Hupkes, Dieuwke, Giulianelli, Mario, Dankers, Verna, Artetxe, Mikel, Elazar, Yanai, Pimentel, Tiago, Christodoulopoulos, Christos, Lasri, Karim, Saphra, Naomi, Sinclair, Arabella, Ulmer, Dennis, Schottmann, Florian, Batsuren, Khuyagbaatar, Sun, Kaiser, Sinha, Koustuv, Khalatbari, Leila, Ryskina, Maria, Frieske, Rita, Cotterell, Ryan, Jin, Zhijing

The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to address both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they investigate, the type of data shift they consider, the source of this data shift, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis that maps out the current state of generalisation research in NLP, and we make recommendations for which areas might deserve attention in the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to take steps towards making state-of-the-art generalisation testing the new status quo in NLP.

large language model, machine learning, reinforcement learning, (26 more...)

2210.0305

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.14)
Europe > Italy > Tuscany > Florence (0.04)
(39 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Media > News (1.00)
Education (1.00)
Information Technology > Security & Privacy (0.67)
Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(6 more...)