AITopics

2401.0481

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > California (0.14)
(6 more...)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

S, Selva Kumar, Khan, Afifah Khan Mohammed Ajmal, Manjeshwar, Chirag, Banday, Imadh Ajaz

Language Detection for Transliterated Content

arXiv.org Artificial IntelligenceJan-9-2024

In the contemporary digital era, the Internet functions as an unparalleled catalyst, dismantling geographical and linguistic barriers particularly evident in texting. This evolution facilitates global communication, transcending physical distances and fostering dynamic cultural exchange. A notable trend is the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages, posing a unique challenge for language technology in accurately detecting the source language. This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English utilizing BERT for language classification and Google Translate API for transliteration conversion. The research pioneers innovative approaches to identify and convert transliterated text, navigating challenges in the diverse linguistic landscape of digital communication. Emphasizing the pivotal role of comprehensive datasets for training Large Language Models LLMs like BERT, our model showcases exceptional proficiency in accurately identifying and classifying languages from transliterated text. With a validation accuracy of 99% our models robust performance underscores its reliability. The comprehensive exploration of transliteration dynamics supported by innovative approaches and cutting edge technologies like BERT, positions our research at the forefront of addressing unique challenges in the linguistic landscape of digital communication. Beyond contributing to language identification and transliteration capabilities this work holds promise for applications in content moderation, analytics and fostering a globally connected community engaged in meaningful dialogue.

communication, english alphabet, transliteration, (14 more...)

2401.04619

Country: Asia > India > Karnataka > Bengaluru (0.07)

Genre:

Overview > Innovation (0.76)
Research Report > Promising Solution (0.56)

Industry: Telecommunications (0.71)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.36)

The Atlantic - TechnologyJan-8-2024, 12:30:00 GMT

The Last Frontier of Machine Translation

When Google Translate was released, in 2006, I was an eighth grader stumbling through introductory Spanish, and my teacher had little reason to worry about her students using it to cheat. It's almost hard to remember now, but early machine-translation systems were laughably poor. They could give you the general thrust of, say, a Portuguese website, but they often failed at even basic tasks. In one case from 2010, a Google-translated summons reportedly instructed a defendant to avoid court instead of showing up there. Machine translation didn't become the juggernaut we know until 2015, when Baidu released its large-scale neural machine-translation system, built with the same basic architecture that chatbots such as ChatGPT use today. Google started switching from a statistical model to a neural system not long after, as did peers such as Systran and Microsoft Translator.

artificial intelligence, natural language, translation, (17 more...)

The Atlantic - Technology

Country:

North America > United States (0.05)
Europe > Netherlands (0.05)

Industry: Education > Educational Setting > K-12 Education (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training

Le, Khoi M., Pham, Trinh, Quan, Tho, Luu, Anh Tuan

Paraphrases are texts that convey the same meaning while using different words or sentence structures. It can be used as an automatic data augmentation tool for many Natural Language Processing tasks, especially when dealing with low-resource languages, where data shortage is a significant problem. To generate a paraphrase in multilingual settings, previous studies have leveraged the knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation in the same language. Despite good performance on human evaluation, those methods still require parallel translation datasets, thus making them inapplicable to languages that do not have parallel corpora. To mitigate that problem, we proposed the first unsupervised multilingual paraphrasing model, LAMPAT ($\textbf{L}$ow-rank $\textbf{A}$daptation for $\textbf{M}$ultilingual $\textbf{P}$araphrasing using $\textbf{A}$dversarial $\textbf{T}$raining), by which monolingual dataset is sufficient enough to generate a human-like and diverse sentence. Throughout the experiments, we found out that our method not only works well for English but can generalize on unseen languages as well. Data and code are available at https://github.com/phkhanhtrinh23/LAMPAT.

adversarial training, dataset, proceedings, (10 more...)

2401.04348

Country:

Asia > Singapore (0.04)
North America > Dominican Republic (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(7 more...)

Genre: Research Report (0.64)

Industry: Energy (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Vickers, Peter, Barrault, Loïc, Monti, Emilio, Aletras, Nikolaos

We Need to Talk About Classification Evaluation Metrics in NLP

In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics, and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use. This lack suggests there has not been sufficient examination of the underlying heuristics which each metric encodes. To address this we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the SciKitLearn classifier format.

accuracy, classifier, informedness, (14 more...)

2401.03831

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Ireland (0.04)
Europe > France (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)

Zouhar, Vilém, Bojar, Ondřej

Quality and Quantity of Machine Translation References for Automated Metrics

Automatic machine translation metrics often use human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

computational linguistic, proceedings, translation, (12 more...)

2401.01283

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Czechia (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Understanding Deep Gradient Leakage via Inversion Influence Functions

Zhang, Haobo, Hong, Junyuan, Deng, Yuyang, Mahdavi, Mehrdad, Zhou, Jiayu

Deep Gradient Leakage (DGL) is a highly effective attack that recovers private training images from gradient vectors. This attack casts significant privacy challenges on distributed learning from clients with sensitive data, where clients are required to share gradients. Defending against such attacks requires but lacks an understanding of when and how privacy leakage happens, mostly because of the black-box nature of deep networks. In this paper, we propose a novel Inversion Influence Function (I$^2$F) that establishes a closed-form connection between the recovered images and the private gradients by implicitly solving the DGL problem. Compared to directly solving DGL, I$^2$F is scalable for analyzing deep networks, requiring only oracle access to gradients and Jacobian-vector products. We empirically demonstrate that I$^2$F effectively approximated the DGL generally on different model architectures, datasets, modalities, attack implementations, and perturbation-based defenses. With this novel tool, we provide insights into effective gradient perturbation directions, the unfairness of privacy protection, and privacy-preferred model initialization. Our codes are provided in https://github.com/illidanlab/inversion-influence-function.

eigenvalue, gradient, privacy risk, (15 more...)

2309.13016

Country:

North America > United States > Michigan (0.04)
North America > United States > Pennsylvania (0.04)
South America > Peru > Lima Department > Lima Province > Lima (0.04)
(2 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

arXiv.org Artificial IntelligenceJan-7-2024

Building Efficient and Effective OpenQA Systems for Low-Resource Languages

Budur, Emrah, Özçelik, Rıza, Soylu, Dilara, Khattab, Omar, Güngör, Tunga, Potts, Christopher

Question answering (QA) is the task of answering questions posed in natural language with free-form natural language answers extracted from a given passage. In the OpenQA variant, only a question text is given, and the system must retrieve relevant passages from an unstructured knowledge source and use them to provide answers, which is the case in the mainstream QA systems on the Web. QA systems currently are mostly limited to the English language due to the lack of large-scale labeled QA datasets in non-English languages. In this paper, we show that effective, low-cost OpenQA systems can be developed for low-resource languages. The key ingredients are (1) weak supervision using machine-translated labeled datasets and (2) a relevant unstructured knowledge source in the target language. Furthermore, we show that only a few hundred gold assessment examples are needed to reliably evaluate these systems. We apply our method to Turkish as a challenging case study, since English and Turkish are typologically very distinct. We present SQuAD-TR, a machine translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA for Turkish. We obtain a performance improvement of 9-34% in the EM score and 13-33% in the F1 score compared to the BM25-based and DPR-based baseline QA reader models by using two versions of Wikipedia dumps spanning two years. Our results show that SQuAD-TR makes OpenQA feasible for Turkish, which we hope encourages researchers to build OpenQA systems in other low-resource languages. We make all the code, models, and the dataset publicly available.

dataset, retrieved, retriever, (16 more...)

2401.0359

Country:

Europe > United Kingdom (0.14)
North America > United States > Washington > King County > Seattle (0.14)
South America > Venezuela (0.04)
(34 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.93)
Leisure & Entertainment (0.93)
Health & Medicine (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Ambilduke, Kshitij, Shetye, Aneesh, Bagade, Diksha, Bhagwatkar, Rishika, Fitter, Khurshed, Vagdargi, Prasad, Chiddarwar, Shital

Enhancing Context Through Contrast

arXiv.org Artificial IntelligenceJan-6-2024

Considerable progress in learning such representations has been achieved by language modelling and mutual information maximization objectives using contrastive learning. The language-dependent nature of language modelling introduces a trade-off between the universality of the learned representations and the model's performance on the language modelling tasks. Although contrastive learning improves performance, its success cannot be attributed to mutual information alone. We propose a novel Context Enhancement step to improve performance on neural machine translation by maximizing mutual information using the Barlow Twins loss. Unlike other approaches, we do not explicitly augment the data but view languages as implicit augmentations, eradicating the risk of disrupting semantic information. Further, our method does not learn embeddings from scratch and can be generalised to any set of pre-trained embeddings. Finally, we evaluate the language-agnosticism of our embeddings through language classification and use them for neural machine translation to compare with state-of-the-art approaches.

neural machine translation, representation, translation, (16 more...)

2401.03314

Country:

Asia > India (0.05)
North America > United States > Maryland > Baltimore (0.04)

Genre:

Research Report (0.70)
Overview (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Pathak, Dhrubajyoti, Narzary, Sanjib, Nandi, Sukumar, Som, Bidisha

Part-of-Speech Tagger for Bodo Language using Deep Learning approach

arXiv.org Artificial IntelligenceJan-6-2024

Language Processing systems such as Part-of-speech tagging, Named entity recognition, Machine translation, Speech recognition, and Language modeling (LM) are well-studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language model plays a vital role in the downstream tasks of modern NLP. Extensive studies are carried out on LMs for high-resource languages. Nevertheless, languages such as Bodo, Rabha, and Mising continue to lack coverage. In this study, we first present BodoBERT, a language model for the Bodo language. To the best of our knowledge, this work is the first such effort to develop a language model for Bodo. Secondly, we present an ensemble DL-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with CRF and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several language models in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.

architecture, bodobert, experiment, (13 more...)

2401.03175

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)