AITopics

2311.08306

Country:

North America > Canada > Ontario > Toronto (0.05)
Europe > Germany > Berlin (0.05)
Oceania > Australia > Victoria > Melbourne (0.04)
(19 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

WIRED May-5-2024, 12:00:00 GMT

Translation Tech Is Amazing, Except When It's Not

Today's language translation apps are like self-driving cars: incredibly useful, promising, nearing maturity, and almost entirely powered by machines. It's astonishing that the technology even exists. Even so, machine translation is still clunky at times, if not awkward. Consider a recent conversation I had with my neighbor, Andre, who immigrated from Russia last year. Speaking little to no English, Andre is navigating the American Dream almost entirely through Google Translate, the most popular speech-to-speech translation app, first launched 10 years ago.

artificial intelligence, google translate, natural language, (7 more...)

WIRED

Country:

Europe > Russia (0.26)
Asia > Russia (0.26)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

An Active Inference Agent for Simulating Human Translation Processes in a Hierarchical Architecture: Integrating the Task Segment Framework and the HOF taxonomy

Carl, Michael

In this paper, we propose modelling human translation production as a hierarchy of three embedded translation processes. The proposed architecture replicates the temporal dynamics of keystroke production across sensorimotor, cognitive, and phenomenal layers. Utilizing data from the CRITT TPR-DB, the Task Segment Framework, and the HOF taxonomy, we demonstrate the temporal breakdown of the typing flow on distinct timelines within these three layers.

keystroke, task segment, translator, (16 more...)

2405.03111

Country:

Asia > India > Karnataka > Bengaluru (0.04)
South America > Colombia > Bogotá D.C. > Bogotá (0.04)
North America > United States > New York (0.04)
(2 more...)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.84)
Information Technology > Artificial Intelligence > Cognitive Science (0.82)

Öztürk, Şükrü, Keles, Hacer Yalim

E-TSL: A Continuous Educational Turkish Sign Language Dataset with Baseline Methods

This study introduces the continuous Educational Turkish Sign Language (E-TSL) dataset, collected from online Turkish language lessons for 5th, 6th, and 8th grades. The dataset comprises 1,410 videos totaling nearly 24 hours and includes performances from 11 signers. Turkish, an agglutinative language, poses unique challenges for sign language translation, particularly with a vocabulary in which 64% of words are singletons and 85% are rare words, appearing fewer than five times. We developed two baseline models to address these challenges: the Pose to Text Transformer (P2T-T) and the Graph Neural Network based Transformer (GNN-T) models. The GNN-T model achieved a BLEU-1 score of 19.13% and a BLEU-4 score of 3.28%, presenting a significant challenge compared to existing benchmarks. The P2T-T model, while demonstrating slightly lower performance in BLEU scores, achieved a higher ROUGE-L score of 22.09%. Additionally, we benchmarked our model using the well-known PHOENIX-Weather 2014T dataset to validate our approach.

dataset, sign language, translation, (15 more...)

2405.02984

Country:

Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report > New Finding (0.47)

Industry: Education > Curriculum > Subject-Specific Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Relay Decoding: Concatenating Large Language Models for Machine Translation

Fu, Chengpeng, Feng, Xiaocheng, Huang, Yichong, Huo, Wenshuai, Li, Baohang, Wang, Hui, Qin, Bin, Liu, Ting

Leveraging large language models for machine translation has demonstrated promising results. However, it does require the large language models to possess the capability of handling both the source and target languages in machine translation. When it is challenging to find large models that support the desired languages, resorting to continuous learning methods becomes a costly endeavor. To mitigate these expenses, we propose an innovative approach called RD (Relay Decoding), which entails concatenating two distinct large models that individually support the source and target languages. By incorporating a simple mapping layer to facilitate the connection between these two models and utilizing a limited amount of parallel data for training, we successfully achieve superior results in the machine translation task. Experimental results conducted on the Multi30k and WikiMatrix datasets validate the effectiveness of our proposed method.

language model, machine translation, translation, (10 more...)

2405.02933

Country:

Asia > Singapore (0.05)
Asia > China > Heilongjiang Province > Harbin (0.05)
North America > United States > Pennsylvania (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Sentiment Analysis Across Languages: Evaluation Before and After Machine Translation to English

Kathunia, Aekansh, Kaif, Mohammad, Arora, Nalin, Narotam, N

People communicate in more than 7,000 languages around the world, with around 780 languages spoken in India alone. Despite this linguistic diversity, research on Sentiment Analysis has predominantly focused on English text data, resulting in a disproportionate availability of sentiment resources for English. This paper examines the performance of transformer models in Sentiment Analysis tasks across multilingual datasets and text that has undergone machine translation. By comparing the effectiveness of these models in different linguistic contexts, we gain insights into their performance variations and potential implications for sentiment analysis across diverse languages. We also discuss the shortcomings and potential for future work towards the end.

bert 0, machine translation, sentiment analysis, (11 more...)

2405.02887

Country:

Asia > India (0.24)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Tōhoku (0.05)
(6 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)

arXiv.org Artificial Intelligence May-4-2024

On the Information Redundancy in Non-Autoregressive Translation

Wang, Zhihao, Wang, Longyue, Su, Jinsong, Yao, Junfeng, Tu, Zhaopeng

Token repetition is a typical form of multi-modal problem in fully non-autoregressive translation (NAT). In this work, we revisit the multi-modal problem in recently proposed NAT models. Our study reveals that these advanced models have introduced other types of information redundancy errors, which cannot be measured by the conventional metric - the continuous repetition ratio. By manually annotating the NAT outputs, we identify two types of information redundancy errors that correspond well to lexical and reordering multi-modality problems. Since human annotation is time-consuming and labor-intensive, we propose automatic metrics to evaluate the two types of redundant errors. Our metrics allow future studies to evaluate new methods and gain a more comprehensive understanding of their effectiveness.
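A minimal sketch of why a continuous repetition ratio can miss other redundancy errors, as the abstract notes. The definition used here (share of tokens identical to the token immediately before them) is an assumption for illustration; the listing does not give the paper's exact formula, and the function name and example sentences are hypothetical.

```python
def continuous_repetition_ratio(tokens):
    """Assumed definition: fraction of tokens equal to their immediate predecessor."""
    if len(tokens) < 2:
        return 0.0
    # Count positions where a token repeats the one directly before it.
    repeats = sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
    return repeats / len(tokens)


# Adjacent repetition is detected ...
adjacent = "the cat cat sat on on the mat".split()
adjacent_ratio = continuous_repetition_ratio(adjacent)

# ... but redundancy that is not adjacent (here, a phrase stated twice) scores zero.
non_adjacent = "the cat sat the cat sat".split()
non_adjacent_ratio = continuous_repetition_ratio(non_adjacent)

print(adjacent_ratio, non_adjacent_ratio)
```

Under this assumed definition the second output contains redundant material yet is scored as clean, which is the kind of gap that motivates additional metrics.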

nat model, redundancy, translation, (14 more...)

2405.02673

Country:

Asia > China > Fujian Province > Xiamen (0.04)
North America > United States > Maine (0.04)
Asia > Taiwan (0.04)

Genre: Research Report (0.40)

Industry:

Health & Medicine (0.68)
Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

arXiv.org Artificial Intelligence May-3-2024

The Call for Socially Aware Language Technologies

Yang, Diyi, Hovy, Dirk, Jurgens, David, Plank, Barbara

Language technologies have made enormous progress, especially with the introduction of large language models (LLMs). On traditional tasks such as machine translation and sentiment analysis, these models perform at near-human level. These advances can, however, exacerbate a variety of issues that models have traditionally struggled with, such as bias, evaluation, and risks. In this position paper, we argue that many of these issues share a common core: a lack of awareness of the factors, context, and implications of the social environment in which NLP operates, which we call social awareness. While NLP is getting better at solving the formal linguistic aspects, limited progress has been made in adding the social awareness required for language applications to work in all situations for all users. Integrating social awareness into NLP models will make applications more natural, helpful, and safe, and will open up new possibilities. Thus we argue that substantial challenges remain for NLP to develop social awareness and that we are just at the beginning of a new era for the field.

computational linguistic, proceedings, social awareness, (12 more...)

2405.02411

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Singapore (0.04)
(14 more...)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

arXiv.org Artificial Intelligence May-2-2024

UQA: Corpus for Urdu Question Answering

Arif, Samee, Farid, Sualeha, Athar, Awais, Raza, Agha Ali

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we obtain an F1 score of 85.99 and an exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
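The three steps named by EATS (enclose the answer, translate, seek the answer again) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `<a>`/`</a>` markers, the function names, and `toy_translate` (a stand-in for a real translation system, mapping English to French for readability) are all hypothetical.

```python
import re

# Hypothetical anchor markers; the listing does not say which markers the paper uses.
OPEN, CLOSE = "<a>", "</a>"


def enclose(context, start, end):
    """Enclose: wrap the answer span context[start:end] in anchor markers."""
    return context[:start] + OPEN + context[start:end] + CLOSE + context[end:]


def toy_translate(text):
    """Stand-in for a real MT system; assumed to pass the markers through unchanged."""
    return text.replace("The capital of France is", "La capitale de la France est")


def seek(translated):
    """Seek: find the anchored span, strip the markers, return text, offset, answer."""
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    match = re.search(pattern, translated, flags=re.DOTALL)
    if match is None:
        return None  # markers were lost in translation; the example would be discarded
    answer = match.group(1)
    clean = translated[:match.start()] + answer + translated[match.end():]
    return clean, match.start(), answer


context = "The capital of France is Paris."
start = context.index("Paris")
marked = enclose(context, start, start + len("Paris"))
result = seek(toy_translate(marked))
print(result)
```

The point of the anchoring is that the answer offset is recovered from the translated text itself, so it stays valid even when translation changes the length or word order of the surrounding context.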

computational linguistic, dataset, paragraph, (15 more...)

2405.01458

Country:

Asia > Pakistan > Punjab > Lahore Division > Lahore (0.05)
Europe > Italy > Tuscany > Florence (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(8 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Schäfer, Anton, Hofmann, Thomas, Schlag, Imanol, Pimentel, Tiago

On the Effect of (Near) Duplicate Subwords in Language Modelling

arXiv.org Artificial Intelligence May-2-2024

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less sample-efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as "now" and "Now". We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
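A small sketch of how near duplicate subwords might be grouped in a vocabulary. The normalisation rule used here (ignore case and surrounding whitespace) is an assumption chosen to match the "now" / "Now" example; the listing does not state the paper's exact criteria, and the function name and toy vocabulary are hypothetical.

```python
from collections import defaultdict


def group_near_duplicates(vocab):
    """Group subwords that share a form once case and surrounding spaces are ignored."""
    groups = defaultdict(list)
    for subword in vocab:
        key = subword.strip().casefold()  # assumed notion of "near duplicate"
        groups[key].append(subword)
    # Keep only keys that actually have more than one surface form.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# Toy vocabulary; each entry would receive its own arbitrary index in a real tokeniser.
vocab = ["now", "Now", " now", "new", "NOW", "know"]
duplicates = group_near_duplicates(vocab)
print(duplicates)
```

Each surface form in a group gets an unrelated index, so the model has to learn from data that they behave similarly. Merging such a group into one index is the kind of intervention the abstract reports as hurting performance for naturally occurring near duplicates.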

duplicate, information, subword, (16 more...)

2404.06508

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Arizona > Maricopa County > Scottsdale (0.04)
North America > Dominican Republic (0.04)
(5 more...)

Genre: Research Report > Experimental Study (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)