AITopics

2412.03304

Country:

Europe > United Kingdom (0.14)
North America > Central America (0.14)
Asia > Middle East > Israel (0.04)
(54 more...)

Genre: Research Report > New Finding (0.92)

Industry:

Law (1.00)
Government (1.00)
Education > Curriculum > Subject-Specific Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Real-Time Multilingual Sign Language Processing

Moryossef, Amit

Sign Language Processing (SLP) is an interdisciplinary field comprised of Natural Language Processing (NLP) and Computer Vision. It is focused on the computational understanding, translation, and production of signed languages. Traditional approaches have often been constrained by the use of gloss-based systems that are both language-specific and inadequate for capturing the multidimensional nature of sign language. These limitations have hindered the development of technology capable of processing signed languages effectively. This thesis aims to revolutionize the field of SLP by proposing a simple paradigm that can bridge this existing technological gap. We propose the use of SignWiring, a universal sign language transcription notation system, to serve as an intermediary link between the visual-gestural modality of signed languages and text-based linguistic representations. We contribute foundational libraries and resources to the SLP community, thereby setting the stage for a more in-depth exploration of the tasks of sign language translation and production. These tasks encompass the translation of sign language from video to spoken language text and vice versa. Through empirical evaluations, we establish the efficacy of our transcription method as a pivot for enabling faster, more targeted research, that can lead to more natural and accurate translations across a range of languages. The universal nature of our transcription-based paradigm also paves the way for real-time, multilingual applications in SLP, thereby offering a more inclusive and accessible approach to language technology. This is a significant step toward universal accessibility, enabling a wider reach of AI-driven language technologies to include the deaf and hard-of-hearing community.

large language model, machine learning, natural language, (25 more...)

2412.01991

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.13)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.13)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(49 more...)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)
Instructional Material (0.92)
Research Report > Promising Solution (0.67)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

Ranathunga, Surangika, Sirithunga, Rumesh, Rathnayake, Himashi, De Silva, Lahiru, Aluthwala, Thamindu, Peramuna, Saman, Shekhar, Ravi

SiTSE: Sinhala Text Simplification Dataset and Evaluation

Text Simplification is a task that has been minimally explored for low-resource languages. Consequently, there are only a few manually curated datasets. In this paper, we present a human curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and corresponding 3,000 simplified sentences produced by three different human annotators. We model the text simplification task as a zero-shot and zero resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems, and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: https://github.com/brainsharks-fyp17/Sinhala-Text-Simplification-Dataset-and-Evaluation

large language model, machine learning, natural language, (17 more...)

2412.01293

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Sri Lanka (0.05)
Europe > United Kingdom > England > Essex (0.04)
(8 more...)

Genre: Research Report (0.84)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

Qu, Zhi, Wang, Yiran, Ding, Chenchen, Tanaka, Hideki, Utiyama, Masao, Watanabe, Taro

Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, decoder-only architecture has been explored less in MNMT due to its underperformance when trained on parallel data solely. In this work, we attribute the issue of the decoder-only architecture to its lack of language transfer capability. Specifically, the decoder-only architecture is insufficient in encoding source tokens with the target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios. Experimental results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations.

artificial intelligence, machine translation, natural language, (18 more...)

2412.02101

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > Niger (0.04)
Asia > Japan (0.04)
(11 more...)

Genre: Research Report > New Finding (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Artificial intelligence contribution to translation industry: looking back and forward

Shormani, Mohammed Q.

This study provides a comprehensive analysis of artificial intelligence (AI) contribution to translation industry (ACTI) research, synthesizing it over forty-one years from 1980-2024. 13220 articles were retrieved from three sources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz., scientometric and thematic, focusing on cluster, subject categories, keywords, burstness, centrality and research centers as for the former. For the latter, we thematically review 18 articles, selected purposefully from the articles involved, centering on purpose, approach, findings, and contribution to ACTI future directions. The findings reveal that in the past AI contribution to translation industry was not rigorous, resulting in rule-based machine translation and statistical machine translation whose output was not satisfactory. However, the more AI develops, the more machine translation develops, incorporating Neural Networking Algorithms and (Deep) Language Learning Models like ChatGPT whose translation output has developed considerably. However, much rigorous research is still needed to overcome several problems encountering translation industry, specifically concerning low-source languages, multi-dialectical and free word order languages, and cultural and religious registers.

acti research, machine translation, translation, (13 more...)

2411.19855

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Marashian, Ali, Rice, Enora, Gessler, Luke, Palmer, Alexis, von der Wense, Katharina

From Priest to Doctor: Domain Adaptaion for Low-Resource Neural Machine Translation

arXiv.org Artificial IntelligenceDec-1-2024

Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.

artificial intelligence, natural language, translation, (15 more...)

2412.00966

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Portugal > Lisbon > Lisbon (0.14)
Europe > Iceland (0.04)
(18 more...)

Genre: Research Report > New Finding (0.68)

Industry:

Government (0.68)
Law (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

arXiv.org Artificial IntelligenceNov-29-2024

Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design, with Focus on German Text Data

Hahn, Udo

We survey clinical document corpora, with focus on German textual data. Due to rigid data privacy legislation in Germany these resources, with only few exceptions, are stored in safe clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing where easy accessibility and reuse of data collections are common practice. Hence, alternative corpus designs have been examined to escape from this data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several other types of domain proxies have come up as substitutes for authentic clinical documents. Common instances of close proxies are medical journal publications, clinical therapy guidelines, drug labels, etc., more distant proxies include online encyclopedic medical articles or medical contents from social media channels. After PRISM-conformant screening of 359 hits from four bibliographic systems, 75 relevant documents were finally selected for this review and 59 distinct corpora were determined. We identified 24 real clinical corpora (from 40 publications) out of which only 5 are publicly distributable. 2 translations of real corpora and 3 synthetic ones complement the set of clinical corpora. 14 corpora were categorized as close domain proxies, 16 as distant ones. There is a clear divide between the large number of non-accessible authentic clinical German-language corpora and their publicly accessible substitutes: translated or synthetic, close or more distant proxies. So on first sight, the data bottleneck seems broken. Intuitively yet, differences in genre-specific writing style, wording and medical domain expertise in this typological space are also obvious. This raises the question how valid alternative corpus designs really are.

large language model, machine learning, natural language, (20 more...)

2412.0023

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
(35 more...)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.88)

Industry:

Media (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.92)
(3 more...)

Billah, Syed Mohammed Mostaque, Subarna, Ateya Ahmed, Sarna, Sudipta Nandi, Wasit, Ahmad Shawkat, Fariha, Anika, Sushmit, Asif, Sadeque, Arig Yousuf

Towards Santali Linguistic Inclusion: Building the First Santali-to-English Translation Model using mT5 Transformer and Data Augmentation

arXiv.org Artificial IntelligenceNov-29-2024

Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, positioning it as nearly the third most commonly used Austroasiatic language. Despite its prominence among the Austroasiatic language family's Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to include Santali to the NPL spectrum. We aim to examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addressed the low-resource problem and, with promising results, examined the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that Santali-English parallel corpus performs better when in transformers like mt5 as opposed to untrained transformers, proving that transfer learning can be a viable technique that works with Santali language. Besides the mT5 transformer, Santali-English performs better than Santali-Bangla parallel corpus as the mT5 has been trained in way more English data than Bangla data. Lastly, our study shows that with data augmentation, our model performs better.

artificial intelligence, machine learning, natural language, (16 more...)

2411.19726

Country:

Asia > Bangladesh (0.25)
Asia > Nepal (0.24)
Asia > India (0.24)
(6 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Ohmae, Shogo, Ohmae, Keiko

The brain versus AI: World-model-based versatile circuit computation underlying diverse functions in the neocortex and cerebellum

arXiv.org Artificial IntelligenceNov-29-2024

AI's significant recent advances using general-purpose circuit computations offer a potential window into how the neocortex and cerebellum of the brain are able to achieve a diverse range of functions across sensory, cognitive, and motor domains, despite their uniform circuit structures. However, comparing the brain and AI is challenging unless clear similarities exist, and past reviews have been limited to comparison of brain-inspired vision AI and the visual neocortex. Here, to enable comparisons across diverse functional domains, we subdivide circuit computation into three elements -- circuit structure, input/outputs, and the learning algorithm -- and evaluate the similarities for each element. With this novel approach, we identify wide-ranging similarities and convergent evolution in the brain and AI, providing new insights into key concepts in neuroscience. Furthermore, inspired by processing mechanisms of AI, we propose a new theory that integrates established neuroscience theories, particularly the theories of internal models and the mirror neuron system. Both the neocortex and cerebellum predict future world events from past information and learn from prediction errors, thereby acquiring models of the world. These models enable three core processes: (1) Prediction -- generating future information, (2) Understanding -- interpreting the external world via compressed and abstracted sensory information, and (3) Generation -- repurposing the future-information generation mechanism to produce other types of outputs. The universal application of these processes underlies the ability of the neocortex and cerebellum to accomplish diverse functions with uniform circuits. Our systematic approach, insights, and theory promise groundbreaking advances in understanding the brain.

large language model, machine learning, natural language, (23 more...)

2411.16075

Country:

Asia > Middle East > Jordan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China > Beijing > Beijing (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Suryanarayanan, Sanjay, Song, Haiyue, Khan, Mohammed Safi Ur Rahman, Kunchukuttan, Anoop, Khapra, Mitesh M., Dabre, Raj

Pralekha: An Indic Document Alignment Evaluation Benchmark

arXiv.org Artificial IntelligenceNov-28-2024

Mining parallel document pairs poses a significant challenge because existing sentence embedding models often have limited context windows, preventing them from effectively capturing document-level information. Another overlooked issue is the lack of concrete evaluation benchmarks comprising high-quality parallel document pairs for assessing document-level mining approaches, particularly for Indic languages. In this study, we introduce Pralekha, a large-scale benchmark for document-level alignment evaluation. Pralekha includes over 2 million documents, with a 1:2 ratio of unaligned to aligned pairs, covering 11 Indic languages and English. Using Pralekha, we evaluate various document-level mining approaches across three dimensions: the embedding models, the granularity levels, and the alignment algorithm. To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC). DAC demonstrates substantial improvements over baseline pooling approaches, particularly in noisy scenarios, achieving average gains of 20-30% in precision and 15-20% in F1 score. These results highlight DAC's effectiveness in parallel document mining for Indic languages.

artificial intelligence, machine learning, natural language, (18 more...)

2411.19096

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)