AITopics

2411.04699

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Hong Kong (0.04)
(14 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.94)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Ng, Lynnette Hui Xian, Chan, Luo Qi

What talking you?: Translating Code-Mixed Messaging Texts to English

Translation of code-mixed texts to formal English allows a wider audience to understand these code-mixed languages and facilitates downstream analysis applications such as sentiment analysis. In this work, we look at translating Singlish, which is colloquial Singaporean English, to formal standard English. Singlish is formed through the code-mixing of multiple Asian languages and dialects. We analysed the presence of other Asian languages and variants which can facilitate translation. Our dataset consists of short message texts, written as informal communication between Singlish speakers. We use a multi-step prompting scheme on five Large Language Models (LLMs) for language detection and translation. Our analysis shows that LLMs do not perform well in this task, and we describe the challenges involved in the translation of code-mixed languages. We also release our dataset at https://github.com/luoqichan/singlish.

large language model, machine learning, natural language, (17 more...)

2411.05253

Country:

Asia > Singapore (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Taiwan (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Findings of the IWSLT 2024 Evaluation Campaign

Ahmad, Ibrahim Said, Anastasopoulos, Antonios, Bojar, Ondřej, Borg, Claudia, Carpuat, Marine, Cattoni, Roldano, Cettolo, Mauro, Chen, William, Dong, Qianqian, Federico, Marcello, Haddow, Barry, Javorský, Dávid, Krubiński, Mateusz, Lam, Tsz Kin, Ma, Xutai, Mathur, Prashant, Matusov, Evgeny, Maurya, Chandresh, McCrae, John, Murray, Kenton, Nakamura, Satoshi, Negri, Matteo, Niehues, Jan, Niu, Xing, Ojha, Atul Kr., Ortega, John, Papi, Sara, Polák, Peter, Pospíšil, Adam, Pecina, Pavel, Salesky, Elizabeth, Sethiya, Nivedita, Sarkar, Balaram, Shi, Jiatong, Sikasote, Claytone, Sperber, Matthias, Stüker, Sebastian, Sudoh, Katsuhito, Thompson, Brian, Turchi, Marco, Waibel, Alex, Watanabe, Shinji, Wilken, Patrick, Zemánek, Petr, Zevallos, Rodolfo

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

machine learning, natural language, translation, (18 more...)

2411.05088

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Czechia (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
(50 more...)

Genre: Research Report > Experimental Study (0.92)

Industry:

Leisure & Entertainment (0.94)
Education (0.68)
Media > Television (0.47)
Government > Regional Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Zaranis, Emmanouil, Attanasio, Giuseppe, Agrawal, Sweta, Martins, André F. T.

Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation

The automatic assessment of translation quality has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. Although quality estimation (QE) metrics have been optimized to align with human judgments, no attention has been given to these metrics' potential biases, particularly in reinforcing visibility and usability for some demographic groups over others. This study is the first to investigate gender bias in QE metrics and its downstream impact on machine translation (MT). Focusing on out-of-English translations into languages with grammatical gender, we ask: Do contemporary QE metrics exhibit gender bias? Can the use of contextual information mitigate this bias? How does QE influence gender bias in MT outputs? Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. Masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Moreover, context-aware QE metrics reduce errors for masculine-inflected references but fail to address feminine referents, exacerbating gender disparities. Additionally, QE metrics can perpetuate gender bias in MT systems when used in quality-aware decoding. Our findings underscore the need to address gender bias in QE metrics to ensure equitable and unbiased MT systems.

computational linguistic, machine learning, natural language, (14 more...)

2410.10995

Country:

Europe > Portugal > Lisbon > Lisbon (0.14)
Asia > Singapore (0.05)
Asia > Thailand > Bangkok > Bangkok (0.04)
(19 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun

Song, Seyoung, Yoo, Haneul, Jin, Jiho, Cho, Kyunghyun, Oh, Alice

Historical and linguistic connections within the Sinosphere have led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments across machine translation, named entity recognition, and punctuation restoration tasks show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja, with performance differences within $\pm 0.0068$ F1-score for sequence labeling tasks and up to $+0.84$ BLEU score for translation. These limitations persist consistently across various model sizes, architectures, and domain-specific datasets. Our analysis reveals that the benefits of Classical Chinese resources diminish rapidly as local language data increases for Hanja, while showing substantial improvements only in extremely low-resource scenarios for both Korean and Japanese historical documents. These mixed results emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.

classical chinese, computational linguistic, translation, (15 more...)

2411.04822

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Vietnam (0.04)
North America > Canada > Ontario > Toronto (0.04)
(12 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Abdedaiem, Amin, Dahou, Abdelhalim Hafedh, Cheragui, Mohamed Amine, Mathiak, Brigitte

FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis

Building a corpus has become an important topic in natural language processing (NLP), especially for low-resource languages (e.g., the Algerian dialect, AD), because of the role that corpora play in the development of several tools, such as machine translation Babaali and Salem [2022], part-of-speech tagging Chiche and Yitagesu [2022], and named entity recognition Jarrar et al. [2022], in particular with the emergence of techniques based on statistics, machine learning and deep learning, which exploit this mass of information to develop, train and evaluate models. However, building a corpus is not an easy task Bakari et al. [2016]; it is extremely time-consuming and requires a lot of work, because the volume and quality of the corpus are two important parameters, despite the recent emergence of techniques that consume fewer resources, such as few-shot learning Tunstall et al. [2022]. Over the last few years, many studies in NLP have focused on languages or language variants described as low-resource Mengoni and Santucci [2023]. This change of direction is mainly due to the emergence of social media such as Facebook, Twitter, RenRen, LinkedIn, Google+, and Tuenti, as a means of communication where people exchange messages and comments.

algerian dialect, corpus, dialect, (14 more...)

doi: 10.1016/j.procs.2024.10.214

2411.04604

Country:

Africa > Middle East > Algeria > Adrar Province > Adrar (0.04)
Europe > Germany (0.04)
North America > United States (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Media > News (0.86)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

arXiv.org Artificial Intelligence Nov-6-2024

From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models

Zhang, Charles, Peng, Benji, Sun, Xintian, Niu, Qian, Liu, Junyu, Chen, Keyu, Li, Ming, Feng, Pohsun, Bi, Ziqian, Liu, Ming, Zhang, Yichao, Fei, Cheng, Yin, Caitlyn Heqi, Yan, Lawrence KQ, Wang, Tianyang

Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
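The contrast the abstract draws between sparse one-hot encodings and dense embeddings can be illustrated with a small sketch. The three-word vocabulary and the two-dimensional dense vectors below are made-up toy values, not taken from Word2Vec, GloVe, fastText, or the survey itself; the point is only that distinct one-hot vectors are always orthogonal, whereas dense vectors can encode graded similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vocabulary (an assumption for illustration).
vocab = ["king", "queen", "apple"]

def one_hot(word):
    """Sparse representation: one dimension per vocabulary entry."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical dense vectors; real embeddings are learned from corpora.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

one_hot_sim = cosine(one_hot("king"), one_hot("queen"))
dense_sim_related = cosine(dense["king"], dense["queen"])
dense_sim_unrelated = cosine(dense["king"], dense["apple"])
```

With these toy values, the one-hot similarity between any two different words is zero, while the dense vectors rank "king" closer to "queen" than to "apple".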

information retrieval, large language model, machine learning, (19 more...)

2411.05036

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Texas (0.04)
North America > Canada (0.04)
(5 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Malinga, Melusi, Lupanda, Isaac, Nkongolo, Mike Wa, van Deventer, Phil

A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI

arXiv.org Artificial Intelligence Nov-6-2024

South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba), which creates unique challenges for AI-driven translation and sentiment analysis systems due to a lack of accurately labeled data. This study seeks to address these challenges by developing a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models such as Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB) trained to predict sentiment across low-resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment, achieving 99% accuracy and 98% precision and outperforming the other models. The BERT predictions were clarified using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.

sentiment, sentiment analysis, sentiment score, (16 more...)

2411.04316

Country:

Africa > Democratic Republic of the Congo (0.54)
Africa > South Africa > Gauteng > Pretoria (0.04)
Europe > Switzerland (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(5 more...)

Houbre, Mael, Boudin, Florian, Daille, Beatrice, Aizawa, Akiko

Self-Compositional Data Augmentation for Scientific Keyphrase Generation

arXiv.org Artificial Intelligence Nov-6-2024

State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement with respect to their representativity.
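The idea of pairing training documents by shared keyphrases and combining them into synthetic samples can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how relatedness is scored or how documents are combined, so the Jaccard overlap, the 0.3 threshold, the plain text concatenation, and the sample documents are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Keyphrase overlap between two documents (assumed relatedness measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def augment(docs, threshold=0.3):
    """Build synthetic (text, keyphrases) samples from related document pairs.

    docs: list of (text, keyphrase_list) tuples from the training set only,
    so no external data is needed.
    """
    synthetic = []
    for (text1, keys1), (text2, keys2) in combinations(docs, 2):
        if jaccard(keys1, keys2) >= threshold:
            # Union of keyphrases, keeping first-seen order and dropping repeats.
            merged_keys = list(dict.fromkeys(keys1 + keys2))
            # Assumed combination strategy: simple concatenation of the texts.
            synthetic.append((text1 + " " + text2, merged_keys))
    return synthetic

# Hypothetical toy training set.
docs = [
    ("Neural keyphrase generation with transformers.",
     ["keyphrase generation", "transformers"]),
    ("Data augmentation for keyphrase generation.",
     ["keyphrase generation", "data augmentation"]),
    ("Protein folding with molecular dynamics.",
     ["protein folding", "molecular dynamics"]),
]
samples = augment(docs)
```

Here only the first two documents share a keyphrase, so they form the single synthetic sample; the unrelated third document is left out, which is how such a scheme would keep the augmented data within one domain.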

computational linguistic, keyphrase, proceedings, (13 more...)

doi: 10.1145/3677389.3702504

2411.03039

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(26 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Beręsewicz, Maciej, Wydmuch, Marek, Cherniaiev, Herman, Pater, Robert

Multilingual hierarchical classification of job advertisements for job vacancy statistics

arXiv.org Machine Learning Nov-6-2024

The goal of this paper is to develop a multilingual classifier and conditional probability estimator of occupation codes for online job advertisements in accordance with the International Standard Classification of Occupations (ISCO) extended with the Polish Classification of Occupations and Specializations (KZiS), which is analogous to the European Classification of Occupations. In this paper, we utilise a range of data sources, including a novel one, namely the Central Job Offers Database, which is a register of all vacancies submitted to Public Employment Offices. Their staff members code the vacancies according to the ISCO and KZiS. A hierarchical multi-class classifier has been developed based on the transformer architecture. The classifier begins by encoding the jobs found in advertisements to the widest 1-digit occupational group, and then narrows the assignment to a 6-digit occupation code. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points, particularly for the hand-coded online job advertisements. Finally, a bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software. The open-source software is provided for the benefit of the official statistics community, with a particular focus on international comparability.
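The narrowing from a 1-digit group to a 6-digit code, together with the conditional probability estimate, can be sketched with the chain rule over code prefixes. This is an illustrative sketch only: the level lengths (1, 2, 3, 4, 6), the example code, and the probability values are assumptions, and in the paper the conditional probabilities would come from the transformer-based classifier rather than a hand-written table.

```python
def code_path(code, levels=(1, 2, 3, 4, 6)):
    """Return the chain of prefixes from the broadest group to the full code.

    The level lengths are an assumption about how the hierarchy is layered.
    """
    return [code[:n] for n in levels]

def path_probability(code, cond_probs, levels=(1, 2, 3, 4, 6)):
    """Chain rule: multiply P(prefix | parent prefix) along the hierarchy."""
    prob = 1.0
    for prefix in code_path(code, levels):
        prob *= cond_probs[prefix]
    return prob

# Hypothetical conditional probabilities for one advertisement; each value is
# P(this prefix | its parent prefix), with the 1-digit group unconditioned.
cond_probs = {
    "2": 0.5,
    "25": 0.5,
    "251": 0.5,
    "2514": 1.0,
    "251401": 0.5,
}

path = code_path("251401")
prob = path_probability("251401", cond_probs)
```

With these toy values the probability of the full 6-digit code is the product of the five conditional terms, 0.0625. Structuring the estimate this way means a confident prediction at a broad level constrains which fine-grained codes remain plausible.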

advertisement, classification, dataset, (16 more...)

arXiv.org Machine Learning

2411.03779

Country:

Europe > United Kingdom (0.28)
Europe > Poland > Greater Poland Province > Poznań (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry:

Marketing (1.00)
Education (0.92)
Government > Regional Government > Europe Government (0.46)

Technology:

Information Technology > Software (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)