AITopics | mbert

Collaborating Authors

mbert

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Feng, Zixin, Cui, Xinying, Sun, Yifan, Wei, Zheng, Yuan, Jiachen, Hu, Jiazhen, Xin, Ning, Hasan, Md Maruf

arXiv.org Machine LearningMar-16-2026

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.

artificial intelligence, detection, machine learning, (16 more...)

arXiv.org Machine Learning

2603.1292

Country:

Asia > China > Shaanxi Province > Xi'an (0.05)
North America > United States > Texas > El Paso County > El Paso (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.97)
Information Technology > Security & Privacy (0.97)
Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

7945ab41f2aada1247a7c95e75cdf6c8-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing SystemsFeb-15-2026, 03:24:18 GMT

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Qatar (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(19 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(4 more...)

Add feedback

Rethinking what Matters: Effective and Robust Multilingual Realignment for Low-Resource Languages

Nguyen, Quang Phuoc, Anugraha, David, Gaschi, Felix, Cheng, Jun Bin, Lee, En-Shiun Annie

arXiv.org Artificial IntelligenceNov-11-2025

Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.06497

Country:

North America > United States (1.00)
Europe (0.92)
Asia > Middle East > UAE (0.46)
North America > Canada > Ontario (0.28)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.47)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Add feedback

On Multilingual Encoder Language Model Compression for Low-Resource Languages

Gurgurov, Daniil, Gregor, Michal, van Genabith, Josef, Ostermann, Simon

arXiv.org Artificial IntelligenceNov-7-2025

In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% while maintaining competitive performance, with average drops of 2-10% for moderate compression and 8-13% at maximum compression in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct ablation studies to identify the best practices for multilingual model compression using these techniques.

artificial intelligence, computational linguistic, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.16956

Country:

Europe (1.00)
Asia (0.93)
North America > Mexico (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)

Add feedback

Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark

Neural Information Processing SystemsOct-8-2025, 22:52:37 GMT

This is particularly true for language tasks that are culture-dependent.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Qatar (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(19 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
(2 more...)

Add feedback

Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning

Abiola, T. O., Abiodun, K. D., Olumide, O. E., Adebanji, O. O., Calvo, O. Hiram, Sidorov, Grigori

arXiv.org Artificial IntelligenceSep-25-2025

Hope speech language that fosters encouragement and optimism plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.

artificial intelligence, machine learning, speech detection, (18 more...)

arXiv.org Artificial Intelligence

2509.20315

Country:

North America > Mexico (0.69)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach

Hossen, Md. Sabbir, Saiduzzaman, Md., Shaha, Pabon

arXiv.org Artificial IntelligenceAug-6-2025

The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. Principle Component Analysis (PCA) were utilized for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an exceptional accuracy of 83.7% and outperform other model classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.

machine learning, natural language, sentiment analysis, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ECAI65401.2025.11095452

2507.11084

Country: Asia > Bangladesh (0.72)

Genre: Research Report > New Finding (0.35)

Industry:

Information Technology (0.68)
Media > News (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Abdullah, Abdulhady Abas, Gandomi, Amir H., Rashid, Tarik A, Mirjalili, Seyedali, Abualigah, Laith, Živković, Milena, Veisi, Hadi

arXiv.org Artificial IntelligenceJul-28-2025

In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.

arabic, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.18762

Country: Asia > Middle East (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings

Islam, Muhammad, Khan, Javed Ali, Abaker, Mohammed, Daud, Ali, Irshad, Azeem

arXiv.org Artificial IntelligenceJun-3-2025

--The rapid expansion of social media platforms has significantly increased the dissemination of forged content and misinformation, making the detection of fake news a critical area of research. Although fact-checking efforts predominantly focus on English-language news, there is a noticeable gap in resources and strategies to detect news in regional languages, such as Urdu. Advanced Fake News Detection (FND) techniques rely heavily on large, accurately labeled datasets. However, FND in under-resourced languages like Urdu faces substantial challenges due to the scarcity of extensive corpora and the lack of validated lexical resources. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. They also lack human verification, relying mainly on unverified English-to-Urdu translations, which compromises their reliability in practical applications. This study highlights the necessity of developing reliable, expert-verified, and domain-independent Urdu-enhanced FND datasets to improve fake news detection in Urdu and other resource-constrained languages. This paper presents the first benchmark large FND dataset for Urdu news, which is publicly available for validation and deep analysis. We also evaluate this dataset using multiple state-of-the-art pre-trained large language models (LLMs), such as XLNet, mBERT, XLM-RoBERT a, RoBERT a, DistilBERT, and DeBERT a. Additionally, we propose a unified LLM model that outperforms the others with different embedding and feature extraction techniques. The performance of these models is compared based on accuracy, F1 score, precision, recall, and human judgment for vetting the sample results of news. The proposed model outperforms advanced machine learning and deep learning models previously used in the literature for fake news detection.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.01587

Country:

Asia (1.00)
Europe > United Kingdom (0.67)

Genre:

Overview (1.00)
Research Report > New Finding (0.68)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Token Masking Improves Transformer-Based Text Classification

Xu, Xianglong, Bowen, John, Taheri, Rojin

arXiv.org Artificial IntelligenceMay-20-2025

While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token at probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis -- across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B) -- show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level smoothing acts as implicit ensembling.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.11746

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Africa (0.04)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.58)

Add feedback