AITopics | visobert

Collaborating Authors

visobert

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media

Nguyen, Dung Ha, Nguyen, Anh Thi Hoang, Van Nguyen, Kiet

arXiv.org Artificial IntelligenceSep-30-2024

This study introduces an innovative automatic labeling framework to address the challenges of lexical normalization in social media texts for low-resource languages like Vietnamese. Social media data is rich and diverse, but the evolving and varied language used in these contexts makes manual labeling labor-intensive and expensive. To tackle these issues, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach enhances the quality of training dataset and expands its size while minimizing manual labeling efforts. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms, thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when utilizing Pre-trained Language Models. The proposed framework achieves an impressive F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. Additionally, it effectively handles undiacritized text under various conditions. This framework significantly enhances natural language normalization quality and improves the accuracy of various NLP tasks, leading to an average accuracy increase of 1-3%.

lexical normalization, normalization, weakly supervised data, (12 more...)

arXiv.org Artificial Intelligence

2409.20467

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Beijing > Beijing (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Services (1.00)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model

Nguyen, Luan Thanh

arXiv.org Artificial IntelligenceJun-4-2024

Recent advancements in hate speech detection (HSD) in Vietnamese have made significant progress, primarily attributed to the emergence of transformer-based pre-trained language models, particularly those built on the BERT architecture. However, the necessity for specialized fine-tuned models has resulted in the complexity and fragmentation of developing a multitasking HSD system. Moreover, most current methodologies focus on fine-tuning general pre-trained models, primarily trained on formal textual datasets like Wikipedia, which may not accurately capture human behavior on online platforms. In this research, we introduce ViHateT5, a T5-based model pre-trained on our proposed large-scale domain-specific dataset named VOZ-HSD. By harnessing the power of a text-to-text architecture, ViHateT5 can tackle multiple tasks using a unified model and achieve state-of-the-art performance across all standard HSD benchmarks in Vietnamese. Our experiments also underscore the significance of label distribution in pre-training data on model efficacy. We provide our experimental materials for research purposes, including the VOZ-HSD dataset, pre-trained checkpoint, the unified HSD-multitask ViHateT5 model, and related source code on GitHub publicly.

dataset, detection, speech detection, (13 more...)

arXiv.org Artificial Intelligence

2405.14141

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.46)

Industry:

Information Technology (0.46)
Media > News (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Add feedback

Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts

Vo, Cuong Nhat, Huynh, Khanh Bao, Luu, Son T., Do, Trong-Hop

arXiv.org Artificial IntelligenceApr-30-2024

The growth of social networks makes toxic content spread rapidly. Hate speech detection is a task to help decrease the number of harmful comments. With the diversity in the hate speech created by users, it is necessary to interpret the hate speech besides detecting it. Hence, we propose a methodology to construct a system for targeted hate speech detection from online streaming texts from social media. We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts. The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate. There are 5 targets in the dataset, and each target is labeled with the corresponding level manually by humans with strict annotation guidelines. The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level. Then, we construct a baseline for this task by combining the Bi-GRU-LSTM-CNN with the pre-trained language model to leverage the power of text representation of BERTology. Finally, we suggest a methodology to integrate the baseline model for targeted hate speech detection into the online streaming system for practical application in preventing hateful and offensive content on social media.

dataset, detection, speech detection, (15 more...)

arXiv.org Artificial Intelligence

2404.19252

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
Europe > Italy > Tuscany > Florence (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

Nguyen, Quoc-Nam, Phan, Thang Chau, Nguyen, Duc-Vu, Van Nguyen, Kiet

arXiv.org Artificial IntelligenceOct-28-2023

English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Although Vietnam has approximately 100M people speaking Vietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, performed well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. These pre-trained language models are still limited to Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using XLM-R architecture. Moreover, we explored our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.

language model, pre-trained language model, visobert, (13 more...)

arXiv.org Artificial Intelligence

2310.11166

Country:

Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(6 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Services (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback