AITopics | zampieri

zampieri

Towards Generalized Offensive Language Identification

Dmonte, Alphaeus, Arya, Tejas, Ranasinghe, Tharindu, Zampieri, Marcos

arXiv.org Artificial Intelligence, Jul-26-2024

The prevalence of offensive content on the internet, encompassing hate speech and cyberbullying, is a pervasive issue worldwide. Consequently, it has garnered significant attention from the machine learning (ML) and natural language processing (NLP) communities. As a result, numerous systems have been developed to automatically identify potentially harmful content and to mitigate its impact. These systems can follow two approaches: (i) use publicly available models and application endpoints, including prompting large language models (LLMs), or (ii) annotate datasets and train ML models on them. However, for both approaches it remains unclear how well they generalize. Furthermore, the applicability of these systems is often questioned in off-domain and practical environments. This paper empirically evaluates the generalizability of offensive language detection models and datasets across a novel generalized benchmark: GenOffense. We answer three research questions on generalizability. Our findings will be useful in creating robust real-world offensive language detection systems.

dataset, genoffense, proceedings, (12 more...)

2407.18738

Country: North America > United States > Virginia (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.48)
Health & Medicine > Therapeutic Area (0.46)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

MultiLS-SP/CA: Lexical Complexity Prediction and Lexical Simplification Resources for Catalan and Spanish

Bott, Stefan, Saggion, Horacio, Rojas, Nelson Peréz, Salazar, Martin Solis, Ramirez, Saul Calderon

arXiv.org Artificial Intelligence, Apr-11-2024

Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents MultiLS-SP/CA, a novel dataset for lexical simplification in Spanish and Catalan. This dataset is the first of its kind for Catalan and a substantial addition to the sparse data available on automatic lexical simplification for Spanish. Specifically, MultiLS-SP is the first dataset for Spanish that includes scalar ratings of the understanding difficulty of lexical items. In addition, we describe experiments with this dataset, which can serve as a baseline for future work on the same data.

dataset, saggion, simplification, (13 more...)

2404.07814

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Bulgaria > Varna Province > Varna (0.04)
North America > United States > Maryland (0.04)
(9 more...)

Genre: Research Report (0.40)

Industry:

Government (0.67)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Muted: Multilingual Targeted Offensive Speech Identification and Visualization

Tillmann, Christoph, Trivedi, Aashka, Rosenthal, Sara, Borse, Santosh, Zhang, Rong, Sil, Avirup, Bhattacharjee, Bishwaranjan

arXiv.org Artificial Intelligence, Dec-18-2023

Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce Muted, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity. Muted can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. We present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.
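The idea of reading spans off a classifier's attention can be illustrated with a minimal sketch. It assumes one attention-derived score per token is already available and groups consecutive tokens whose score clears a threshold. The function name, the 0.5 threshold, and the example scores are hypothetical and are not taken from Muted itself.

```python
from typing import List, Tuple


def spans_from_scores(
    tokens: List[str], scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive tokens whose score is >= threshold.

    Returns (start, end) token index pairs with end exclusive.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            spans.append((start, i))  # close the open span
            start = None
    if start is not None:
        spans.append((start, len(scores)))  # span runs to the end
    return spans


# Placeholder tokens and made-up scores, purely for illustration.
tokens = ["this", "is", "a", "BAD", "WORD", "here"]
scores = [0.10, 0.05, 0.20, 0.70, 0.95, 0.15]
example_spans = spans_from_scores(tokens, scores)
highlighted = [" ".join(tokens[s:e]) for s, e in example_spans]
```

A real system would also have to map subword pieces back to words and choose which attention layer and head to read, which this sketch leaves out.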

computational linguistic, dataset, span, (13 more...)

2312.11344

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

A Twitter BERT Approach for Offensive Language Detection in Marathi

Chavan, Tanmay, Patankar, Shantanu, Kane, Aditya, Gokhale, Omkar, Joshi, Raviraj

arXiv.org Artificial Intelligence, Dec-20-2022

Automated offensive language detection is essential in combating the spread of hate speech, particularly in social media. This paper describes our work on offensive language identification in Marathi, a low-resource Indic language. The problem is formulated as a text classification task to identify a tweet as offensive or non-offensive. We evaluate different monolingual and multilingual BERT models on this classification task, focusing on BERT models pre-trained with social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation from other existing Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), with an F1 score of 98.43 on the HASOC 2022 test set. With this, we also provide a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.

artificial intelligence, machine learning, natural language, (18 more...)

2212.10039

Country:

Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
Asia > India > Maharashtra (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Predicting the Type and Target of Offensive Social Media Posts in Marathi

Zampieri, Marcos, Ranasinghe, Tharindu, Chaudhari, Mrinal, Gaikwad, Saurabh, Krishna, Prajwal, Nene, Mayuresh, Paygude, Shrunali

arXiv.org Artificial Intelligence, Nov-22-2022

The presence of offensive language on social media is very common, motivating platforms to invest in strategies to make communities safer. This includes developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0, or MOLD 2.0, and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.

artificial intelligence, machine learning, natural language, (17 more...)

2211.1257

Country:

North America > United States > New York > Monroe County > Rochester (0.04)
Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)
Asia > India > Maharashtra (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

Ranasinghe, Tharindu, North, Kai, Premasiri, Damith, Zampieri, Marcos

arXiv.org Artificial Intelligence, Nov-18-2022

The spread of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically. With the goal of carrying out a fair evaluation of these systems, several international competitions have been organized, providing the community with important benchmark data and evaluation methods for various languages. Organized since 2019, the HASOC (Hate Speech and Offensive Content Identification) shared task is one of these initiatives. In its fourth iteration, HASOC 2022 included three subtracks for English, Hindi, and Marathi. In this paper, we report the results of the HASOC 2022 Marathi subtrack, which provided participants with a dataset containing data from Twitter manually annotated using the popular OLID taxonomy. The Marathi track featured three additional subtracks, each corresponding to one level of the taxonomy: Task A - offensive content identification (offensive vs. non-offensive); Task B - categorization of offensive types (targeted vs. untargeted); and Task C - offensive target identification (individual vs. group vs. others). Overall, 59 runs were submitted by 10 teams. The best systems obtained an F1 of 0.9745 for Subtrack 3A, an F1 of 0.9207 for Subtrack 3B, and an F1 of 0.9607 for Subtrack 3C. The best performing algorithms were a mixture of traditional and deep learning approaches.
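The three annotation levels named in this abstract can be sketched as a small data structure. The label codes (OFF/NOT, TIN/UNT, IND/GRP/OTH) are the abbreviations commonly associated with OLID and are assumed here, as is the rule that level B applies only to offensive posts and level C only to targeted ones. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed label codes for each level of the taxonomy.
LEVEL_A = {"OFF", "NOT"}          # offensive vs. non-offensive
LEVEL_B = {"TIN", "UNT"}          # targeted vs. untargeted
LEVEL_C = {"IND", "GRP", "OTH"}   # individual vs. group vs. others


@dataclass(frozen=True)
class OlidLabel:
    a: str
    b: Optional[str] = None
    c: Optional[str] = None


def is_valid(label: OlidLabel) -> bool:
    """Check that a label respects the assumed level-by-level hierarchy."""
    if label.a not in LEVEL_A:
        return False
    if label.a == "NOT":
        # Non-offensive posts carry no type or target annotation.
        return label.b is None and label.c is None
    if label.b not in LEVEL_B:
        return False
    if label.b == "UNT":
        # Untargeted offense has no target annotation.
        return label.c is None
    return label.c in LEVEL_C


targeted_group = OlidLabel("OFF", "TIN", "GRP")
inconsistent = OlidLabel("NOT", "TIN")
```

Encoding the hierarchy this way makes it easy to reject inconsistent annotations before they reach training or scoring.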

artificial intelligence, machine learning, proceedings, (15 more...)

2211.10163

Country:

Asia > India (0.04)
North America > United States > Florida > Hillsborough County > University (0.04)
Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)
Asia > Singapore > Central Region > Singapore (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology > Services (1.00)
Health & Medicine (1.00)
Media > News (0.68)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

Mandl, Thomas, Modha, Sandip, Shahi, Gautam Kishore, Madhu, Hiren, Satapara, Shrey, Majumder, Prasenjit, Schaefer, Johannes, Ranasinghe, Tharindu, Zampieri, Marcos, Nandini, Durgesh, Jaiswal, Amit Kumar

arXiv.org Artificial Intelligence, Dec-16-2021

The spread of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary for supporting the moderation process at online platforms. For the evaluation of these identification tools, continuous experimentation with data sets in different languages is necessary. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The data set was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (Hate and Not Offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes, hate speech (HATE), OFFENSIVE, and PROFANITY, offered for English and Hindi. Overall, 652 runs were submitted by 65 teams. The best classification algorithms for Task A achieve F1 measures of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best performing algorithms were mainly variants of transformer architectures.
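For readers unfamiliar with the F1 measures quoted in this abstract, the following is a minimal sketch of a macro-averaged F1 for a binary task. The abstract does not say which averaging was used, so macro averaging is an assumption here, and the label strings and toy predictions are purely illustrative.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], cls: str) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives means precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    classes = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy data with illustrative labels, not real shared-task predictions.
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)  # per-class F1 is 2/3 for OFF and 4/5 for NOT
```

Macro averaging weights both classes equally, which matters when offensive posts are much rarer than non-offensive ones.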

ceur-ws, information retrieval evaluation, working note, (11 more...)

2112.09301

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Asia > India > Chandigarh (0.05)
(16 more...)

Genre: Overview (1.00)

Industry:

Information Technology > Services (0.46)
Health & Medicine > Therapeutic Area (0.31)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Gaikwad, Saurabh, Ranasinghe, Tharindu, Zampieri, Marcos, Homan, Christopher M.

arXiv.org Artificial Intelligence, Sep-8-2021

The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

dataset, identification, proceedings, (13 more...)

2109.03552

Country:

Asia > India (0.05)
North America > United States (0.04)
Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)

WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments

Morgan, Skye, Ranasinghe, Tharindu, Zampieri, Marcos

arXiv.org Artificial Intelligence, Jul-30-2021

At the same time, social media sites have increasingly become more prone to offensive content (Hada et al., 2021; Zhu and Bhat, 2021; Bucur et al., 2021). As such, identifying the toxic language in social media is a topic that has gained, and continues to gain traction. Research surrounding the problem of offensive content has centered around the application of computational models that can identify various forms of negative content such as hate speech (Malmasi and Zampieri, 2018; Nozza, 2021), abuse (Corazza et al., 2020), aggression (Kumar et al., 2018, 2020), and cyber-bullying …

… 2020). It is well-known that training large neural transformer models often result in long processing times. As GermEval-2021 features three related tasks, from a performance standpoint, we pose that training a model jointly on three tasks is likely to be computationally more efficient than training three models in isolation. Moreover, as GermEval-2021 provides a single dataset for the three tasks, MTL can also be used to help improving performance across tasks. As such, we introduce multitask learning whereby one model can predict all three tasks as an alternative approach.
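The joint-training idea in this excerpt can be sketched in a few lines: one shared representation feeds three task-specific heads, and a single loss sums the per-task errors. This is a toy illustration in plain Python, not the authors' transformer model. The shared vector stands in for an encoder output, and the task names (taken from the paper title), the head weights, and the unweighted sum of losses are all assumptions.

```python
import math
from typing import Dict, List

# Task names follow the paper title; treating each as binary is an assumption.
TASKS = ("toxic", "engaging", "fact_claiming")


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(shared: List[float], heads: Dict[str, List[float]]) -> Dict[str, float]:
    """Apply one linear head per task to the same shared representation."""
    return {
        task: sigmoid(sum(w * h for w, h in zip(heads[task], shared)))
        for task in TASKS
    }


def joint_loss(probs: Dict[str, float], labels: Dict[str, int]) -> float:
    """Sum of binary cross-entropy losses over all tasks (unweighted)."""
    total = 0.0
    for task in TASKS:
        p = min(max(probs[task], 1e-12), 1.0 - 1e-12)  # avoid log(0)
        y = labels[task]
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Hypothetical shared features and head weights for one comment.
shared = [0.0, 0.0]
heads = {"toxic": [1.0, -1.0], "engaging": [0.5, 0.5], "fact_claiming": [-1.0, 2.0]}
labels = {"toxic": 1, "engaging": 0, "fact_claiming": 1}
probs = predict(shared, heads)
loss = joint_loss(probs, labels)
```

Because all three heads read the same representation, one forward pass through the expensive shared part serves every task, which is the efficiency argument the excerpt makes.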

marco zampieri, proceedings, zampieri, (12 more...)

2108.00057

Country:

Asia > India (0.04)
North America > United States (0.04)
Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)

Genre: Research Report (0.82)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.48)
Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

Ranasinghe, Tharindu, Sarkar, Diptanu, Zampieri, Marcos, Ororbia, Alex

arXiv.org Artificial Intelligence, Apr-15-2021

In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivated the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which provided participants with a dataset containing toxic span annotations in English posts. In this paper, we present the WLV-RIT entry for SemEval-2021 Task 5. Our best performing neural transformer model achieves an F1-score of 0.68. Furthermore, we develop MUDES, an open-source framework for multilingual detection of offensive spans, based on neural transformers that detect toxic spans in texts.
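A span-level F1 such as the one reported here is commonly computed over sets of character offsets. The sketch below shows that style of scoring for a single post. Whether it matches the exact scoring behind the 0.68 figure is an assumption, as are the function name and the handling of empty spans.

```python
from typing import Iterable


def span_f1(pred_offsets: Iterable[int], gold_offsets: Iterable[int]) -> float:
    """F1 between predicted and gold character offsets for one post.

    Convention assumed here: two empty sets score 1.0, and exactly one
    empty set scores 0.0.
    """
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: gold marks characters 10-14, the prediction marks 12-16.
gold_chars = range(10, 15)
pred_chars = range(12, 17)
post_score = span_f1(pred_chars, gold_chars)  # 3 shared characters out of 5 each
```

A system-level score would then typically be the mean of this per-post value over the test set.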

marco zampieri, proceedings, zampieri, (10 more...)

2104.0463

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > West Midlands > Wolverhampton (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
