AITopics | pre-processing technique

pre-processing technique

Revisiting Pre-processing Group Fairness: A Modular Benchmarking Framework

Oldfield, Brodie, Xu, Ziqi, Kandanaarachchi, Sevvandi

arXiv.org Artificial Intelligence | Aug-22-2025

As machine learning systems become increasingly integrated into high-stakes decision-making processes, ensuring fairness in algorithmic outcomes has become a critical concern. Methods to mitigate bias typically fall into three categories: pre-processing, in-processing, and post-processing. While significant attention has been devoted to the latter two, pre-processing methods, which operate at the data level and offer advantages such as model-agnosticism and improved privacy compliance, have received comparatively less focus and lack standardised evaluation tools. In this work, we introduce FairPrep, an extensible and modular benchmarking framework designed to evaluate fairness-aware pre-processing techniques on tabular datasets. Built on the AIF360 platform, FairPrep allows seamless integration of datasets, fairness interventions, and predictive models. It features a batch-processing interface that enables efficient experimentation and automatic reporting of fairness and utility metrics. By offering standardised pipelines and supporting reproducible evaluations, FairPrep fills a critical gap in the fairness benchmarking landscape and provides a practical foundation for advancing data-level fairness research.
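To make "pre-processing at the data level" concrete, here is a minimal sketch of one well-known intervention, reweighing, which assigns each sample the weight P(group) * P(label) / P(group, label) so that group and label look statistically independent in the weighted data. This is an illustration only: the abstract does not list which interventions FairPrep includes, the function names and toy data below are hypothetical, and nothing here uses the FairPrep or AIF360 APIs.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per sample: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels (label == 1) inside one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Hypothetical toy data: group "a" receives positive labels more often than "b".
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

uniform = [1.0] * len(labels)
weights = reweighing_weights(groups, labels)

# Statistical parity gap: positive rate of "a" minus positive rate of "b".
gap_before = (
    weighted_positive_rate(groups, labels, uniform, "a")
    - weighted_positive_rate(groups, labels, uniform, "b")
)
gap_after = (
    weighted_positive_rate(groups, labels, weights, "a")
    - weighted_positive_rate(groups, labels, weights, "b")
)
```

Because the weights only modify the training data, any downstream model that accepts sample weights can use them, which is the model-agnostic property the abstract attributes to pre-processing methods.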

artificial intelligence, data mining, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2508.15193

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry: Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

Evaluating LLaMA 3.2 for Software Vulnerability Detection

Gonçalves, José, Silva, Miguel, Cabral, Bernardo, Dias, Tiago, Maia, Eva, Praça, Isabel, Severino, Ricardo, Ferreira, Luís Lino

arXiv.org Artificial Intelligence | Mar-10-2025

Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain in sufficient quantities. To address this challenge, the DiverseVul dataset was curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, during our study, several inconsistencies were identified in the raw dataset while applying pre-processing techniques, highlighting the need for a refined version. In this work, we present a refined version of the DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the use of pre-processing techniques led to an improvement in performance, with the model achieving an F1-Score of 66%, a competitive result compared to our baseline, which achieved an F1-Score of 47% in software vulnerability detection.

dataset, detection, vulnerability detection, (15 more...)

arXiv.org Artificial Intelligence

2503.0777

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Portugal > Porto > Porto (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Enhanced Text Compression Approach Using Transformer-based Language Models

Rahman, Chowdhury Mofizur, Sobhani, Mahbub E, Rodela, Anika Tasnim, Shatabda, Swakkhar

arXiv.org Artificial Intelligence | Dec-14-2024

Text compression shrinks textual data while keeping crucial information, easing constraints on storage, bandwidth, and computational efficiency. The integration of lossless compression techniques with transformer-based text decompression has received negligible attention, despite the increasing volume of English text data in communication. The primary barrier to advancing text compression and restoration is optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, which remained unresolved in prior attempts. Here, we propose a transformer-based method named RejuvenateForme for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our pre-processing technique, which incorporates the Lempel-Ziv-Welch algorithm, achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, respectively, showing state-of-the-art compression ratios compared to other deep learning and traditional approaches. Furthermore, RejuvenateForme achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, respectively, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small exhibits better performance than prior state-of-the-art models.
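For readers unfamiliar with the Lempel-Ziv-Welch algorithm mentioned above, the following is a minimal sketch of textbook LZW over bytes. It is not the paper's pre-processing pipeline, whose details the abstract does not give; the function names are hypothetical, and codes are kept as plain integers rather than packed into bits.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode bytes as a list of integer codes using textbook LZW."""
    # The table starts with every single-byte sequence (codes 0..255).
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the current match.
            current = candidate
        else:
            # Emit the longest known match and learn the new sequence.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes from LZW codes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the sequence that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(output)


sample = b"TOBEORNOTTOBEORTOBEORNOT"
encoded = lzw_compress(sample)
decoded = lzw_decompress(encoded)
```

The decoder rebuilds the same table the encoder built, so no dictionary has to be stored alongside the codes; repeated substrings such as "TOBEOR" collapse into single codes as the table grows.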

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TENSYMP61132.2024.10752239

2412.1525

Country:

North America > United States > Vermont (0.04)
Europe > Belgium (0.04)
Asia > Bangladesh (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

An Optimization Framework for Processing and Transfer Learning for the Brain Tumor Segmentation

Ren, Tianyi, Honey, Ethan, Rebala, Harshitha, Sharma, Abhishek, Chopra, Agamdeep, Kurt, Mehmet

arXiv.org Artificial Intelligence | Feb-10-2024

Tumor segmentation from multi-modal brain MRI images is a challenging task due to limited samples, high variance in shapes, and uneven distribution of tumor morphology. The performance of automated medical image segmentation has been significantly improved by recent advances in deep learning. However, model predictions have not yet reached the level of accuracy and generalizability desired for clinical use. To address the distinct problems presented in Challenges 1, 2, and 3 of BraTS 2023, we have constructed an optimization framework based on a 3D U-Net model for brain tumor segmentation. This framework incorporates a range of techniques, including various pre-processing and post-processing techniques, and transfer learning. On the validation datasets, this multi-modality brain tumor segmentation framework achieves average lesion-wise Dice scores of 0.79, 0.72, and 0.74 on Challenges 1, 2, and 3, respectively.
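As a reminder of what a Dice score measures, here is a minimal sketch of the plain Dice coefficient, 2|P ∩ T| / (|P| + |T|), on flattened binary masks. Note the assumption: the abstract reports lesion-wise Dice, which scores individual lesions before averaging; the simpler whole-mask version below is only meant to show the underlying overlap formula, and the function name and toy masks are hypothetical.

```python
def dice_score(prediction, target):
    """Plain Dice coefficient for two flat binary masks (values 0 or 1)."""
    if len(prediction) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(prediction, target))
    total = sum(prediction) + sum(target)
    if total == 0:
        # Convention chosen for this sketch: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical flattened masks: 1 marks a tumor voxel, 0 marks background.
predicted_mask = [1, 1, 0, 0, 1, 0]
reference_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(predicted_mask, reference_mask)
```

A score of 1 means the two masks coincide exactly and 0 means they share no voxels.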

loss function, segmentation, voxel, (14 more...)

arXiv.org Artificial Intelligence

2402.07008

Country:

North America > United States > Washington > King County > Seattle (0.14)
Africa > Sub-Saharan Africa (0.05)
Europe > Greece > Attica > Athens (0.04)
Europe > Belgium > Flanders (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.95)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing

Nguyen, Quoc-Nam, Phan, Thang Chau, Nguyen, Duc-Vu, Van Nguyen, Kiet

arXiv.org Artificial Intelligence | Oct-28-2023

English and Chinese, known as resource-rich languages, have witnessed the strong development of transformer-based language models for natural language processing tasks. Vietnamese is spoken by approximately 100M people, and several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, perform well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. However, these pre-trained language models are still limited on Vietnamese social media tasks. In this paper, we present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT, which is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese social media texts using the XLM-R architecture. Moreover, we evaluated our pre-trained model on five important natural language downstream tasks on Vietnamese social media texts: emotion recognition, hate speech detection, sentiment analysis, spam reviews detection, and hate speech spans detection. Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks. Our ViSoBERT model is available only for research purposes.

language model, pre-trained language model, visobert, (13 more...)

arXiv.org Artificial Intelligence

2310.11166

Country:

Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
(6 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Services (1.00)
Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Link Prediction for Wikipedia Articles as a Natural Language Inference Task

Phan, Chau-Thang, Nguyen, Quoc-Nam, Van Nguyen, Kiet

arXiv.org Artificial Intelligence | Sep-5-2023

The link prediction task is vital to automatically understanding the structure of large knowledge bases. In this paper, we present our system to solve this task at the Data Science and Advanced Analytics 2023 Competition "Efficient and Effective Link Prediction" (DSAA-2023 Competition), with a corpus containing 948,233 training instances and 238,265 instances for public testing. This paper introduces an approach to link prediction in Wikipedia articles by formulating it as a natural language inference (NLI) task. Drawing inspiration from recent advancements in natural language processing and understanding, we cast link prediction as an NLI task, wherein the presence of a link between two articles is treated as a premise, and the task is to determine whether this premise holds based on the information presented in the articles. We implemented our system based on Sentence Pair Classification for the Link Prediction for Wikipedia Articles task. Our system achieved Macro F1-scores of 0.99996 and 1.00000 on the public and private test sets, respectively. Our team UIT-NLP ranked 3rd in performance on the private test set, with a score equal to those of the first and second places. Our code is publicly available for research purposes.

link prediction, prediction, wikipedia, (11 more...)

arXiv.org Artificial Intelligence

2308.16469

Country:

North America > United States > New York > New York County > New York City (0.14)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.05)
North America > United States > California (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
(3 more...)

Improving Sentiment Analysis By Emotion Lexicon Approach on Vietnamese Texts

Doan, An Long, Luu, Son T.

arXiv.org Artificial Intelligence | Dec-3-2022

The sentiment analysis task has various applications in practice. In this task, words and phrases that represent positive and negative emotions are important. Identifying the words that express emotion in a text can improve the performance of classification models for sentiment analysis. In this paper, we propose a methodology that combines an emotion lexicon with a classification model to enhance the accuracy of the models. Our experimental results show that combining the emotion lexicon with the classification model improves model performance.
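One simple way to combine a lexicon with a classifier is to append lexicon hit counts to whatever feature vector the classifier already uses. The sketch below illustrates that idea only; the abstract does not say how the authors combine the two, and the tiny English lexicon, function names, and whitespace tokenization are all hypothetical placeholders (the paper works on Vietnamese text).

```python
# Hypothetical placeholder lexicon; a real emotion lexicon would be far larger.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "terrible", "hate"}


def lexicon_features(text):
    """Count how many tokens fall in the positive and negative word lists."""
    tokens = text.lower().split()
    positive_hits = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative_hits = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return [positive_hits, negative_hits]


def augment_features(base_features, text):
    """Append the lexicon counts to an existing feature vector."""
    return list(base_features) + lexicon_features(text)


# The base features stand in for any existing text representation.
combined = augment_features([0.1, 0.3], "bad bad service")
```

The augmented vector can then be passed to any classifier in place of the original features.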

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/IALP57159.2022.9961318

2210.02063

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
North America > United States > Louisiana (0.04)
(4 more...)

Genre:

Overview (0.68)
Research Report > New Finding (0.48)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Data pre-processing for Machine Learning in Python

#artificialintelligence | Apr-25-2022, 14:54:55 GMT

Data pre-processing refers to the steps applied to make data more suitable for data mining. In this course, we are going to focus on pre-processing techniques for machine learning. Pre-processing is the set of manipulations that transform a raw dataset so that it can be used by a machine learning model. It is necessary for making our data suitable for some machine learning models, reducing dimensionality, better identifying the relevant data, and increasing model performance. It's the most important part of a machine learning pipeline, and it can strongly affect the success of a project.
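Two of the most common manipulations of this kind are filling in missing values and rescaling a numeric column. The sketch below shows both using only the Python standard library; it is an illustration under assumptions, not material from the course, and the function names and toy column are hypothetical.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical raw column with one missing entry.
ages = [20, None, 40, 30]
ages_imputed = impute_mean(ages)
ages_scaled = standardize(ages_imputed)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and then reused on the test split, so that no information leaks from test data into the model.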

data pre-processing, machine learning, pre-processing technique, (4 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Quality > Data Cleaning (0.44)

American Sign Language Identification Using Hand Trackpoint Analysis

Bajaj, Yugam, Malhotra, Puru

arXiv.org Artificial Intelligence | Oct-24-2020

Sign language helps people with speaking and hearing disabilities communicate with others efficiently. Sign language identification is a challenging area in the field of computer vision, and recent developments have achieved near-perfect results for the task, though some challenges are yet to be solved. In this paper, we propose a novel machine-learning-based pipeline for American Sign Language identification using hand track points. We convert a hand gesture into a series of hand track point coordinates that serve as input to our system. To make the solution more efficient, we experimented with 28 different combinations of pre-processing techniques, each run on three different machine learning algorithms, namely k-Nearest Neighbours, Random Forests, and a Neural Network. Their performance was compared to determine the best pair of pre-processing scheme and algorithm. Our system achieved an accuracy of 95.66% in identifying American Sign Language gestures.
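To give a sense of what pre-processing hand track point coordinates can look like, the sketch below normalizes a set of 2D points for position and size. This is a hypothetical example: the abstract does not name the 28 pre-processing combinations, and the function name, the choice of the first point as reference, and the toy coordinates are all assumptions.

```python
import math


def normalize_trackpoints(points):
    """Shift points so the first one is the origin, then scale to unit reach.

    The first point is treated as the reference (for example, the wrist);
    that choice is an assumption made for this sketch.
    """
    origin_x, origin_y = points[0]
    shifted = [(x - origin_x, y - origin_y) for x, y in points]
    scale = max(math.hypot(x, y) for x, y in shifted)
    if scale == 0:
        # All points coincide; there is nothing to scale.
        return shifted
    return [(x / scale, y / scale) for x, y in shifted]


# Hypothetical track points in image coordinates.
raw_points = [(2, 2), (5, 6), (2, 4)]
normalized_points = normalize_trackpoints(raw_points)
```

After this step, the same gesture made closer to or farther from the camera, or in a different part of the frame, yields similar coordinates, which is useful for distance-based algorithms such as k-Nearest Neighbours.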

artificial intelligence, machine learning, neural network, (13 more...)

arXiv.org Artificial Intelligence

2010.1059

Country:

Asia > India (0.05)
North America > United States (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.64)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

📱Adversarial Attacks on SMS Spam Detectors

#artificialintelligence | Aug-19-2020, 11:02:02 GMT

Note: The methodology behind the approach discussed in this post stems from a collaborative publication between myself and Irene Anthi. Spam SMS text messages often show up unexpectedly on our phone screens. That's aggravating enough, but it gets worse. Whoever is sending you a spam text message is usually trying to defraud you. Most spam text messages don't come from another phone.

adversarial sample, machine learning, natural language, (16 more...)

#artificialintelligence

Industry:

Telecommunications (1.00)
Information Technology > Security & Privacy (0.92)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.32)
