Goto

Collaborating Authors

 cleansing


Towards Semantic Noise Cleansing of Categorical Data based on Semantic Infusion

arXiv.org Artificial Intelligence

Semantic Noise affects text analytics activities for the domain-specific industries significantly. It impedes the text understanding which holds prime importance in the critical decision making tasks. In this work, we formalize semantic noise as a sequence of terms that do not contribute to the narrative of the text. We look beyond the notion of standard statistically-based stop words and consider the semantics of terms to exclude the semantic noise. We present a novel Semantic Infusion technique to associate meta-data with the categorical corpus text and demonstrate its near-lossless nature. Based on this technique, we propose an unsupervised text-preprocessing framework to filter the semantic noise using the context of the terms. Later we present the evaluation results of the proposed framework using a web forum dataset from the automobile-domain.


Data Preparation for Machine Learning: Cleansing, Transformation & Feature Engineering

#artificialintelligence

The purpose of the Data Preparation stage is to get the data into the best format for machine learning, this includes three stages: Data Cleansing, Data Transformation, and Feature Engineering. Quality data is more important than using complicated algorithms so this is an incredibly important step and should not be skipped. During the Data Understanding activities, you explored your data and detected incomplete or incorrect values. Most machine learning models require all features to be complete, therefore, missing values must be dealt with. The simplest solution is to remove all rows that have a missing value but important information could be lost or bias introduced.