Identify, describe, plot, and remove the outliers from the dataset with R (rstats)

@machinelearnbot

In statistics, an outlier is defined as an observation that lies far away from most of the other observations. Often an outlier is present due to measurement error. Therefore, one of the most important tasks in data analysis is to identify and (if necessary) remove the outliers. There are different methods to detect outliers, including the standard deviation approach and Tukey's method, which uses the interquartile range (IQR). In this post I will use Tukey's method because I like that it does not depend on the distribution of the data.
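As a rough illustration of the IQR approach mentioned above, here is a minimal Python sketch (the post itself works in R). The function names, the sample data, and the conventional multiplier k = 1.5 are assumptions made for this example, not taken from the post. Quartile conventions differ between tools, so the exact fences may not match R's defaults.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers.

    k = 1.5 is the conventional multiplier; it is an assumption here.
    """
    # "inclusive" treats the data as the whole population when interpolating
    # quartiles; other quantile conventions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the Tukey fences."""
    low, high = tukey_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical data with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
kept, outliers = split_outliers(data)
print("outliers:", outliers)
print("kept:", kept)
```

Because the fences are built from quartiles rather than the mean and standard deviation, a single extreme value barely moves them, which is the distribution-free property the snippet refers to.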


Identify, describe, plot, and remove the outliers from the dataset

#artificialintelligence

In statistics, an outlier is defined as an observation that lies far away from most of the other observations. Often an outlier is present due to measurement error. Therefore, one of the most important tasks in data analysis is to identify and (if necessary) remove the outliers. There are different methods to detect outliers, including the standard deviation approach and Tukey's method, which uses the interquartile range (IQR). In this post I will use Tukey's method because I like that it does not depend on the distribution of the data.


Dixon's Q test for outlier identification

#artificialintelligence

I recently faced the impossible task of identifying outliers in a dataset with very, very small sample sizes, and Dixon's Q test caught my attention. Honestly, I am not a big fan of this statistical test, but since Dixon's Q test is still quite popular in certain scientific fields (e.g., chemistry), it is important to understand its principles in order to draw your own conclusions about the research data that you might stumble upon in research articles or scientific talks. Dixon's Q test [1] was "invented" as a convenient procedure to quickly identify outliers in datasets that contain only a small number of observations: typically $3 \le n \le 10$. Although (at least in my opinion) the removal of outliers is a very questionable practice, this test is quite popular in the field of chemistry to "objectively" detect and reject outliers that are due to systematic errors by the experimentalist. In my opinion, Dixon's Q test should only be used with great caution, since this simple statistic is based on the assumption that the data is normally distributed, which can be quite challenging to verify for small sample sizes (if no prior/additional information is provided).
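For readers unfamiliar with the statistic, the simplest form of Dixon's Q is the gap between a suspect value and its nearest neighbour, divided by the range of the data. The sketch below shows only that calculation. The function names and the sample values are hypothetical, and the critical value is deliberately left as a parameter: it has to be looked up in a published table for the chosen sample size and confidence level.

```python
def dixon_q(values):
    """Return (q_low, q_high): Q = gap / range for the smallest and largest value."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("Dixon's Q needs at least 3 observations")
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all observations are identical; Q is undefined")
    q_low = (data[1] - data[0]) / data_range     # gap below the second-smallest value
    q_high = (data[-1] - data[-2]) / data_range  # gap above the second-largest value
    return q_low, q_high


def is_suspect(q, q_crit):
    """Flag a value when its Q statistic exceeds the tabulated critical value.

    q_crit is not computed here; take it from a published table.
    """
    return q > q_crit


# Hypothetical measurements; the smallest value looks suspicious.
sample = [0.142, 0.153, 0.135, 0.002, 0.175]
q_low, q_high = dixon_q(sample)
print(f"Q (smallest) = {q_low:.3f}, Q (largest) = {q_high:.3f}")
```

Only the most extreme value at each end is examined, and the comparison is meaningful only under the normality assumption discussed above.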


Outlier Detection (Part 1): Univariate

#artificialintelligence

If you look closely, the last observation is clearly an outlier. Using the same dataset, we're going to show that the last observation is detected as an outlier using robust statistics. Classical statistical methods rely on (normality) assumptions, but even a single outlier can influence conclusions significantly and may lead to misleading results. Robust statistics also produce reliable results when the data contains outliers, and they yield automatic outlier detection tools. "It is perfect to use both classical and robust methods routinely, and only worry when they differ enough to matter… But when they differ, you should think hard." Let's use the same dataset that we saw as an example in the beginning.
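The snippet does not say which robust method the article uses, so the following is only one possible illustration: a modified z-score built from the median and the median absolute deviation (MAD), compared with the classical z-score. The dataset, the function names, and the cutoffs (3.5 for the modified score, 3 for the classical one) are assumptions made for this sketch.

```python
import statistics


def classical_z_scores(values):
    """z-scores based on the mean and the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def modified_z_scores(values):
    """Robust z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data are normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical data; the last observation is the suspicious one.
data = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 9.8]

classical_flags = [v for v, z in zip(data, classical_z_scores(data)) if abs(z) > 3]
robust_flags = [v for v, z in zip(data, modified_z_scores(data)) if abs(z) > 3.5]

print("classical:", classical_flags)
print("robust:", robust_flags)
```

In this sample the extreme value inflates the mean and standard deviation enough to hide itself from the classical rule, while the median and MAD are barely affected. This is the kind of disagreement between classical and robust methods that the quotation says deserves a closer look.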