Robust Subspace Outlier Detection in High Dimensional Space Machine Learning

Rare data points in a large-scale database are called outliers, and they can reveal significant information about the real world. Subspace-based outlier detection is regarded as a feasible approach in very high dimensional space. However, the outliers found in subspaces are only part of the true outliers in the high dimensional space. Outliers hidden among normally clustered points are sometimes missed in the projected subspace. In this paper, we propose a robust subspace method for detecting such inner outliers in a given dataset. The method uses two dimensional projections: the first detects outliers in subspaces using a local density ratio, and the second finds outliers by comparing the positions of neighbors. Each point's weight is calculated by summing all related values obtained in the two projection steps, and the points with the largest weights are taken as outliers. In a series of experiments with the number of dimensions ranging from 10 to 10000, our proposed method achieves high precision in extremely high dimensional space and also works well in low dimensional space.

K-NS: Section-Based Outlier Detection in High Dimensional Space Machine Learning

Finding rare information hidden in the huge amount of data on the Internet is a necessary but complex task. Many researchers have studied this issue and have found effective methods for detecting anomalous data in low dimensional space. However, as the number of dimensions increases, most of these existing methods perform poorly at detecting outliers because of the "curse of dimensionality". Even though some approaches aim to solve this problem in high dimensional space, they can only detect anomalous data that appear in low dimensional space and cannot detect all of the anomalous data that appear differently in high dimensional space. To cope with this problem, we propose a new k-nearest section-based method (k-NS) in a section-based space. Our proposed approach detects outliers in low dimensional space using the section-density ratio, and also detects outliers in high dimensional space using the ratio of the k-nearest section to the average value. In a series of experiments with the number of dimensions ranging from 10 to 10000, our proposed method achieves 100% precision and 100% recall in extremely high dimensional space, and improves on our previously proposed method in low dimensional space.
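One commonly cited reason distance-based detectors degrade as the number of dimensions grows is that distances concentrate: the nearest and the farthest points end up almost equally far from a query point. The Python sketch below illustrates that effect on uniformly random data. It is not part of the k-NS method; the function name `relative_contrast`, the sample size, and the uniform data are assumptions made purely for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one random query point.

    Hypothetical helper: points are drawn uniformly from the unit hypercube.
    A value near 0 means the nearest and farthest points are almost
    equally far away, so "nearest neighbor" carries little information.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


# Compare a low dimensional and a high dimensional setting.
contrast_low = relative_contrast(n_points=200, n_dims=2)
contrast_high = relative_contrast(n_points=200, n_dims=1000)
print(f"relative contrast in    2 dimensions: {contrast_low:.3f}")
print(f"relative contrast in 1000 dimensions: {contrast_high:.3f}")
```

On data like this, the contrast in 1000 dimensions comes out far smaller than in 2 dimensions, which is why a score built only on raw full-space distances tends to lose discriminating power there.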

How To Detect Outliers In Dataset


Handling outliers is one of the biggest and most challenging tasks in machine learning. An outlier is a data point that is distant from all other observations, that is, a data point that lies outside the overall distribution of the dataset. Now, let's understand this with the help of an example…. So, in the salary column, all employees' salaries fall within this range.

Identify, describe, plot, and remove the outliers from the dataset with R (rstats)


In statistics, an outlier is defined as an observation that lies far away from most of the other observations. Often an outlier is present due to measurement error. Therefore, one of the most important tasks in data analysis is to identify and (if necessary) remove the outliers. There are different methods for detecting outliers, including the standard deviation approach and Tukey's method, which uses the interquartile range (IQR). In this post I will use Tukey's method because it does not depend on the distribution of the data.
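As a rough illustration of Tukey's method, the Python sketch below flags values that fall outside the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The post itself works in R; the function name `tukey_outliers`, the sample values, and the choice of the "inclusive" quartile method are assumptions made here, and other quartile conventions can shift the fences slightly.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences.

    Hypothetical helper: k=1.5 is the conventional multiplier;
    a larger k flags only more extreme values.
    """
    # Quartiles via linear interpolation over the observed range.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Made-up measurements with one obviously extreme value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
flagged = tukey_outliers(sample)
print(flagged)  # [102]

# Removing the flagged values, if that is judged appropriate.
cleaned = [v for v in sample if v not in flagged]
```

Because the fences are built from quartiles rather than from the mean and standard deviation, a single extreme value has little influence on where they fall.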

Anomaly/Outlier Detection using Local Outlier Factors


Outliers are patterns in data that do not conform to the expected behavior. Detecting such patterns is of prime importance in areas such as credit card fraud and stock trading. Detecting anomalous or outlying observations is also important when training any supervised machine learning model. This brings us to two very important questions: what is a local outlier, and why use a local outlier? In a multivariate dataset where the rows are generated independently from a probability distribution, using only the centroid of the data might not be sufficient to tag all the outliers. Measures like the Mahalanobis distance might be able to identify extreme observations but won't be able to label all possible outlier observations.
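To make the idea of a local score concrete, here is a minimal Python sketch of the Local Outlier Factor computation, which compares each point's local density with that of its neighbors instead of measuring distance from a global centroid. It is a simplified illustration, not the article's code: the function name `local_outlier_factors`, the toy points, and `k=3` are assumptions, ties in neighbor distance are broken by index order, and the sketch assumes the data do not contain more than `k` exact duplicates of a point.

```python
import math


def local_outlier_factors(points, k=3):
    """Return one LOF score per point (about 1 = inlier, much larger = outlier)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data: a small tight cluster plus one isolated point (index 5).
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factors(toy_points, k=3)
most_outlying = max(range(len(scores)), key=lambda i: scores[i])
print(most_outlying, [round(s, 2) for s in scores])
```

Points inside the cluster score close to 1 because their density matches that of their neighbors, while the isolated point scores much higher because its neighbors sit in a far denser region than it does.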