Rare data in a large-scale database are called outliers that reveal significant information in the real world. The subspace-based outlier detection is regarded as a feasible approach in very high dimensional space. However, the outliers found in subspaces are only part of the true outliers in high dimensional space, indeed. The outliers hidden in normal-clustered points are sometimes neglected in the projected dimensional subspace. In this paper, we propose a robust subspace method for detecting such inner outliers in a given dataset, which uses two dimensional-projections: detecting outliers in subspaces with local density ratio in the first projected dimensions; finding outliers by comparing neighbor's positions in the second projected dimensions. Each point's weight is calculated by summing up all related values got in the two steps projected dimensions, and then the points scoring the largest weight values are taken as outliers. By taking a series of experiments with the number of dimensions from 10 to 10000, the results show that our proposed method achieves high precision in the case of extremely high dimensional space, and works well in low dimensional space.
Digital media tend to combine text and images to express richer information, especially on image hosting and online shopping websites. This trend presents a challenge in understanding the contents from different forms of information. Features representing visual information are usually sparse in high dimensional space, which makes the learning process intractable. In order to understand text and its related visual information, we present a new graphical model-based approach to discover more meaningful information in rich media. We extend the standard Latent Dirichlet Allocation (LDA) framework to learn in high dimensional feature spaces.
The simplest answer is it classifies things by drawing a line between two groups of data points. From the linked wiki "More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks." The hyperplanes being the "lines" and a high dimensional space meaning you track a bunch of types of data about the things you want to classify. Say you want to know how likely it is to rain tomorrow based on what you know today. You could record the temperature, wind speed, humidity, time of the year, number of days since it last rained, etc.
This experiment gives you a peek into how machine learning works, by visualizing high-dimensional data. It's available for anyone to try on the web. It is also open-sourced as part of TensorFlow, so that coders can use these visualization techniques to explore their own data. Built by Daniel Smilkov, Fernanda Viégas, Martin Wattenberg, and the Big Picture team at Google.
Finding rare information hidden in a huge amount of data from the Internet is a necessary but complex issue. Many researchers have studied this issue and have found effective methods to detect anomaly data in low dimensional space. However, as the dimension increases, most of these existing methods perform poorly in detecting outliers because of "high dimensional curse". Even though some approaches aim to solve this problem in high dimensional space, they can only detect some anomaly data appearing in low dimensional space and cannot detect all of anomaly data which appear differently in high dimensional space. To cope with this problem, we propose a new k-nearest section-based method (k-NS) in a section-based space. Our proposed approach not only detects outliers in low dimensional space with section-density ratio but also detects outliers in high dimensional space with the ratio of k-nearest section against average value. After taking a series of experiments with the dimension from 10 to 10000, the experiment results show that our proposed method achieves 100% precision and 100% recall result in the case of extremely high dimensional space, and better improvement in low dimensional space compared to our previously proposed method.