PIDForest: Anomaly Detection via Partial Identification
Parikshit Gopalan, Vatsal Sharan, Udi Wieder
We consider the problem of detecting anomalies in a large dataset. We propose a framework called Partial Identification, which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. Formalizing this intuition, we propose a geometric anomaly measure for a point, which we call PIDScore, that measures the minimum density of data points over all subcubes containing the point. We present PIDForest, a random-forest-based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features, and ranges for them, that are relatively uncommon in the dataset.
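To make the "minimum density over all subcubes containing the point" idea concrete, here is a minimal brute-force sketch. It is not the PIDForest algorithm, which is a random-forest heuristic; it only illustrates the score on toy data. The function name `min_density_score`, the restriction to subcubes aligned to a coarse grid on the unit cube, and the normalization (density 1.0 for the full cube) are all assumptions made for this illustration, not details taken from the paper.

```python
import itertools
import numpy as np

def min_density_score(points, x, bins=4):
    """Hypothetical helper: smallest normalized density over all
    grid-aligned boxes in [0, 1]^d that contain x. Lower = more anomalous."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(x, dtype=float)
    n, d = points.shape
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Grid cell that holds x along each axis.
    cell = np.minimum((x * bins).astype(int), bins - 1)
    # Per axis: every contiguous bin range [lo, hi) that covers that cell.
    per_axis = [
        [(lo, hi) for lo in range(c + 1) for hi in range(c + 1, bins + 1)]
        for c in cell
    ]
    best = np.inf
    for box in itertools.product(*per_axis):
        lo = np.array([edges[a] for a, _ in box])
        hi = np.array([edges[b] for _, b in box])
        # Half-open boxes, except that the top face at 1.0 is closed.
        inside = np.all((points >= lo) & ((points < hi) | (hi == 1.0)), axis=1)
        volume = np.prod(hi - lo)
        # Assumed normalization: fraction of points divided by box volume.
        density = inside.sum() / (n * volume)
        best = min(best, density)
    return best

# Toy data: a tight cluster near the origin plus one isolated point.
cluster = [[0.05 + 0.02 * i, 0.05 + 0.02 * j] for i in range(3) for j in range(3)]
data = cluster + [[0.9, 0.9]]

outlier_score = min_density_score(data, [0.9, 0.9])
inlier_score = min_density_score(data, [0.05, 0.05])
print(outlier_score, inlier_score)
```

The isolated point sits inside a large box that contains no other data, so its minimum density is much lower than that of a cluster point. The box that attains the minimum also names the features and ranges that isolate the point, which is the kind of explanation the abstract describes. The enumeration grows exponentially with the number of dimensions, so this sketch is only practical for tiny examples.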
Dec-7-2019
- Country:
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- South America
- Genre:
- Research Report (0.50)
- Industry:
- Information Technology (0.46)
- Technology: