Data mining, sometimes called knowledge discovery from data (KDD), is simply the discovery of patterns in data. With its growing use cases in all fields, it has evolved from a module within Information Technology into a science of its own. This article is the first in a series I plan to publish on data mining, starting from simple steps and moving toward much deeper concepts. In the modern world, with the social web, data is generated at a very high rate. Data alone does not make any sense unless its elements are found to be related in some manner (a pattern).
Designed in our research lab specifically for big data, the new and simple synthetic metric proposed in this article, strong correlation, should be used whenever you want to check whether there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of reckless data science and even being sued for wrongful analytic practice. In this paper, the traditional correlation is referred to as the weak correlation, as it captures only a small part of the association between two variables: relying on weak correlation leads to spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our strong correlation (with a value between 0 and 1) is high (say above 0.80) only if the weak correlation is high (in absolute value) and the internal structures (auto-dependencies) of the two variables X and Y that you want to compare also exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter a (with a = 0 corresponding to weak correlation, and a = 1 being the recommended value for strong correlation).
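The two ingredients named above, the weak (traditional) correlation and the correlogram of each variable, can be sketched as follows. This is only an illustration of those ingredients: the formula that combines them into the strong correlation, and the role of the parameter a, are not given in this summary, so they are deliberately left out. The function names `pearson` and `correlogram` are hypothetical, and the autocorrelation estimator used is the common textbook one, which is an assumption rather than the author's stated choice.

```python
from math import sqrt


def pearson(x, y):
    """Traditional ("weak") correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: undefined (division by zero) if either sequence is constant.
    return sxy / sqrt(sxx * syy)


def correlogram(x, max_lag):
    """Lag-k autocorrelations of x for k = 1..max_lag (textbook estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / var
        for k in range(1, max_lag + 1)
    ]


# Illustrative data only: y is a rescaled copy of x, so both the weak
# correlation and the two correlograms agree.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 16]
weak = pearson(x, y)
acf_x = correlogram(x, 3)
acf_y = correlogram(y, 3)
```

Comparing `acf_x` with `acf_y` lag by lag gives a rough sense of whether the two variables share a similar internal structure, which is the extra condition the strong correlation is described as imposing on top of a high weak correlation.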
Nonparametric correlations such as Spearman's rank correlation and Kendall's tau correlation are widely applied in scientific and engineering fields. This paper investigates the problem of computing nonparametric correlations on the fly for streaming data. Standard batch algorithms are generally too slow to handle real-world big data applications. They also require too much memory because all the data need to be stored in the memory before processing. This paper proposes a novel online algorithm for computing nonparametric correlations. The algorithm has O(1) time complexity and O(1) memory cost and is quite suitable for edge devices, where only limited memory and processing power are available. You can seek a balance between speed and accuracy by changing the number of cutpoints specified in the algorithm. The online algorithm can compute the nonparametric correlations 10 to 1,000 times faster than the corresponding batch algorithm, and it can compute them based either on all past observations or on fixed-size sliding windows.
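For reference, a minimal sketch of the batch computations that the abstract contrasts with its online algorithm is shown below. It is not the paper's cutpoint-based method, whose details are not given here. It is plain-Python Spearman's rho (Pearson correlation of average ranks) and Kendall's tau-a (no tie correction), and the function names are hypothetical. Both need the full sample in memory, and the tau loop is quadratic in the sample size, which illustrates the cost the abstract describes.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Every new observation in a stream would force both functions to be rerun over the whole stored sample, which is the motivation for an online algorithm with constant time and memory per update.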
Wang, Yisen (Tsinghua University) | Romano, Simone (University of Melbourne) | Nguyen, Vinh (University of Melbourne) | Bailey, James (University of Melbourne) | Ma, Xingjun (University of Melbourne) | Xia, Shu-Tao (Tsinghua University)
Correlation measures are a key element of statistics and machine learning, and essential for a wide range of data analysis tasks. Most existing correlation measures are for pairwise relationships, but real-world data can also exhibit complex multivariate correlations, involving three or more variables. We argue that multivariate correlation measures should be comparable, interpretable, scalable and unbiased. However, no existing measures satisfy all these requirements. In this paper, we propose an unbiased multivariate correlation measure, called UMC, which satisfies all the above criteria. UMC is a cumulative entropy based non-parametric multivariate correlation measure, which can capture both linear and non-linear correlations for groups of three or more variables. It employs a correction for chance using a statistical model of independence to address the issue of bias. UMC has high interpretability and we empirically show it outperforms state-of-the-art multivariate correlation measures in terms of statistical power, as well as for use in both subspace clustering and outlier detection tasks.
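The abstract does not define UMC itself, so the sketch below covers only its stated building block, cumulative entropy, using the usual empirical estimator on a sorted sample: minus the sum over i = 1..n-1 of (x_(i+1) - x_(i)) * (i/n) * log(i/n). Treat the choice of estimator as an assumption; the conditional terms, the normalization, and the correction for chance that UMC adds are not reproduced here. The function name is hypothetical.

```python
from math import log


def cumulative_entropy(sample):
    """Empirical cumulative entropy of a one-dimensional sample.

    Approximates -integral of F(x) * log F(x) dx, with F replaced by the
    empirical CDF, which is piecewise constant between sorted observations.
    """
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        p = i / n  # empirical CDF value on the interval [xs[i-1], xs[i])
        total += (xs[i] - xs[i - 1]) * p * log(p)
    return -total
```

Unlike Shannon entropy, this quantity is computed directly from the data without density estimation or binning, it is zero for a constant sample, and it scales linearly when the sample is multiplied by a positive constant.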
At TDWI's recent Executive Summit in San Diego, Mark Madsen posed a provocative question: is there a statistically significant correlation between sales of beer and sales of diapers, or has the correlation been misused? Madsen, a research analyst with information management consultancy Third Nature, wasn't strictly interested in answering this question, although he did. His presentation, aptly titled "Beer, Diapers, and Correlation: A Tale of Ambiguity," traced the origin and evolution of the claim that sales of beer and diapers are closely correlated. He wanted to look at the ways this claimed correlation has been used, and misused, since its discovery.