association measure
Absolute Neighbour Difference based Correlation Test for Detecting Heteroscedastic Relationships
Detecting complicated data relationships thoroughly remains a challenge. Here, we propose a new statistical measure, named the absolute neighbour difference based neighbour correlation coefficient, which detects associations between variables by examining the heteroscedasticity of the unpredictable variation of dependent variables. Unlike previous studies, the new method concentrates on measuring nonfunctional relationships rather than functional or mixed associations. Used alone or in combination with other measures, it provides a convenient test of heteroscedasticity and also allows functional and nonfunctional relationships to be measured separately, which leads to deeper insight into data associations. The method is concise and easy to implement. It does not rely on explicitly estimating regression residuals or the dependencies between variables, so it is not restricted to any particular model assumption. The mechanisms of the correlation test are proved in theory and demonstrated with numerical analyses.
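The abstract does not give the formula of the coefficient, so the following is only a rough illustration of the general idea, not the authors' statistic. It assumes one simple way to probe heteroscedasticity without fitting a regression: sort the sample by x, take absolute differences between neighbouring y values (which largely cancels a smooth functional trend), and rank-correlate those differences with x. The function names and the choice of Spearman correlation are assumptions made for this sketch.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(u, v):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 0.0 if su == 0 or sv == 0 else cov / (su * sv)


def abs_neighbour_differences(x, y):
    """Sort by x, then return (midpoints of neighbouring x, |difference of neighbouring y|)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    mids = [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
    diffs = [abs(ys[i + 1] - ys[i]) for i in range(len(ys) - 1)]
    return mids, diffs


def heteroscedasticity_score(x, y):
    """Hypothetical score: Spearman correlation between x and the absolute
    neighbour differences of y. Values far from 0 suggest that the spread
    of y changes monotonically with x."""
    mids, diffs = abs_neighbour_differences(x, y)
    return pearson(average_ranks(mids), average_ranks(diffs))
```

Under these assumptions the score only picks up a spread that grows or shrinks monotonically with x; a non-monotone variance pattern would need a different second-stage measure.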
Sparse minimum Redundancy Maximum Relevance for feature selection
Naylor, Peter, Poignard, Benjamin, Climente-González, Héctor, Yamada, Makoto
We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where the parameters estimated as zero coefficients represent the set of inactive features. We establish the conditions under which zero coefficients are correctly identified to guarantee accurate recovery of inactive features. We introduce a multi-stage procedure based on the knockoff filter enabling the penalized mRMR to discard inactive features while controlling the false discovery rate (FDR). Our method performs comparably to HSIC-LASSO but is more conservative in the number of selected features. It only requires setting an FDR threshold, rather than specifying the number of features to retain. The effectiveness of the method is illustrated through simulations and real-world datasets. The code to reproduce this work is available on the following GitHub: https://github.com/PeterJackNaylor/SmRMR.
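For context, the classic (greedy) mRMR that the penalized method builds on can be sketched as follows. This is not the sparse, knockoff-based procedure of the paper. It is a minimal illustration of the relevance-minus-redundancy criterion, and it assumes absolute Pearson correlation as the association measure, whereas classic mRMR typically uses mutual information. The names `greedy_mrmr` and `abs_pearson` are hypothetical.

```python
import math


def abs_pearson(u, v):
    """Absolute Pearson correlation, used here as a stand-in association measure."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 0.0 if su == 0 or sv == 0 else abs(cov / (su * sv))


def greedy_mrmr(features, target, k, assoc=abs_pearson):
    """Select up to k feature names from `features` (dict: name -> list of values).

    At each step, pick the feature maximising
        relevance(feature, target) - mean redundancy(feature, already selected).
    """
    relevance = {name: assoc(col, target) for name, col in features.items()}
    selected = []
    remaining = list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(assoc(features[name], features[s]) for s in selected)
        return relevance[name] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that this greedy form needs the number of features `k` up front, which is the requirement the abstract says the FDR-based procedure avoids.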
An Interpretable Measure for Quantifying Predictive Dependence between Continuous Random Variables -- Extended Version
Assunção, Renato, Figueiredo, Flávio, Júnior, Francisco N. Tinoco, de Sá-Freire, Léo M., Silva, Fábio
A fundamental task in statistical learning is quantifying the joint dependence or association between two continuous random variables. We introduce a novel, fully non-parametric measure that assesses the degree of association between continuous variables $X$ and $Y$, capable of capturing a wide range of relationships, including non-functional ones. A key advantage of this measure is its interpretability: it quantifies the expected relative loss in predictive accuracy when the distribution of $X$ is ignored in predicting $Y$. This measure is bounded within the interval [0,1] and is equal to zero if and only if $X$ and $Y$ are independent. We evaluate the performance of our measure on over 90,000 real and synthetic datasets, benchmarking it against leading alternatives. Our results demonstrate that the proposed measure provides valuable insights into underlying relationships, particularly in cases where existing methods fail to capture important dependencies.
Efficient Computation of Sparse and Robust Maximum Association Estimators
Pfeiffer, Pia, Alfons, Andreas, Filzmoser, Peter
Although robust statistical estimators are less affected by outlying observations, their computation is usually more challenging. This is particularly the case in high-dimensional sparse settings. The availability of new optimization procedures, mainly developed in the computer science domain, offers new possibilities for the field of robust statistics. This paper investigates how such procedures can be used for robust sparse association estimators. The problem can be split into a robust estimation step followed by an optimization for the remaining decoupled, (bi-)convex problem. A combination of the augmented Lagrangian algorithm and adaptive gradient descent is implemented to also include suitable constraints for inducing sparsity. We provide results concerning the precision of the algorithm and show the advantages over existing algorithms in this context. High-dimensional empirical examples underline the usefulness of this procedure. Extensions to other robust sparse estimators are possible.
Neural Bandits for Data Mining: Searching for Dangerous Polypharmacy
Larouche, Alexandre, Durand, Audrey, Khoury, Richard, Sirois, Caroline
Polypharmacy, most often defined as the simultaneous consumption of five or more drugs, is a prevalent phenomenon in the older population. Some of these polypharmacies, deemed inappropriate, may be associated with adverse health outcomes such as death or hospitalization. Considering the combinatorial nature of the problem as well as the size of claims databases and the cost of computing an exact association measure for a given drug combination, it is impossible to investigate every possible combination of drugs. Therefore, we propose to optimize the search for potentially inappropriate polypharmacies (PIPs). To this end, we propose the OptimNeuralTS strategy, based on Neural Thompson Sampling and differential evolution, to efficiently mine claims datasets and build a predictive model of the association between drug combinations and health outcomes. We benchmark our method using two datasets generated by an internally developed simulator of polypharmacy data containing 500 drugs and 100 000 distinct combinations. Empirically, our method can detect up to 72% of PIPs while maintaining an average precision score of 99% using 30 000 time steps.
Discovering Association with Copula Entropy
Association (or dependence) is a statistical tool for measuring the relationships between random variables of real systems [1]. Correlation, the linear version of association, is the most commonly used measure in real applications, while statistical dependence covers a much broader range of associations than correlation does, including nonlinear cases. Another closely related concept, causality, is defined for causal relationships in physical, social and biological systems. Although it is well known that association does not imply causation, association is still in general a necessary condition for causality. Association and causality are of significant importance in healthcare and medicine [1, 2]. In medical research, association is widely used as first evidence for scientific discoveries. Causality is fundamental in all branches of medicine: clinicians diagnose based on symptom-disease relationships, pharmacologists find drugs according to the drugs' effect on disease, and epidemiologists study how environmental factors affect
Ma Jian is with Hitachi (China) Research & Development Corporation, Beijing 100084, China.
The mRMR variable selection method: a comparative study for functional data
Berrendero, José R., Cuevas, Antonio, Torrecilla, José L.
The use of variable selection methods is particularly appealing in statistical problems with functional data. The obvious general criterion for variable selection is to choose the 'most representative' or 'most relevant' variables. However, it is also clear that a purely relevance-oriented criterion could lead to selecting many redundant variables. The mRMR (minimum Redundance Maximum Relevance) procedure, proposed by Ding and Peng (2005) and Peng et al. (2005), is an algorithm to systematically perform variable selection, achieving a reasonable trade-off between relevance and redundancy. In its original form, this procedure is based on the use of the so-called mutual information criterion to assess relevance and redundancy. Keeping the focus on functional data problems, we propose here a modified version of the mRMR method, obtained by replacing the mutual information by the new association measure (called distance correlation) suggested by Székely et al. (2007). We have also performed an extensive simulation study, including 1600 functional experiments (100 functional models $\times$ 4 sample sizes $\times$ 4 classifiers) and three real-data examples aimed at comparing the different versions of the mRMR methodology. The results are quite conclusive in favor of the newly proposed alternative.
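To make the association measure concrete, the sample distance correlation of Székely et al. (2007) for two univariate samples can be sketched as below. This is a plain O(n²) illustration using absolute differences as distances. It is not the functional-data pipeline of the paper, and the function names are chosen for this example only.

```python
import math


def _double_centred_distances(values):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _double_centred_distances(x)
    b = _double_centred_distances(y)
    # Squared sample distance covariance and variances.
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar2_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar2_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)
```

Unlike Pearson correlation, this quantity is positive for a symmetric quadratic relationship such as y = x² on x = -2, ..., 2, which is the kind of nonlinear dependence that motivates using it in place of a linear measure.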
Nominal Association Vector and Matrix
Huang, Wenxue, Shi, Yong, Wang, Xiaogang
Nominal data are quite common in scientific and engineering research, for example in biomedical research, consumer behavior analysis, network analysis and search engine marketing optimization. When the population is cross-classified and there is no natural ordering for observed outcomes, association analysis as described in Han and Kamber (2006) can be described by nominal association measures. Even if the categorical variables collected in these studies are ordinal, they are often treated as nominal if the ordering is not of interest or a natural and meaningful metric is difficult to establish. When the response variable is multinomial, classical probabilistic measures such as the odds ratio or relative risk are difficult to use due to the multiple levels in the response variable. Instead, the principle of optimal (conditional mode based) or proportional (conditional Monte-Carlo based) prediction can be used to construct nonparametric nominal association measures. For example, Goodman and Kruskal (1954) and others proposed some local-to-global association measures towards optimal predictions. The proportional associations between variables are probabilistically and statistically intrinsic. They reflect the probabilistic averaging effects of input on output distributions. There are quite a few proportional association measures proposed in the literature (cf.