significant combination
Reviews: Finding significant combinations of features in the presence of categorical covariates
The paper is well written and clearly organized, especially where it motivates the problem and states its goals. Aiming to find all feature subsets for which a statistical association test rejects the null hypothesis, while correcting for a confounding categorical covariate, is a novel point of view. The authors' results demonstrate that, within the scope of the paper, FACS retains the computational efficiency, statistical power, and ability to correct for multiple hypothesis testing of existing methods. The introduction of the branch-and-bound algorithm for conditional statistical association tests is novel and well explained. Along with the results, the authors provide good intuition behind the methods and their key aspects, such as the appropriate testability criterion and the pruning criterion for the CMH test.
SQL to SARIMAX: How I navigate the first time-series analysis personal project for my portfolio
The diagnostics plot for this particular model shows a reasonably good fit. When used for prediction, it followed the real trend closely. And since our focus is on the estimate (coefficient) of the bool_promotion variable, I considered this model good enough to be used in our analysis. As we can see from the model summary, our bool_promotion variable is significant, meaning it is shown to affect sales of grocery I at store 1, and in this case, positively. Having promotions added more than 500 units to the sales for this given combination. Having figured out the pipeline through these steps, I automated the process for other store-city-product combinations with auto_arima(), which identifies the best-fitting set of orders and lets us record these orders along with the coefficients. First, I created a helper function to identify the necessary parameters and fit auto_arima(). One parameter that appeared tricky to me was m, the period for seasonal differencing.
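A helper of the kind described above might look like the sketch below. It is not the author's actual code: the function names `seasonal_period` and `fit_promotion_model` and the frequency-to-m mapping are assumptions made for illustration, and the mapping follows the common convention of m = 7 for daily, 52 for weekly, 12 for monthly, and 4 for quarterly data. The sketch also assumes a recent pmdarima release, where exogenous regressors are passed as `X`.

```python
# Hypothetical mapping from a pandas-style frequency alias to the seasonal
# period m. These are common conventions, not values taken from the post.
SEASONAL_PERIODS = {"D": 7, "W": 52, "M": 12, "Q": 4}


def seasonal_period(freq: str) -> int:
    """Return a conventional m for the sampling frequency (1 = non-seasonal)."""
    return SEASONAL_PERIODS.get(freq.upper(), 1)


def fit_promotion_model(sales, promotion, freq="D"):
    """Fit auto_arima with a promotion flag as exogenous regressor.

    `sales` is a 1-D series of unit sales; `promotion` is a 2-D array-like of
    shape (n_samples, 1) holding the boolean promotion indicator.
    Returns the selected orders and the fitted parameters.
    """
    import pmdarima as pm  # third-party; imported lazily so the helper above works without it

    m = seasonal_period(freq)
    model = pm.auto_arima(
        sales,
        X=promotion,            # exogenous regressor(s)
        seasonal=m > 1,         # only search seasonal orders when m implies a season
        m=m,
        stepwise=True,          # stepwise search instead of a full grid
        suppress_warnings=True,
        error_action="ignore",  # skip candidate orders that fail to fit
    )
    return {
        "order": model.order,
        "seasonal_order": model.seasonal_order,
        "params": model.params(),
    }
```

For daily sales, `seasonal_period("D")` gives 7, i.e. a weekly cycle. Whether that is the right season for a given store-product series is a judgment call that the diagnostics should confirm.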
Finding Significant Combinations of Continuous Features
Sugiyama, Mahito, Borgwardt, Karsten M.
This problem is relevant in a broad range of applications including natural language processing, statistical genetics, and healthcare. To date, this problem of feature selection (Guyon and Elisseeff, 2003) has been extensively studied in machine learning, including the recent advances in selective inference (Taylor and Tibshirani, 2015), a technique that can assess the statistical significance of features selected by linear models such as the Lasso (Lee et al., 2016). However, current approaches have a crucial limitation: they can only find single features or linear combinations of features, and it is still an open problem to find patterns, that is, combinations of features with multiplicative effect. A relevant line of research towards this goal is significant pattern mining (Llinares-López et al., 2015; Papaxanthos et al., 2016; Terada et al., 2013), which tries to find statistically associated feature combinations while controlling the family-wise error rate (FWER), that is, the probability of detecting one or more false positive patterns. However, all existing methods for significant pattern mining apply only to combinations of binary or discrete features, and none of them can handle real-valued data, although such data is common in many applications. If we binarize the data beforehand in order to use significant pattern mining approaches, the resulting binarization-based method cannot distinguish correlated from uncorrelated features (see Figure 1 for an example). Subgroup discovery (Atzmueller, 2015; Herrera et al., 2011; Novak et al., 2009) has the same goal of finding associated feature combinations, but the existing methods are also designed for discrete data, which means that binarization is required (Grosskreutz and Rüping, 2009) for real-valued data and the above problem still exists.
To date, there is no method that can find all combinations of continuous features that are significantly associated with an output variable and that accounts for the inherent multiple testing problem.
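To give a sense of the scale of the multiple testing problem, the sketch below applies a plain Bonferroni correction over all non-empty feature combinations. This is a generic textbook baseline, not the method of any of the cited papers; the function names and the p-values in the usage example are made up for illustration.

```python
def num_combinations(n_features: int) -> int:
    """Number of non-empty feature subsets of d features: 2^d - 1."""
    return 2 ** n_features - 1


def bonferroni_threshold(alpha: float, n_features: int) -> float:
    """Per-test significance threshold that keeps the FWER at or below alpha
    when every non-empty feature combination is tested."""
    return alpha / num_combinations(n_features)


def significant_combinations(p_values: dict, alpha: float, n_features: int) -> list:
    """Return the combinations whose p-value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_features)
    return sorted(combo for combo, p in p_values.items() if p <= threshold)


# Illustrative p-values for three features a, b, c (invented numbers).
example = {("a",): 0.03, ("a", "b"): 0.001, ("b", "c"): 0.2}
hits = significant_combinations(example, alpha=0.05, n_features=3)
```

With three features there are 7 combinations, so the threshold is 0.05 / 7, and only `("a", "b")` passes. With 10 features there are already 1023 combinations, so the per-test threshold shrinks exponentially as features are added.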