McNicholas, Paul D.
Robust Asymmetric Clustering
Morris, Katherine, McNicholas, Paul D., Punzo, Antonio, Browne, Ryan P.
Contaminated mixture models are developed for model-based clustering of data with asymmetric clusters as well as spurious points, outliers, and/or noise. Specifically, we introduce a mixture of contaminated shifted asymmetric Laplace distributions and a mixture of contaminated skew-normal distributions. In each case, mixture components have a parameter controlling the proportion of bad points (i.e., spurious points, outliers, and/or noise) and one specifying the degree of contamination. A very important feature of our approaches is that these parameters do not have to be specified a priori. Expectation-conditional maximization algorithms are outlined for parameter estimation and the number of components is selected using the Bayesian information criterion. The performance of our approaches is illustrated on artificial and real data.
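The contaminated-component idea above can be sketched as a two-part scale mixture. The sketch below uses a Gaussian kernel purely for illustration (the paper's components are shifted asymmetric Laplace and skew-normal); the function name and the parameterization with a "good-point" proportion and an inflation factor are assumptions following the usual contaminated-normal convention.

```python
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_pdf(x, mu, Sigma, alpha_good, eta):
    """Illustrative contaminated density: a mixture of a 'good' component
    and a 'bad' component with the same location but inflated scale.

    alpha_good : proportion of good points, so 1 - alpha_good is the
                 proportion of bad points referred to in the abstract.
    eta        : degree of contamination (scale inflation, eta > 1).
    """
    good = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    bad = multivariate_normal.pdf(x, mean=mu, cov=eta * Sigma)
    return alpha_good * good + (1.0 - alpha_good) * bad
```

In the paper both alpha and eta are estimated within the ECM algorithm rather than fixed in advance, which is the feature the abstract highlights.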
Families of Parsimonious Finite Mixtures of Regression Models
Dang, Utkarsh J., McNicholas, Paul D.
Finite mixtures of regression models offer a flexible framework for investigating heterogeneity in data with functional dependencies. These models can be conveniently used for unsupervised learning on data with clear regression relationships. We extend such models by imposing an eigen-decomposition on the multivariate error covariance matrix. By constraining parts of this decomposition, we obtain families of parsimonious mixtures of regressions and mixtures of regressions with concomitant variables. These families of models account for correlations between multiple responses. An expectation-maximization algorithm is presented for parameter estimation and performance is illustrated on simulated and real data.
Parsimonious Shifted Asymmetric Laplace Mixtures
Franczak, Brian C., McNicholas, Paul D., Browne, Ryan P., Murray, Paula M.
A family of parsimonious shifted asymmetric Laplace mixture models is introduced. We extend the mixture of factor analyzers model to the shifted asymmetric Laplace distribution. Imposing constraints on the constituent parts of the resulting decomposed component scale matrices leads to a family of parsimonious models. An explicit two-stage parameter estimation procedure is described, and the Bayesian information criterion and the integrated completed likelihood are compared for model selection. This novel family of models is applied to real data, where it is compared to its Gaussian analogue within clustering and classification paradigms.
Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions
Subedi, Sanjeena, McNicholas, Paul D.
Parameter estimation for model-based clustering using a finite mixture of normal inverse Gaussian (NIG) distributions is achieved through variational Bayes approximations. Univariate NIG mixtures and multivariate NIG mixtures are considered. The use of variational Bayes approximations here is a substantial departure from the traditional EM approach and alleviates some of the associated computational complexities and uncertainties. Our variational algorithm is applied to simulated and real data. The paper concludes with discussion and suggestions for future work.
Mixtures of Common Skew-t Factor Analyzers
Murray, Paula M., McNicholas, Paul D., Browne, Ryan P.
A mixture of common skew-t factor analyzers model is introduced for model-based clustering of high-dimensional data. By assuming common component factor loadings, this model allows clustering to be performed in the presence of a large number of mixture components or when the number of dimensions is too large to be well-modelled by the mixtures of factor analyzers model or a variant thereof. Furthermore, assuming that the component densities follow a skew-t distribution allows robust clustering of skewed data. The alternating expectation-conditional maximization algorithm is employed for parameter estimation. We demonstrate excellent clustering performance when our model is applied to real and simulated data. This paper marks the first time that skewed common factors have been used.
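The parsimony from common loadings comes from the component covariance structure: one p x q loading matrix is shared by all components, so the per-component cost does not grow with the data dimension p. A rough sketch, under the common-factor-analyzers form of the covariance (an assumption about the exact formulation; the function name is hypothetical):

```python
import numpy as np

def common_loadings_cov(Lambda, Omega_g, psi_g):
    """Component covariance under common factor loadings:
        Sigma_g = Lambda @ Omega_g @ Lambda.T + diag(psi_g),
    where Lambda (p x q) is shared across all components while the
    q x q factor covariance Omega_g and the noise variances psi_g may
    vary. Sharing Lambda keeps the parameter count manageable when the
    number of dimensions p or the number of components is large."""
    return Lambda @ Omega_g @ Lambda.T + np.diag(psi_g)
```

In the paper the factors themselves follow a skew-t distribution, which is what the Gaussian-looking sketch above omits.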
Standardizing Interestingness Measures for Association Rules
Shaikh, Mateen, McNicholas, Paul D., Antonie, M. Luiza, Murphy, T. Brendan
Interestingness measures provide information that can be used to prune or select association rules. A given value of an interestingness measure is often interpreted relative to the overall range of the values that the interestingness measure can take. However, properties of individual association rules restrict the values an interestingness measure can achieve. An interestingness measure can be standardized to take this into account, but this has only been done for one interestingness measure to date, i.e., the lift. Standardization provides greater insight than the raw value and may even alter researchers' perception of the data. We derive standardized analogues of three interestingness measures and use real and simulated data to compare them to their raw versions, each other, and the standardized lift.
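Standardization of the lift can be sketched from the Fréchet bounds: given the marginal supports P(A) and P(B), the joint support P(A and B) is constrained to lie between max(P(A)+P(B)-1, 0) and min(P(A), P(B)), which bounds the achievable lift. The sketch below ignores any minimum-support or minimum-confidence thresholds, which tighten these bounds in the paper's full treatment; the function name is an assumption.

```python
def standardized_lift(pA, pB, pAB):
    """Rescale lift to [0, 1] using the range it can actually achieve
    given the rule's margins. lift = P(A&B) / (P(A) P(B)); the Frechet
    bounds on P(A&B) give the attainable maximum and minimum lift."""
    lift = pAB / (pA * pB)
    upper = min(pA, pB) / (pA * pB)               # P(A&B) at most min(pA, pB)
    lower = max(pA + pB - 1.0, 0.0) / (pA * pB)   # P(A&B) at least pA+pB-1
    return (lift - lower) / (upper - lower)
```

For example, two independent items with pA = pB = 0.5 have raw lift 1, but standardized lift 0.5, because the margins permit lifts anywhere between 0 and 2.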
Mixtures of Shifted Asymmetric Laplace Distributions
Franczak, Brian C., Browne, Ryan P., McNicholas, Paul D.
A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the generalized inverse Gaussian distribution. This approach is mathematically elegant and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future work.
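The relationship exploited by the EM variant is the normal variance-mean mixture representation of the shifted asymmetric Laplace distribution: conditional on an exponential latent variable, the data are Gaussian, and the latent variable's conditional distribution is generalized inverse Gaussian. A minimal sampling sketch of that representation (function name and signature are assumptions):

```python
import numpy as np

def sample_sal(n, mu, beta, Sigma, seed=None):
    """Draw from a shifted asymmetric Laplace via its normal
    variance-mean mixture representation:
        X = mu + W * beta + sqrt(W) * L @ Z,
    with W ~ Exp(1), Z ~ N(0, I), and L the Cholesky factor of Sigma.
    Treating W as missing data is what makes EM tractable here: the
    needed conditional expectations of W given X are available through
    the generalized inverse Gaussian distribution."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    W = rng.exponential(1.0, size=n)        # latent mixing variable
    Z = rng.standard_normal((n, p))
    L = np.linalg.cholesky(Sigma)
    return mu + W[:, None] * beta + np.sqrt(W)[:, None] * (Z @ L.T)
```

Since E[W] = 1, the mean of X is mu + beta, so beta controls the skewness direction as well as shifting the centre away from the mode at mu.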
A LASSO-Penalized BIC for Mixture Model Selection
Bhattacharya, Sakyajit, McNicholas, Paul D.
The efficacy of family-based approaches to mixture model-based clustering and classification depends on the selection of parsimonious models. Current wisdom suggests the Bayesian information criterion (BIC) for mixture model selection. However, the BIC has well-known limitations, including a tendency to overestimate the number of components as well as a proclivity for, often drastically, underestimating the number of components in higher dimensions. While the former problem might be soluble through merging components, the latter is impossible to mitigate in clustering and classification applications. In this paper, a LASSO-penalized BIC (LPBIC) is introduced to overcome this problem. This approach is illustrated based on applications of extensions of mixtures of factor analyzers, where the LPBIC is used to select both the number of components and the number of latent factors. The LPBIC is shown to match or outperform the BIC in several situations.
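The contrast between the two criteria can be sketched as follows. The standard BIC trades log-likelihood against a complexity penalty counting all free parameters; a LASSO-penalized variant additionally shrinks mean parameters via an L1 term and counts only those left non-zero. This is an illustrative simplification, not the exact LPBIC of the paper, and all function names are assumptions.

```python
import numpy as np

def bic(loglik, n_params, n):
    """Standard BIC on the scale where larger is better:
    2 * loglik - n_params * log(n)."""
    return 2.0 * loglik - n_params * np.log(n)

def lasso_penalized_bic(loglik, means, n, lam, n_other_params):
    """Illustrative LASSO-penalized BIC: the log-likelihood carries an
    L1 penalty on the component mean parameters, and only the non-zero
    means count toward the complexity penalty. In high dimensions this
    lets a model with many nominal mean parameters avoid the drastic
    penalty that drives the ordinary BIC to too few components."""
    means = np.asarray(means)
    penalized_ll = loglik - lam * np.abs(means).sum()
    effective_params = n_other_params + np.count_nonzero(means)
    return 2.0 * penalized_ll - effective_params * np.log(n)
```

With lam = 0 and no zero means the two criteria coincide, which is a useful sanity check on any implementation.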