Collaborating Authors

 Do, Linh


Dendrogram of mixing measures: Hierarchical clustering and model selection for finite mixture models

arXiv.org Machine Learning

In modern data analysis, it is often useful to reduce the complexity of a large dataset by clustering the observations into a small and interpretable collection of subpopulations. Broadly speaking, there are two major approaches. In "model-based" clustering, the data are assumed to be generated by a (usually small) collection of simple probability distributions such as normal distributions, and clusters are inferred by fitting a probabilistic mixture model. Because of their transparent probabilistic assumptions, the statistical properties of mixture models are well-understood. In particular, if there is no model misspecification, i.e., the data truly come from a mixture distribution, then the subpopulations can be consistently estimated. Unfortunately, this appealing asymptotic guarantee is somewhat at odds with what is often observed in practice, whereby mixture models fitted to complex datasets often return an uninterpretably large number of components, many of which are quite similar to each other. The tendency of mixture models to overfit on real data leads many analysts to employ "model-free" clustering methods instead. A well-known example is hierarchical clustering, which organizes the data into a nested sequence of partitions at different resolutions. It is particularly useful for data exploration as it does not require fixing a number of subpopulations a priori and can be visualized using a dendrogram.
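To make the "nested sequence of partitions" concrete, here is a minimal sketch of agglomerative hierarchical clustering. It is only an illustration, not the method of the paper: it assumes one-dimensional data and single linkage (neither is specified in the text above), and the function name `single_linkage_partitions` is hypothetical.

```python
def single_linkage_partitions(points):
    """Return the nested partitions built by single-linkage agglomerative
    clustering of 1-D points, from all singletons down to one cluster.

    Assumption for illustration: 1-D data and single linkage.
    """
    # Start with every observation in its own cluster, sorted by value.
    clusters = [[p] for p in sorted(points)]
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # For sorted 1-D data, clusters are contiguous runs, so the closest
        # pair under single linkage is always a pair of neighbours.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Merge the two closest neighbouring clusters.
        clusters = clusters[:i] + [clusters[i] + clusters[i + 1]] + clusters[i + 2:]
        partitions.append([list(c) for c in clusters])
    return partitions


# Each entry of `levels` is one resolution of the dendrogram.
levels = single_linkage_partitions([0.0, 0.2, 1.0, 5.0, 5.1])
```

Reading `levels` from first to last gives the partitions from finest to coarsest, which is the information a dendrogram displays; no number of clusters has to be fixed in advance.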


Strong identifiability and parameter learning in regression with heterogeneous response

arXiv.org Machine Learning

Regression is often associated with the task of curve fitting -- given data samples for pairs of random variables (X, Y), find a function y = F(x) that captures the relationship between X and Y as well as possible. As the underlying population for the (X, Y) pairs becomes increasingly complex, much effort has been devoted to learning more complex models for the (regression) function F; see [20, 49, 15] for some recent examples. In many data domains, however, due to the heterogeneity of the behavior of the response variable Y with respect to covariate X, no single function F can fit the data pairs well, no matter how complex F is. Many authors noticed this challenge and adopted a mixture modeling framework for the regression problem, starting with the earlier work of [51, 6, 14]. To capture the uncertain and highly heterogeneous behavior of response variable Y given covariate X, one needs more than a single regression model. Supposing that there are k different regression behaviors, one can represent the conditional distribution of Y given X by a mixture of k conditional density functions associated with k underlying (latent) subpopulations. One can draw from the existing modeling tools for conditional densities, such as generalized linear models [39] or more complex components [28, 63, 22], to improve the model fit for the regression task. Recently, mixtures of regression models (alternatively, regression mixture models) have found applications in a vast range of domains, including risk estimation [2], education [7], medicine [34, 43, 56] and transportation analysis [46, 47, 64]. Inference in mixtures of regression models can be carried out in a classical frequentist framework (e.g., maximum conditional likelihood estimation [6]) or in a Bayesian framework [27].
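As a rough illustration of a mixture of k conditional densities, the sketch below evaluates p(y | x) when each latent subpopulation follows a Gaussian linear regression. The Gaussian linear form, the parameter values, and the names `normal_pdf` and `mixture_regression_density` are assumptions made for this example; the text above allows far more general component families.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_regression_density(y, x, components):
    """Conditional density p(y | x) = sum_j w_j * N(y; a_j + b_j * x, sd_j^2).

    `components` is a list of (weight, intercept, slope, sd) tuples, one per
    latent subpopulation. Gaussian linear components are an illustrative
    assumption, not something fixed by the source text.
    """
    total_weight = sum(w for w, _, _, _ in components)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * normal_pdf(y, a + b * x, sd) for w, a, b, sd in components)


# Two hypothetical subpopulations with opposite trends in x: no single
# line y = F(x) fits both, but the mixture describes each of them.
components = [(0.6, 0.0, 1.0, 0.5), (0.4, 2.0, -1.0, 0.5)]
density_at_point = mixture_regression_density(y=1.0, x=1.0, components=components)
```

With these made-up parameters the two regression lines cross at x = 1, so the conditional density there is unimodal, while at other values of x it is bimodal in y; that bimodality is exactly what a single regression function cannot capture.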