Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler aagorithms for discovering motifs. These algorithms axe sometimes prone to producing models that are incorrect because two or more patterns have been tombitted. The statistical model produced in this situation is a convex combination (weighted average) two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as past of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery. Keywords: sequence mod ing; Dirichlet priors; expectation ma -dmization; machine learning; protein motifs; hidden Markov models; unsupervised learning; sequence alignment, multiple Introduction A convex combination occurs when a model combines two or more sequence patterns that should be distinct. This can occur when a sequence pattern discovery algorithm tries to fit a model that is either too short (multiple alignment algorithms) or has too few components (motif discovery algorithms). Since reducing the number of free parameters in the model is generally desirable, many pattern discovery algorithms use heuristics to minimize the length of the sequence model.
Jan-11-2006, 15:50:10 GMT