Alharbi, Basma (King Abdullah University of Science and Technology (KAUST)) | Qahtan, Abdulhakim (King Abdullah University of Science and Technology (KAUST)) | Zhang, Xiangliang (King Abdullah University of Science and Technology (KAUST))

Utilizing trajectories for modeling human mobility often involves extracting descriptive features for each individual, a procedure heavily based on experts' knowledge. In this work, our objective is to minimize human involvement and exploit the power of community in learning 'features' for individuals from their location traces. We propose a probabilistic graphical model that learns the distribution of latent concepts, named motifs, from anonymized sequences of user locations. To handle variation in user activity level, our model learns motif distributions from the sequence-level location co-occurrence of all users. To handle the large variation in location popularity, our model uses an asymmetric prior, conditioned on per-sequence features. We evaluate the new representation in a link prediction task and compare our results to those of baseline approaches.

Most existing methods for DNA motif discovery consider only a single set of sequences to find an over-represented motif. In contrast, we consider multiple sets of sequences, grouping sets associated with the same motif into a cluster under the assumption that each set involves a single motif. Clustering sets of sequences yields clusters of coherent motifs, improving the signal-to-noise ratio or enabling us to identify multiple motifs. We present a probabilistic model for DNA motif discovery that identifies multiple motifs by searching for patterns shared across multiple sets of sequences. Our model infers cluster-indicating latent variables and learns motifs simultaneously, and these two tasks interact with each other.

In this spirit, Sali and Blundell (1990) develop an elaborate scheme for the comparison of protein structures. The results of a comparison form a "generalized protein," which can be used to predict the 3D conformation of the sequence of an unknown protein. Similar to the work of Lathrop et al. (1987), proteins are described by a hierarchy, with each level being a sequence of typed elements. Elements of fragments are represented by a host of computed properties, rather than by a single identifier. Attributes of fragment elements can refer to other elements in the sequence, thus representing binary relationships such as hydrogen bonding between elements.

Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery.

Keywords: sequence modeling; Dirichlet priors; expectation maximization; machine learning; protein motifs; hidden Markov models; unsupervised learning; multiple sequence alignment

Introduction

A convex combination occurs when a model combines two or more sequence patterns that should be distinct. This can occur when a sequence pattern discovery algorithm tries to fit a model that is either too short (multiple alignment algorithms) or has too few components (motif discovery algorithms). Since reducing the number of free parameters in the model is generally desirable, many pattern discovery algorithms use heuristics to minimize the length of the sequence model.
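
The effect of the megaprior heuristic on a single model column can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`posterior_mean`, `megaprior`), the two-letter alphabet, and the scaling rule (each component's total strength set to `k` times the dataset size, with a hypothetical constant `k`) are assumptions made for illustration. The sketch computes the posterior mean of a column under a Dirichlet mixture prior, first with a weak prior and then with a prior whose strength grows with the number of sequences.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, components, weights):
    """Posterior mean letter frequencies for one column.

    counts: observed letter counts in the column.
    components: Dirichlet parameter vectors, one per mixture component.
    weights: prior mixture weights (positive, summing to 1).
    """
    # Log posterior responsibility of each component given the counts.
    log_post = []
    for w, alpha in zip(weights, components):
        updated = [c + a for c, a in zip(counts, alpha)]
        log_post.append(math.log(w) + log_beta(updated) - log_beta(alpha))
    top = max(log_post)
    resp = [math.exp(lp - top) for lp in log_post]
    norm = sum(resp)
    resp = [r / norm for r in resp]

    # Mix the per-component posterior means by responsibility.
    total = sum(counts)
    mean = [0.0] * len(counts)
    for r, alpha in zip(resp, components):
        denom = total + sum(alpha)
        for i, (c, a) in enumerate(zip(counts, alpha)):
            mean[i] += r * (c + a) / denom
    return mean


def megaprior(component_means, n_sequences, k=10.0):
    """Assumed scaling rule: component strength = k * dataset size."""
    return [[k * n_sequences * p for p in m] for m in component_means]


if __name__ == "__main__":
    means = [[0.9, 0.1], [0.1, 0.9]]   # two hypothetical prior components
    weights = [0.5, 0.5]
    counts = [60, 40]                  # a column blending both patterns
    print(posterior_mean(counts, means, weights))
    print(posterior_mean(counts, megaprior(means, sum(counts)), weights))
```

With the weak prior the estimated column stays close to the observed blend of the two patterns, whereas with the scaled prior it is pulled toward the mean of the single best-matching component, which is the behavior the abstract describes.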

O'Callaghan, Derek (University College Dublin) | Harrigan, Martin (University College Dublin) | Carthy, Joe (University College Dublin) | Cunningham, Pádraig (University College Dublin)

As the popularity of content sharing websites has increased, they have become targets for spam, phishing and the distribution of malware. On YouTube, the facility for users to post comments can be used by spam campaigns to direct unsuspecting users to malicious third-party websites. In this paper, we demonstrate how such campaigns can be tracked over time using network motif profiling, i.e. by tracking counts of indicative network motifs. By considering all motifs of up to five nodes, we identify discriminating motifs that reveal two distinctly different spam campaign strategies, and present an evaluation that tracks two corresponding active campaigns.
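
As a rough illustration of tracking counts of network motifs over time, the following Python sketch counts the two connected three-node motifs (open triads and triangles) in an undirected graph and builds one count profile per time window. It is a simplified sketch under stated assumptions, not the method used in the paper: the paper considers all motifs of up to five nodes, and the function names (`count_three_node_motifs`, `motif_profile`) and the edge-list input format are hypothetical.

```python
from itertools import combinations


def count_three_node_motifs(edges):
    """Count open triads and triangles in an undirected graph.

    edges: iterable of (u, v) pairs; self-loops and duplicates are ignored.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    triangle_hits = 0   # each triangle is seen once from each of its 3 nodes
    open_triads = 0     # each open triad is seen once, from its center node
    for neighbors in adjacency.values():
        for a, b in combinations(neighbors, 2):
            if b in adjacency[a]:
                triangle_hits += 1
            else:
                open_triads += 1
    return {"open_triad": open_triads, "triangle": triangle_hits // 3}


def motif_profile(windows):
    """Return one motif-count dictionary per time window of edges."""
    return [count_three_node_motifs(edges) for edges in windows]


if __name__ == "__main__":
    # Two hypothetical time windows of a comment network.
    windows = [
        [("a", "b"), ("b", "c")],
        [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")],
    ]
    for index, profile in enumerate(motif_profile(windows)):
        print(index, profile)
```

Comparing the profiles of successive windows shows how the mix of motifs shifts over time; extending the same idea to four-node and five-node motifs would require a more general subgraph enumeration than this sketch provides.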