Genre
Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data
Savage, Richard S., Ghahramani, Zoubin, Griffin, Jim E., Kirk, Paul, Wild, David L.
We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that showly a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
Link Prediction with Social Vector Clocks
Lee, Conrad, Nick, Bobo, Brandes, Ulrik, Cunningham, Pรกdraig
State-of-the-art link prediction utilizes combinations of complex features derived from network panel data. We here show that computationally less expensive features can achieve the same performance in the common scenario in which the data is available as a sequence of interactions. Our features are based on social vector clocks, an adaptation of the vector-clock concept introduced in distributed computing to social interaction networks. In fact, our experiments suggest that by taking into account the order and spacing of interactions, social vector clocks exploit different aspects of link formation so that their combination with previous approaches yields the most accurate predictor to date.
Identifying Independence in Relational Models
The rules of d-separation provide a framework for deriving conditional independence facts from model structure. However, this theory only applies to simple directed graphical models. We introduce relational d-separation, a theory for deriving conditional independence in relational models. We provide a sound, complete, and computationally efficient method for relational d-separation, and we present empirical results that demonstrate effectiveness.
Managing sparsity, time, and quality of inference in topic models
Noname manuscript No. (will be inserted by the editor) Abstract Inference is an integral part of probabilistic topic models, but is often nontrivial to derive an efficient algorithm for a specific model. It is even much more challenging when we want to find a fast inference algorithm which always yields sparse latent representations of documents. In this article, we introduce a simple framework for inference in probabilistic topic models, denoted by FW. This framework is general and flexible enough to be easily adapted to mixture models. It has a linear convergence rate, offers an easy way to incorporate prior knowledge, and provides us an easy way to directly trade off sparsity against quality and time. We demonstrate the goodness and flexibility of FW over existing inference methods by a number of tasks. Finally, we show how inference in topic models with nonconjugate priors can be done efficiently. Keywords Topic modeling ยท Fast inference ยท Sparsity ยท Tradeoff ยท Greedy sparse approximation 1 Introduction We are interested in the two important problems in developing probabilistic topic models: sparsity and time. The sparsity problem is to infer sparse latent representations of documents, while the second problem asks for an efficient inference algorithm for a topic model. These two problems have been attracting significant interest in recent years, because of their significant impacts and nontrivial nature. Inference is an integral part of any topic models, and is often NPhard (Sontag and Roy, 2011).
Towards more accurate clustering method by using dynamic time warping
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases. For this reason, it is important to have efficient computational methods and algorithms that can be applied on large datasets, such that it is still possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper a more accurate simple process to speed up ML methods. An unsupervised clustering algorithm is combined with Expectation, Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The idea of the proposed process consists of two steps. In the first step, training instances with similar inputs are clustered and a weight factor which represents the frequency of these instances is assigned to each representative cluster. Dynamic Time Warping technique is used as a dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition and emission probabilities matrixes as those obtained with the classical HMM training algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times is possible when the size is about 100.000 instances. The proposed approach is not limited to training HMMs, but it can be employed for a large variety of MLs methods.
General Quantum Hilbert Space Modeling Scheme for Entanglement
Aerts, Diederik, Sozzo, Sandro
We work out a classification scheme for quantum modeling in Hilbert space of any kind of composite entity violating Bell's inequalities and exhibiting entanglement. Our theoretical framework includes situations with entangled states and product measurements ('customary quantum situation'), and also situations with both entangled states and entangled measurements ('nonlocal box situation', 'nonlocal non-marginal box situation'). We show that entanglement is structurally a joint property of states and measurements. Furthermore, entangled measurements enable quantum modeling of situations that are usually believed to be 'beyond quantum'. Our results are also extended from pure states to quantum mixtures.
Distributed dictionary learning over a sensor network
Chainais, Pierre, Richard, Cรฉdric
We consider the problem of distributed dictionary learning, where a set of nodes is required to collectively learn a common dictionary from noisy measurements. This approach may be useful in several contexts including sensor networks. Diffusion cooperation schemes have been proposed to solve the distributed linear regression problem. In this work we focus on a diffusion-based adaptive dictionary learning strategy: each node records observations and cooperates with its neighbors by sharing its local dictionary. The resulting algorithm corresponds to a distributed block coordinate descent (alternate optimization). Beyond dictionary learning, this strategy could be adapted to many matrix factorization problems and generalized to various settings. This article presents our approach and illustrates its efficiency on some numerical examples. Keywords: dictionary learning, sparse coding, distributed estimation, diffusion, matrix factorization, adaptive networks, block coordinate descent.
Sparsity regret bounds for individual sequences in online linear regression
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
From Constraints to Resolution Rules, Part I: Conceptual Framework
Many real world problems naturally appear as constraints satisfaction problems (CSP), for which very efficient algorithms are known. Most of these involve the combination of two techniques: some direct propagation of constraints between variables (with the goal of reducing their sets of possible values) and some kind of structured search (depth-first, breadth-first,...). But when such blind search is not possible or not allowed or when one wants a 'constructive' or a 'pattern-based' solution, one must devise more complex propagation rules instead. In this case, one can introduce the notion of a candidate (a 'still possible' value for a variable). Here, we give this intuitive notion a well defined logical status, from which we can define the concepts of a resolution rule and a resolution theory. In order to keep our analysis as concrete as possible, we illustrate each definition with the well known Sudoku example. Part I proposes a general conceptual framework based on first order logic; with the introduction of chains and braids, Part II will give much deeper results.
From Constraints to Resolution Rules, Part II: chains, braids, confluence and T&E
In this Part II, we apply the general theory developed in Part I to a detailed analysis of the Constraint Satisfaction Problem (CSP). We show how specific types of resolution rules can be defined. In particular, we introduce the general notions of a chain and a braid. As in Part I, these notions are illustrated in detail with the Sudoku example - a problem known to be NP-complete and which is therefore typical of a broad class of hard problems. For Sudoku, we also show how far one can go in 'approximating' a CSP with a resolution theory and we give an empirical statistical analysis of how the various puzzles, corresponding to different sets of entries, can be classified along a natural scale of complexity. For any CSP, we also prove the confluence property of some Resolution Theories based on braids and we show how it can be used to define different resolution strategies. Finally, we prove that, in any CSP, braids have the same solving capacity as Trial-and-Error (T&E) with no guessing and we comment this result in the Sudoku case.