Goto

Collaborating Authors

 multiple alignment


Condorcet's Jury Theorem for Consensus Clustering and its Implications for Diversity

arXiv.org Machine Learning

Condorcet's Jury Theorem has been invoked for ensemble classifiers to indicate that the combination of many classifiers can have better predictive performance than a single classifier. Such a theoretical underpinning is unknown for consensus clustering. This article extends Condorcet's Jury Theorem to the mean partition approach under the additional assumptions that a unique ground-truth partition exists and sample partitions are drawn from a sufficiently small ball containing the ground-truth. As an implication of practical relevance, we question the claim that the quality of consensus clustering depends on the diversity of the sample partitions. Instead, we conjecture that limiting the diversity of the mean partitions is necessary for controlling the quality.


The SP theory of intelligence and the representation and processing of knowledge in the brain

arXiv.org Artificial Intelligence

The "SP theory of intelligence", with its realisation in the "SP computer model", aims to simplify and integrate observations and concepts across AI-related fields, with information compression as a unifying theme. This paper describes how abstract structures and processes in the theory may be realised in terms of neurons, their interconnections, and the transmission of signals between neurons. This part of the SP theory -- "SP-neural" -- is a tentative and partial model for the representation and processing of knowledge in the brain. In the SP theory (apart from SP-neural), all kinds of knowledge are represented with "patterns", where a pattern is an array of atomic symbols in one or two dimensions. In SP-neural, the concept of a "pattern" is realised as an array of neurons called a "pattern assembly", similar to Hebb's concept of a "cell assembly" but with important differences. Central to the processing of information in the SP system is the powerful concept of "multiple alignment", borrowed and adapted from bioinformatics. Processes such as pattern recognition, reasoning and problem solving are achieved via the building of multiple alignments, while unsupervised learning -- significantly different from the "Hebbian" kinds of learning -- is achieved by creating patterns from sensory information and also by creating patterns from multiple alignments in which there is a partial match between one pattern and another. Short-lived neural structures equivalent to multiple alignments will be created via an inter-play of excitatory and inhibitory neural signals. The paper discusses several associated issues, with relevant empirical evidence.


IISCNLP at SemEval-2016 Task 2: Interpretable STS with ILP based Multiple Chunk Aligner

arXiv.org Machine Learning

Interpretable semantic textual similarity (iSTS) task adds a crucial explanatory layer to pairwise sentence similarity. We address various components of this task: chunk level semantic alignment along with assignment of similarity type and score for aligned chunks with a novel system presented in this paper. We propose an algorithm, iMATCH, for the alignment of multiple non-contiguous chunks based on Integer Linear Programming (ILP). Similarity type and score assignment for pairs of chunks is done using a supervised multiclass classification technique based on Random Forrest Classifier. Results show that our algorithm iMATCH has low execution time and outperforms most other participating systems in terms of alignment score. Of the three datasets, we are top ranked for answer- students dataset in terms of overall score and have top alignment score for headlines dataset in the gold chunks track.


The Mean Partition Theorem of Consensus Clustering

arXiv.org Machine Learning

Clustering is a standard technique for exploratory data analysis that finds applications across different disciplines such as computer science, biology, marketing, and social science. The goal of clustering is to group a set of unlabeled data points into several clusters based on some notion of dissimilarity. Inspired by the success of classifier ensembles, consensus clustering has emerged as a research topic [8, 23]. Consensus clustering first generates several partitions of the same dataset. Then it combines the sample partitions to a single consensus partition. The assumption is that a consensus partition better fits to the hidden structure in the data than individual partitions. One standard approach of consensus clustering combines the sample partitions to a mean partition [3, 4, 5, 6, 9, 17, 20, 21, 22].


The SP theory of intelligence: an overview

arXiv.org Artificial Intelligence

This article is an overview of the "SP theory of intelligence". The theory aims to simplify and integrate concepts across artificial intelligence, mainstream computing and human perception and cognition, with information compression as a unifying theme. It is conceived as a brain-like system that receives 'New' information and stores some or all of it in compressed form as 'Old' information. It is realised in the form of a computer model -- a first version of the SP machine. The concept of "multiple alignment" is a powerful central idea. Using heuristic techniques, the system builds multiple alignments that are 'good' in terms of information compression. For each multiple alignment, probabilities may be calculated. These provide the basis for calculating the probabilities of inferences. The system learns new structures from partial matches between patterns. Using heuristic techniques, the system searches for sets of structures that are 'good' in terms of information compression. These are normally ones that people judge to be 'natural', in accordance with the 'DONSVIC' principle -- the discovery of natural structures via information compression. The SP theory may be applied in several areas including 'computing', aspects of mathematics and logic, representation of knowledge, natural language processing, pattern recognition, several kinds of reasoning, information storage and retrieval, planning and problem solving, information compression, neuroscience, and human perception and cognition. Examples include the parsing and production of language including discontinuous dependencies in syntax, pattern recognition at multiple levels of abstraction and its integration with part-whole relations, nonmonotonic reasoning and reasoning with default values, reasoning in Bayesian networks including 'explaining away', causal diagnosis, and the solving of a geometric analogy problem.


Smart machines and the SP theory of intelligence

arXiv.org Artificial Intelligence

These notes describe how the "SP theory of intelligence", and its embodiment in the "SP machine", may help to realise cognitive computing, as described in the book "Smart Machines". In the SP system, information compression and a concept of "multiple alignment" are centre stage. The system is designed to integrate such things as unsupervised learning, pattern recognition, probabilistic reasoning, and more. It may help to overcome the problem of variety in big data, it may serve in pattern recognition and in the unsupervised learning of structure in data, and it may facilitate the management and transmission of big data. There is potential, via information compression, for substantial gains in computational efficiency, especially in the use of energy. The SP system may help to realise data-centric computing, perhaps via a development of Hebb's concept of a "cell assembly", or via the use of light or DNA for the processing of information. It has potential in the management of errors and uncertainty in data, in medical diagnosis, in processing streams of data, and in promoting adaptability in robots.


Computing as compression: the SP theory of intelligence

arXiv.org Artificial Intelligence

This paper provides an overview of the SP theory of intelligence and its central idea that artificial intelligence, mainstream computing, and much of human perception and cognition, may be understood as information compression. The background and origins of the SP theory are described, and the main elements of the theory, including the key concept of multiple alignment, borrowed from bioinformatics but with important differences. Associated with the SP theory is the idea that redundancy in information may be understood as repetition of patterns, that compression of information may be achieved via the matching and unification (merging) of patterns, and that computing and information compression are both fundamentally probabilistic. It appears that the SP system is Turing-equivalent in the sense that anything that may be computed with a Turing machine may, in principle, also be computed with an SP machine. One of the main strengths of the SP theory and the multiple alignment concept is in modelling concepts and phenomena in artificial intelligence. Within that area, the SP theory provides a simple but versatile means of representing different kinds of knowledge, it can model both the parsing and production of natural language, with potential for the understanding and translation of natural languages, it has strengths in pattern recognition, with potential in computer vision, it can model several kinds of reasoning, and it has capabilities in planning, problem solving, and unsupervised learning. The paper includes two examples showing how alternative parsings of an ambiguous sentence may be modelled as multiple alignments, and another example showing how the concept of multiple alignment may be applied in medical diagnosis.


Evolutionary distances in the twilight zone -- a rational kernel approach

arXiv.org Machine Learning

Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.


Kernels for gene regulatory regions

Neural Information Processing Systems

We describe a hierarchy of motif-based kernels for multiple alignments of biological sequences, particularly suitable to process regulatory regions of genes. The kernels incorporate progressively more information, with the most complex kernel accounting for a multiple alignment of orthologous regions, the phylogenetic tree relating the species, and the prior knowledge that relevant sequence patterns occur in conserved motif blocks. These kernels can be used in the presence of a library of known transcription factor binding sites, or de novo by iterating over all k-mers of a given length. In the latter mode, a discriminative classifier built from such a kernel not only recognizes a given class of promoter regions, but as a side effect simultaneously identifies a collection of relevant, discriminative sequence motifs. We demonstrate the utility of the motif-based multiple alignment kernels by using a collection of aligned promoter regions from five yeast species to recognize classes of cell-cycle regulated genes.


Kernels for gene regulatory regions

Neural Information Processing Systems

We describe a hierarchy of motif-based kernels for multiple alignments of biological sequences, particularly suitable to process regulatory regions of genes. The kernels incorporate progressively more information, with the most complex kernel accounting for a multiple alignment of orthologous regions, the phylogenetic tree relating the species, and the prior knowledge that relevant sequence patterns occur in conserved motif blocks. These kernels can be used in the presence of a library of known transcription factor binding sites, or de novo by iterating over all k-mers of a given length. In the latter mode, a discriminative classifier built from such a kernel not only recognizes a given class of promoter regions, but as a side effect simultaneously identifies a collection of relevant, discriminative sequence motifs. We demonstrate the utility of the motif-based multiple alignment kernels by using a collection of aligned promoter regions from five yeast species to recognize classes of cell-cycle regulated genes.