Genre
Le terme et le concept : fondements d'une ontoterminologie
Most definitions of ontology, viewed as a "specification of a conceptualization", agree on the fact that if an ontology can take different forms, it necessarily includes a vocabulary of terms and some specification of their meaning in relation to the domain's conceptualization. And as domain knowledge is mainly conveyed through scientific and technical texts, we can hope to extract some useful information from them for building ontology. But is it as simple as this? In this article we shall see that the lexical structure, i.e. the network of words linked by linguistic relationships, does not necessarily match the domain conceptualization. We have to bear in mind that writing documents is the concern of textual linguistics, of which one of the principles is the incompleteness of text, whereas building ontology - viewed as task-independent knowledge - is concerned with conceptualization based on formal and not natural languages. Nevertheless, the famous Sapir and Whorf hypothesis, concerning the interdependence of thought and language, is also applicable to formal languages. This means that the way an ontology is built and a concept is defined depends directly on the formal language which is used; and the results will not be the same. The introduction of the notion of ontoterminology allows to take into account epistemological principles for formal ontology building.
Batch kernel SOM and related Laplacian methods for social network analysis
Boulet, Romain, Jouve, Bertrand, Rossi, Fabrice, Villa, Nathalie
Institut de Mathématiques, Université de Toulouse et CNRS (UMR 5219), 118 route de Narbonne, 31062 Toulouse cedex 9, France Abstract Large graphs are natural mathematical models for describing the structure of the data in a wide variety of fields, such as web mining, social networks, information retrieval, biological networks, etc. For all these applications, automatic tools are required to get a synthetic view of the graph and to reach a good understanding of the underlying problem. In particular, discovering groups of tightly connected vertices and understanding the relations between those groups is very important in practice. This paper shows how a kernel version of the batch Self Organizing Map can be used to achieve these goals via kernels derived from the Laplacian matrix of the graph, especially when it is used in conjunction with more classical methods based on the spectral analysis of the graph. The proposed method is used to explore the structure of a medieval social network modeled through a weighted graph that has been directly built from a large corpus of agrarian contracts. This work was partially supported by ANR Project "Graph-Comp". Preprint submitted to Neurocomputing 19 March 2018 1 Introduction Complex networks are large graphs with a non trivial organization. They arise naturally in numerous context [7], such as, to name a few, the World Wide Web (which gives a perfect example of how large and complex such a network may grow), metabolic pathways, citation networks between scientific articles or more general social networks that model interaction between individuals and/or organizations, etc. Complex networks share common properties that have allowed the emergence of mathematical descriptions such as small world graphs or power law graphs. The structure of these graphs often gives some keys to understand the complex network underlined. To study such a structure, one often begins with a metrology process applied to the graph that describes the degree distribution, the number of components, the density, etc. However, it should be noted that dealing with very large graphs (millions of vertices) is still an open question (see [9] for an example of an efficient algorithm to explore that kind of data sets). Several ways have been explored to cluster the vertices of the graph into communities [43] and some of them have in common the use of the Laplacian matrix. Indeed, there are important relationships between the spectrum of the Laplacian and the graph invariants that characterize its structure (see, e.g. These properties can be used for building, from the eigen-decomposition of the Laplacian, a similarity measure or a metric space such that the induced dissimilarities between vertices of the graph are related to its community structure (see [13], among others).
Toward a statistical mechanics of four letter words
Stephens, Greg J., Bialek, William
Princeton Center for Theoretical Physics, Princeton University, Princeton, New Jersey 08544 USA (Dated: December 13, 2021) We consider words as a network of interacting letters, and approximate the probability distribution of states taken on by this network. Despite the intuition that the rules of English spelling are highly combinatorial (and arbitrary), we find that maximum entropy models consistent with pairwise correlations among letters provide a surprisingly good approximation to the full statistics of four letter words, capturing 92% of the multi-information among letters and even'discovering' real words that were not represented in the data from which the pairwise correlations were estimated. The maximum entropy model defines an energy landscape on the space of possible words, and local minima in this landscape account for nearly two-thirds of words used in written English. Many complex systems convey an impression of order into these controversies about language in the broad that is not so easily captured by the traditional tools of sense, but rather to test the power of pairwise interactions theoretical physics. Thus, it is not clear what sort of to capture seemingly complex structure.
Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
Cawley, Gavin C., Talbot, Nicola L., Girolami, Mark
Multinomial logistic regression provides the standard penalised maximumlikelihood solution to multi-class pattern recognition problems. More recently, the development of sparse multinomial logistic regression models has found application in text processing and microarray classification, where explicit identification of the most informative features is of value. In this paper, we propose a sparse multinomial logistic regression method, in which the sparsity arises from the use of a Laplace prior, but where the usual regularisation parameter is integrated out analytically. Evaluation over a range of benchmark datasets reveals this approach results in similar generalisation performance to that obtained using cross-validation, but at greatly reduced computational expense.
Causal inference in sensorimotor integration
Körding, Konrad P., Tenenbaum, Joshua B.
Many recent studies analyze how data from different modalities can be combined. Often this is modeled as a system that optimally combines several sources of information about the same variable. However, it has long been realized that this information combining depends on the interpretation of the data. Two cues that are perceived by different modalities can have different causal relationships: (1) They can both have the same cause, in this case we should fully integrate both cues into a joint estimate.
Large Scale Hidden Semi-Markov SVMs
Rätsch, Gunnar, Sonnenburg, Sören
We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict segmentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learning problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.
Modelling transcriptional regulation using Gaussian Processes
Lawrence, Neil D., Sanguinetti, Guido, Rattray, Magnus
Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively easy to measure, such as mRNA decay rates and mRNA abundance levels, it is still very hard to measure the active concentration levels of the transcription factor proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor can be inferred from gene expression levels of a set of known target genes. We treat the protein concentration as a latent function with a Gaussian process prior, and include the sensitivities, mRNA decay rates and baseline expression levels as hyperparameters. We apply this procedure to a human leukemia dataset, focusing on the tumour repressor p53 and obtaining results in good accordance with recent biological studies.
Scalable Discriminative Learning for Natural Language Parsing and Translation
Turian, Joseph, Wellington, Benjamin, Melamed, I. D.
Parsing and translating natural languages can be viewed as problems of predicting tree structures. For machine learning approaches to these predictions, the diversity and high dimensionality of the structures involved mandate very large training sets. This paper presents a purely discriminative learning method that scales up well to problems of this size. Its accuracy was at least as good as other comparable methods on a standard parsing task. To our knowledge, it is the first purely discriminative learning algorithm for translation with treestructured models. Unlike other popular methods, this method does not require a great deal of feature engineering a priori, because it performs feature selection over a compound feature space as it learns. Experiments demonstrate the method's versatility, accuracy, and efficiency. Relevant software is freely available at http://nlp.cs.nyu.edu/parser and http://nlp.cs.nyu.edu/GenPar.
Emergence of conjunctive visual features by quadratic independent component analysis
Lindgren, J.t., Hyvärinen, Aapo
In previous studies, quadratic modelling of natural images has resulted in cell models that react strongly to edges and bars. Here we apply quadratic Independent Component Analysis to natural image patches, and show that up to a small approximation error, the estimated components are computing conjunctions of two linear features. These conjunctive features appear to represent not only edges and bars, but also inherently two-dimensional stimuli, such as corners. In addition, we show that for many of the components, the underlying linear features have essentially V1 simple cell receptive field characteristics. Our results indicate that the development of the V2 cells preferring angles and corners may be partly explainable by the principle of unsupervised sparse coding of natural images.