Plotting

 Passerini, Andrea


Bootstrapping Domain Ontologies from Wikipedia: A Uniform Approach

AAAI Conferences

Building ontologies is a difficult task requiring skills in logics and ontological analysis. Domain experts usually reach as far as organizing a set of concepts into a hierarchy in which the semantics of the relations is under-specified. The categorization of Wikipedia is a huge concept hierarchy of this form, covering a broad range of areas. We propose an automatic method for bootstrapping domain ontologies from the categories of Wikipedia. The method first selects a subset of concepts that are relevant for a given domain. The relevant concepts are subsequently split into classes and individuals, and, finally, the relations between the concepts are classified into subclass_of, instance_of, part_of, and generic related_to. We evaluate our method by generating ontology skeletons for the domains of Computing and Music. The quality of the generated ontologies has been measured against manually built ground truth datasets of several hundred nodes.


Hybrid SRL with Optimization Modulo Theories

arXiv.org Machine Learning

Generally speaking, the goal of constructive learning could be seen as, given an example set of structured objects, to generate novel objects with similar properties. From a statistical-relational learning (SRL) viewpoint, the task can be interpreted as a constraint satisfaction problem, i.e. the generated objects must obey a set of soft constraints, whose weights are estimated from the data. Traditional SRL approaches rely on (finite) First-Order Logic (FOL) as a description language, and on MAX-SAT solvers to perform inference. Alas, FOL is unsuited for con- structive problems where the objects contain a mixture of Boolean and numerical variables. It is in fact difficult to implement, e.g. linear arithmetic constraints within the language of FOL. In this paper we propose a novel class of hybrid SRL methods that rely on Satisfiability Modulo Theories, an alternative class of for- mal languages that allow to describe, and reason over, mixed Boolean-numerical objects and constraints. The resulting methods, which we call Learning Mod- ulo Theories, are formulated within the structured output SVM framework, and employ a weighted SMT solver as an optimization oracle to perform efficient in- ference and discriminative max margin weight learning. We also present a few examples of constructive learning applications enabled by our method.


Predicting Structural and Functional Sites in Proteins by Searching for Maximum-weight Cliques

AAAI Conferences

Fully characterizing structural and functional sites in proteins is a fundamental step in understanding their roles in the cell. This extremely challenging combinatorial problem requires determining the number of sites in the protein and the set of residues involved in each of them. We formulate it as a distance-based supervised clustering task, where training proteins are employed to learn a proper distance function between residues. A partial clustering is then returned by searching for maximum-weight cliques in the resulting weighted graph representation of proteins. A novel stochastic local search algorithm is proposed to efficiently generate approximate solutions. Our method achieves substantial improvements over a previous structured-output approach for metal binding site prediction. Significant improvements over the current state-of-the-art are also achieved in predicting catalytic sites from 3D structure in enzymes.


Predicting the Geometry of Metal Binding Sites from Protein Sequence

Neural Information Processing Systems

Metal binding is important for the structural and functional characterization of proteins. Previous prediction efforts have only focused on bonding state, i.e. deciding which protein residues act as metal ligands in some binding site. Identifying the geometry of metal-binding sites, i.e. deciding which residues are jointly involved in the coordination of a metal ion is a new prediction problem that has been never attempted before from protein sequence alone. In this paper, we formulate it in the framework of learning with structured outputs. Our solution relies on the fact that, from a graph theoretical perspective, metal binding has the algebraic properties of a matroid, enabling the application of greedy algorithms for learning structured outputs. On a data set of 199 non-redundant metalloproteins, we obtained precision/recall levels of 75\%/46\% correct ligand-ion assignments, which improves to 88\%/88\% in the setting where the metal binding state is known.