outdegree
A Link between Coding Theory and Cross-Validation with Applications
Pahikkala, Tapio, Movahedi, Parisa, Montoya, Ileana, Miikonen, Havu, Foldes, Stephan, Airola, Antti, Major, Laszlo
How many different binary classification problems can a single learning algorithm solve on a fixed data set with exactly zero, or at most a given number of, cross-validation errors? While the number in the former case is known to be limited by the no-free-lunch theorem, we show that the exact answers are given by the theory of error-detecting codes. As a case study, we focus on the AUC performance measure and leave-pair-out cross-validation (LPOCV), in which every possible pair of data points with different class labels is held out at a time. We show that the maximal number of classification problems with fixed class proportion for which a learning algorithm can achieve zero LPOCV error equals the maximal number of code words in a constant weight code (CWC) with certain technical properties. We then generalize CWCs by introducing light CWCs, and prove an analogous result for nonzero LPOCV errors and light CWCs. Moreover, we prove both upper and lower bounds on the maximal numbers of code words in light CWCs. Finally, as an immediate practical application, we develop new LPOCV-based randomization tests for learning algorithms that generalize the classical Wilcoxon-Mann-Whitney U test.
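As a rough illustration of the pair-counting view of AUC that connects it to the Wilcoxon-Mann-Whitney U statistic, the sketch below counts correctly ordered (positive, negative) pairs on a fixed list of scores. It is only an assumption-laden toy: the function name `pairwise_auc` is hypothetical, ties are counted as half an error, and no retraining with the pair held out is performed, which actual LPOCV would require.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5, which is one common convention (an assumption here).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    total = 0.0
    for p, n in product(pos, neg):
        if p > n:
            total += 1.0   # pair ordered correctly
        elif p == n:
            total += 0.5   # tie
    return total / (len(pos) * len(neg))


# Toy data: two positives, two negatives, one misordered pair (0.3 vs 0.5).
example_auc = pairwise_auc([0.9, 0.3, 0.5, 0.2], [1, 1, 0, 0])
```

In an LPOCV setting, each pair's two scores would come from a model trained without that pair; the counting step would stay the same.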
Answering Regular Path Queries Over SQ Ontologies
Gutiérrez-Basulto, Víctor, Ibáñez-García, Yazmín, Jung, Jean Christoph
We study query answering in the description logic $\mathcal{SQ}$ supporting qualified number restrictions on both transitive and non-transitive roles. Our main contributions are a tree-like model property for $\mathcal{SQ}$ knowledge bases and, building upon this, an optimal automata-based algorithm for answering positive existential regular path queries in 2ExpTime.
On Finite and Unrestricted Query Entailment beyond SQ with Number Restrictions on Transitive Roles
Gogacz, Thomas, Gutiérrez-Basulto, Víctor, Ibáñez-García, Yazmín, Jung, Jean Christoph, Murlak, Filip
We study the description logic SQ with number restrictions applicable to transitive roles, extended with either nominals or inverse roles. We show tight 2EXPTIME upper bounds for unrestricted entailment of regular path queries for both extensions and finite entailment of positive existential queries for nominals. For inverses, we establish 2EXPTIME-completeness for unrestricted and finite entailment of instance queries (the latter under restriction to a single, transitive role).
Minesweeper with Limited Moves
Gaspers, Serge (UNSW Sydney and Data61, CSIRO) | Rümmele, Stefan (UNSW Sydney and University of Sydney) | Saffidine, Abdallah (UNSW Sydney) | Tran, Kevin (UNSW Sydney)
We consider the problem of playing Minesweeper with a limited number of moves: Given a partially revealed board, a number of available clicks k, and a target probability p, can we win with probability p? We win if we do not click on a mine and, after our sequence of at most k clicks (which reveal information about the neighboring squares), can correctly identify the placement of all mines. We assume that, at all times, all placements of mines consistent with the currently revealed squares are equiprobable. Our main results are that the problem is PSPACE-complete, and it remains PSPACE-complete when p is a constant, in particular when p = 1. When k = 0 (i.e., we are not allowed to click anywhere), the problem is PP-complete in general, but co-NP-complete when p is a constant, and in particular when p = 1.
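To make the k = 0 case concrete, the following sketch enumerates every mine placement consistent with the revealed clues on a tiny abstract board. The representation (cell names, the `clues` and `neighbors` dictionaries, the function name) is hypothetical and not taken from the paper. With p = 1 the question reduces to whether exactly one consistent placement exists; for general p, this sketch assumes a player who names one consistent placement and is therefore right with probability 1/N under the equiprobability assumption.

```python
from itertools import product


def consistent_placements(unknown, clues, neighbors):
    """Return all mine assignments to `unknown` cells that satisfy every clue.

    unknown:   list of unrevealed cell names
    clues:     dict mapping a revealed cell to its number of adjacent mines
    neighbors: dict mapping a revealed cell to its adjacent unrevealed cells
    Brute force over 2**len(unknown) assignments; tiny boards only.
    """
    found = []
    for bits in product((0, 1), repeat=len(unknown)):
        mines = dict(zip(unknown, bits))
        if all(sum(mines[c] for c in neighbors[r]) == clues[r] for r in clues):
            found.append(mines)
    return found


# One clue touching both unknown cells: two placements remain possible.
ambiguous = consistent_placements(["a", "b"], {"r": 1}, {"r": ["a", "b"]})

# A second clue touching only "a" pins the placement down uniquely.
unique = consistent_placements(
    ["a", "b"], {"r": 1, "s": 1}, {"r": ["a", "b"], "s": ["a"]}
)
```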
Toward computational cumulative biology by combining models of biological datasets
Faisal, Ali, Peltonen, Jaakko, Georgii, Elisabeth, Rung, Johan, Kaski, Samuel
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to both include biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer and the model-based search was more accurate than keyword search; it moreover recovered biologically meaningful relationships that are not straightforwardly visible from annotations, for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
The Sum-over-Forests density index: identifying dense regions in a graph
Senelle, Mathieu, Garcia-Diez, Silvia, Mantrach, Amin, Shimbo, Masashi, Saerens, Marco, Fouss, François
This work introduces a novel nonparametric density index defined on graphs, the Sum-over-Forests (SoF) density index. It is based on a clear and intuitive idea: high-density regions in a graph are characterized by the fact that they contain a large number of low-cost trees with high outdegrees, while low-density regions contain few. Therefore, a Boltzmann probability distribution on the countable set of forests in the graph is defined so that large (high-cost) forests occur with a low probability while short (low-cost) forests occur with a high probability. Then, the SoF density index of a node is defined as the expected outdegree of this node in a non-trivial tree of the forest, thus providing a measure of density around that node. Following the matrix-forest theorem and a statistical physics framework, it is shown that the SoF density index can be easily computed in closed form through a simple matrix inversion. Experiments on artificial and real data sets show that the proposed index performs well at finding dense regions, for graphs of various origins.
Graph Degree Linkage: Agglomerative Clustering on a Directed Graph
Zhang, Wei, Wang, Xiaogang, Zhao, Deli, Tang, Xiaoou
This paper proposes a simple but effective graph-based agglomerative algorithm for clustering high-dimensional data. We explore the different roles of two fundamental concepts in graph theory, indegree and outdegree, in the context of clustering. The average indegree reflects the density near a sample, and the average outdegree characterizes the local geometry around a sample. Based on such insights, we define the affinity measure of clusters via the product of average indegree and average outdegree. The product-based affinity makes our algorithm robust to noise. The algorithm has three main advantages: good performance, easy implementation, and high computational efficiency. We test the algorithm on two fundamental computer vision problems: image clustering and object matching. Extensive experiments demonstrate that it outperforms state-of-the-art methods in both applications.
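The product of average indegree and average outdegree can be sketched on a small weighted directed graph as follows. This is a minimal illustration under stated assumptions, not the paper's exact definition: `W[i, j]` is taken to be the weight of the edge from vertex i to vertex j, the helper names are hypothetical, and the cluster affinity is formed by summing the per-vertex product in both directions.

```python
import numpy as np


def avg_indegree(W, i, cluster):
    """Mean weight of edges pointing from the cluster's vertices into vertex i."""
    return W[cluster, i].sum() / len(cluster)


def avg_outdegree(W, i, cluster):
    """Mean weight of edges pointing from vertex i into the cluster's vertices."""
    return W[i, cluster].sum() / len(cluster)


def cluster_affinity(W, a, b):
    """Symmetrized sum of indegree * outdegree products between clusters a and b."""
    def one_way(src, dst):
        # For each vertex of dst, multiply its average in- and outdegree w.r.t. src.
        return sum(avg_indegree(W, i, src) * avg_outdegree(W, i, src) for i in dst)
    return one_way(a, b) + one_way(b, a)


# Four vertices; vertex 3 is isolated and plays the role of a noisy sample.
W = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
affinity = cluster_affinity(W, [0, 1], [2, 3])
```

Because the two degrees are multiplied, a vertex with edges in only one direction (or none, like vertex 3 here) contributes nothing, which gives some intuition for the robustness to noise mentioned in the abstract.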
From formulas to cirquents in computability logic
Computability logic (CoL) (see http://www.cis.upenn.edu/~giorgi/cl.html) is a recently introduced semantical platform and ambitious program for redeveloping logic as a formal theory of computability, as opposed to the formal theory of truth that logic has more traditionally been. Its expressions represent interactive computational tasks seen as games played by a machine against the environment, and "truth" is understood as existence of an algorithmic winning strategy. With logical operators standing for operations on games, the formalism of CoL is open-ended, and has already undergone a series of extensions. This article extends the expressive power of CoL in a qualitatively new way, generalizing formulas (to which the earlier languages of CoL were limited) to circuit-style structures termed cirquents. The latter, unlike formulas, are able to account for subgame/subtask sharing between different parts of the overall game/task. Among the many advantages offered by this ability is that it allows us to capture, refine and generalize the well-known independence-friendly logic which, after the present leap forward, naturally becomes a conservative fragment of CoL, just as classical logic had been known to be a conservative fragment of the formula-based version of CoL. Technically, this paper is self-contained, and can be read without any prior familiarity with CoL.
Fixing a Tournament
Williams, Virginia Vassilevska (University of California, Berkeley)
We consider a very natural problem concerned with game manipulation. Let G be a directed graph where the nodes represent players of a game, and an edge from u to v means that u can beat v in the game. (If an edge (u, v) is not present, one cannot match u and v.) Given G and a "favorite" node A, is it possible to set up the bracket of a balanced single-elimination tournament so that A is guaranteed to win, if matches occur as predicted by G? We show that the problem is NP-complete for general graphs. For the case when G is a tournament graph we give several interesting conditions on the desired winner A for which there exists a balanced single-elimination tournament which A wins, and it can be found in polynomial time.
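For very small instances, the bracket question can be explored by brute force. The sketch below is not the paper's method: it simply tries every seeding of a balanced single-elimination bracket and simulates the matches as predicted by G. The names `bracket_winner` and `can_fix` are hypothetical, the number of players is assumed to be a power of two, and each pair of players is assumed to have an edge in at most one direction.

```python
from itertools import permutations


def bracket_winner(seeding, beats):
    """Simulate a balanced single-elimination bracket.

    seeding: sequence of players; adjacent entries meet in each round.
    beats:   set of (u, v) pairs meaning u beats v.
    Returns the winner, or None if a scheduled match has no edge.
    """
    current = list(seeding)
    while len(current) > 1:
        nxt = []
        for u, v in zip(current[0::2], current[1::2]):
            if (u, v) in beats:
                nxt.append(u)
            elif (v, u) in beats:
                nxt.append(v)
            else:
                return None  # u and v cannot be matched
        current = nxt
    return current[0]


def can_fix(players, beats, favorite):
    """Return a seeding that makes `favorite` win, or None if none exists."""
    n = len(players)
    if n == 0 or n & (n - 1):
        raise ValueError("number of players must be a power of two")
    for seeding in permutations(players):  # n! seedings: tiny inputs only
        if bracket_winner(seeding, beats) == favorite:
            return seeding
    return None


players = ["A", "B", "C", "D"]
beats = {("A", "B"), ("C", "A"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "C")}
seeding_for_a = can_fix(players, beats, "A")
seeding_for_c = can_fix(players, beats, "C")
```

In this toy tournament graph, C beats only one player and so cannot win two rounds, while A wins as long as it avoids C.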