A practical approach to language complexity: a Wikipedia case study
Yasseri, Taha, Kornai, András, Kertész, János
We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English Wikipedia (Main). Simple is supposed to use a more simplified language with a limited vocabulary, and editors are explicitly requested to follow this guideline, yet in practice the vocabulary richness of both samples is at the same level. Detailed analysis of longer units (n-grams of words and part of speech tags) shows that the language of Simple is less complex than that of Main primarily due to the use of shorter sentences, as opposed to drastically simplified syntax or vocabulary. Comparing the two language varieties by the Gunning readability index supports this conclusion. We also report on the topical dependence of language complexity, e.g. that the language is more advanced in conceptual articles compared to person-based (biographical) and object-based articles. Finally, we investigate the relation between conflict and language complexity by analyzing the content of the talk pages associated with controversial and peacefully developing articles, concluding that controversy has the effect of reducing language complexity.
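As a rough illustration of the Gunning readability index mentioned above, the sketch below computes the commonly cited form of the index, 0.4 × (average sentence length + percentage of words with three or more syllables). The function names `count_syllables` and `gunning_fog` are hypothetical, the syllable counter is a crude vowel-group heuristic, and the usual exclusions (proper nouns, compound words, common suffixes) are ignored; the abstract does not say how the authors computed the index.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of consecutive vowels (assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text: str) -> float:
    """Gunning fog index: 0.4 * (words per sentence + percent complex words)."""
    # Split on sentence-ending punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # "Complex" words are approximated as those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100.0 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


if __name__ == "__main__":
    short_text = "The cat sat. The dog ran."
    long_text = (
        "The extraordinarily complicated investigation continued "
        "despite considerable opposition."
    )
    print(gunning_fog(short_text), gunning_fog(long_text))
```

Because the first term is the average sentence length, shorter sentences lower the score even when vocabulary is unchanged, which is the effect the abstract attributes to Simple.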
Information-theoretic Dictionary Learning for Image Classification
Qiu, Qiang, Patel, Vishal M., Chellappa, Rama
We present a two-stage approach for learning dictionaries for object classification tasks based on the principle of information maximization. The proposed method seeks a dictionary that is compact, discriminative, and generative. In the first stage, dictionary atoms are selected from an initial dictionary by maximizing the mutual information measure on dictionary compactness, discrimination and reconstruction. In the second stage, the selected dictionary atoms are updated for improved reconstructive and discriminative power using a simple gradient ascent algorithm on mutual information. Experiments using real datasets demonstrate the effectiveness of our approach for image classification tasks.
Bayesian and L1 Approaches to Sparse Unsupervised Learning
Mohamed, Shakir, Heller, Katherine, Ghahramani, Zoubin
The use of L1 regularisation for sparse learning has generated immense research interest, with successful application in such diverse areas as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically underperforms in terms of predictive performance when compared with other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1 minimising factor models, Bayesian variants of "L1", and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner and avoiding unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform L1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.
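The remark about "unnecessary shrinkage of non-zero values" can be illustrated with two scalar thresholding rules. This is a minimal sketch and not the authors' models: `soft_threshold` is the standard proximal operator of the L1 penalty, and `hard_threshold` is used here only as a simple stand-in for L0-like behaviour (a spike-and-slab prior is a Bayesian construction and is not implemented below). Both function names and the example values are assumptions for illustration.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Proximal operator of lam * |x|: zero small values, shrink the rest by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def hard_threshold(x: float, lam: float) -> float:
    """L0-like rule (illustrative stand-in): zero small values, keep the rest intact."""
    return x if abs(x) > lam else 0.0


if __name__ == "__main__":
    values = [3.0, -2.5, 0.4, -0.1]
    lam = 1.0
    # Soft thresholding biases every surviving coefficient toward zero by lam.
    print([soft_threshold(v, lam) for v in values])
    # Hard thresholding leaves surviving coefficients at their original size.
    print([hard_threshold(v, lam) for v in values])
```

Both rules set small coefficients to zero, but only the L1 rule also shrinks the large ones, which is the kind of bias the abstract contrasts with L0-like sparsity.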
Evaluating Ontology Matching Systems on Large, Multilingual and Real-world Test Cases
Meilicke, Christian, Sváb-Zamazal, Ondrej, Trojahn, Cássia, Jiménez-Ruiz, Ernesto, Aguirre, José-Luis, Stuckenschmidt, Heiner, Grau, Bernardo Cuenca
In the field of ontology matching, the most systematic evaluation of matching systems is established by the Ontology Alignment Evaluation Initiative (OAEI), which is an annual campaign for evaluating ontology matching systems organized by different groups of researchers. In this paper, we report on the results of an intermediary OAEI campaign called OAEI 2011.5. The evaluations of this campaign are divided into five tracks. Three of these tracks are new or have been improved compared to previous OAEI campaigns. Overall, we evaluated 18 matching systems. We discuss lessons learned, in terms of scalability, multilingual issues and the ability to deal with real-world cases from different domains.
Predictive Information Rate in Discrete-time Gaussian Processes
Abdallah, Samer A., Plumbley, Mark D.
We derive expressions for the predictive information rate (PIR) for the class of autoregressive Gaussian processes AR(N), both in terms of the prediction coefficients and in terms of the power spectral density. The latter result suggests a duality between the PIR and the multi-information rate for processes with mutually inverse power spectra (i.e. with poles and zeros of the transfer function exchanged). We investigate the behaviour of the PIR in relation to the multi-information rate for some simple examples, which suggest, somewhat counter-intuitively, that the PIR is maximised for very 'smooth' AR processes whose power spectra have multiple poles at zero frequency. We also obtain results for moving average Gaussian processes which are consistent with the duality conjectured earlier. One consequence of this is that the PIR is unbounded for MA(N) processes.
Efficient Algorithm for Extremely Large Multi-task Regression with Massive Structured Sparsity
We develop a highly scalable optimization method called "hierarchical group-thresholding" for solving a multi-task regression model with complex structured sparsity constraints on both input and output spaces. Despite the recent emergence of several efficient optimization algorithms for tackling complex sparsity-inducing regularizers, true scalability in practical high-dimensional problems where a huge number (e.g., millions) of sparsity patterns need to be enforced remains an open challenge, because all existing algorithms must deal with ALL such patterns exhaustively in every iteration, which is computationally prohibitive. Our proposed algorithm addresses the scalability problem by screening out multiple groups of coefficients simultaneously and systematically. We employ a hierarchical tree representation of group constraints to accelerate the process of removing irrelevant constraints by taking advantage of the inclusion relationships between group sparsities, thereby avoiding dealing with all constraints in every optimization step, and requiring optimization only on a small number of outstanding coefficients. In our experiments, we demonstrate the efficiency of our method on simulation datasets, and in an application of detecting genetic variants associated with gene expression traits.
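The idea of using inclusion relationships between groups to skip work can be sketched with a toy tree traversal. This is not the paper's algorithm, which the abstract does not specify: the `GroupNode` class, the `threshold_tree` function, the single shared threshold `lam`, and the example tree are all assumptions made for illustration. The sketch relies only on the fact that a child group contained in its parent cannot have a larger Euclidean norm, so once a parent group is zeroed its descendants need not be examined.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GroupNode:
    """A group of coefficient indices; each child's indices are a subset (assumption)."""
    indices: list[int]
    children: list["GroupNode"] = field(default_factory=list)


def threshold_tree(node: GroupNode, coeffs: list[float], lam: float) -> int:
    """Zero every group whose norm is at most lam; return the number of norms evaluated."""
    norm = math.sqrt(sum(coeffs[i] ** 2 for i in node.indices))
    evaluated = 1
    if norm <= lam:
        # The whole group is screened out; nested subgroups are skipped entirely.
        for i in node.indices:
            coeffs[i] = 0.0
        return evaluated
    for child in node.children:
        evaluated += threshold_tree(child, coeffs, lam)
    return evaluated


if __name__ == "__main__":
    leaves = [GroupNode([i]) for i in range(4)]
    left = GroupNode([0, 1], leaves[0:2])
    right = GroupNode([2, 3], leaves[2:4])
    root = GroupNode([0, 1, 2, 3], [left, right])

    coeffs = [0.1, 0.2, 3.0, 0.1]
    visited = threshold_tree(root, coeffs, lam=0.5)
    print(coeffs, visited)
```

In this toy tree of seven groups, the left subtree is removed after a single norm evaluation, so only five groups are visited; the saving grows with the depth and width of the screened-out subtrees.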
A Plea for Neutral Comparison Studies in Computational Sciences
Boulesteix, Anne-Laure, Eugster, Manuel J. A.
In a context where most published articles are devoted to the development of "new methods", comparison studies are generally appreciated by readers but surprisingly given poor consideration by many scientific journals. In connection with recent articles on over-optimism and epistemology published in Bioinformatics, this letter stresses the importance of neutral comparison studies for the objective evaluation of existing methods and the establishment of standards by drawing parallels with clinical research.
Nonparametric sparsity and regularization
Rosasco, Lorenzo, Villa, Silvia, Mosci, Sofia, Santoro, Matteo, Verri, Alessandro
It is now common to see practical applications, for example in bioinformatics and computer vision, where the dimensionality of the data is on the order of hundreds, thousands and even tens of thousands. It is known that learning in such a high-dimensional regime is feasible only if the quantity to be estimated satisfies some regularity assumptions [24]. In particular, the idea behind so-called sparsity is that the quantity of interest depends only on a few relevant variables (dimensions). In turn, this latter assumption is often the basis for constructing interpretable data models, since the relevant dimensions allow for a compact, hence interpretable, representation. An instance of the above situation is the problem of learning from samples a multivariate function which depends only on a (possibly small) subset of relevant variables. Detecting such variables is the problem of variable selection. Largely motivated by recent advances in compressed sensing [15, 25], the above problem has been extensively studied under the assumption that the function of interest (target function) depends linearly on the relevant variables.
Path Integral Control by Reproducing Kernel Hilbert Space Embedding
Rawlik, Konrad, Toussaint, Marc, Vijayakumar, Sethu
We present an embedding of stochastic optimal control problems, of the so-called path integral form, into reproducing kernel Hilbert spaces. Using consistent, sample-based estimates of the embedding leads to a model-free, non-parametric approach for calculating an approximate solution to the control problem. This formulation admits a decomposition of the problem into an invariant and a task-dependent component. Consequently, we make much more efficient use of the sample data compared to previous sample-based approaches in this domain, e.g., by allowing sample re-use across tasks. Numerical examples on test problems, which illustrate the sample efficiency, are provided.
Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods
A vast amount of textual web streams is influenced by events or phenomena emerging in the real world. The social web forms an excellent modern paradigm, where unstructured user generated content is published on a regular basis and in most cases is freely distributed. The present Ph.D. thesis deals with the problem of inferring information, or patterns in general, about events emerging in real life based on the contents of this textual stream. We show that it is possible to extract valuable information about social phenomena, such as an epidemic or even rainfall rates, by automatic analysis of the content published in Social Media, and in particular Twitter, using Statistical Machine Learning methods. An important intermediate task is the formation and identification of features which characterise a target event; we select and use those textual features in several linear, non-linear and hybrid inference approaches, achieving good performance in terms of the applied loss function. By further examining this rich data set, we also propose methods for extracting various types of mood signals revealing how affective norms, at least within the social web's population, evolve during the day and how significant events emerging in the real world influence them. Lastly, we present some preliminary findings showing several spatiotemporal characteristics of this textual information as well as the potential of using it to tackle tasks such as the prediction of voting intentions.