Pattern Matching for Self-Tuning of MapReduce Jobs
Rizvandi, Nikzad Babaii, Taheri, Javid, Zomaya, Albert Y.
In this paper, we study the CPU utilization time patterns of several MapReduce applications. After extracting the running patterns of several applications, we save them in a reference database to be used later for tweaking system parameters so that unknown applications can be executed efficiently in the future. To achieve this goal, the CPU utilization patterns of new applications are compared with the already known ones in the reference database to find/predict their most probable execution patterns. Because the patterns have different lengths, Dynamic Time Warping (DTW) is utilized for this comparison; a correlation analysis is then applied to the DTW outcomes to produce feasible similarity patterns. Three real applications (WordCount, Exim Mainlog parsing and Terasort) are used to evaluate our hypothesis in tweaking system parameters when executing similar applications. Results were very promising and showed the effectiveness of our approach on pseudo-distributed MapReduce platforms.
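As a rough illustration of the comparison step described above (not the authors' implementation), the following Python sketch computes a classic dynamic-programming DTW distance between two CPU-utilization traces of different lengths and picks the closest pattern from a small reference database; the trace values, application names, and database contents are made up.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming DTW between two 1-D traces."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def most_similar_pattern(new_trace, reference_db):
    """Return the name of the stored CPU-utilization pattern closest to new_trace."""
    return min(reference_db, key=lambda name: dtw_distance(new_trace, reference_db[name]))

# Illustrative usage with made-up traces of different lengths.
reference_db = {
    "wordcount": np.array([10, 55, 80, 78, 30, 12], dtype=float),
    "terasort":  np.array([20, 90, 95, 92, 88, 60, 25], dtype=float),
}
print(most_similar_pattern(np.array([12, 50, 82, 75, 28], dtype=float), reference_db))
```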
English Sentence Recognition using Artificial Neural Network through Mouse-based Gestures
Handwriting is one of the most important means of daily communication. Although the problem of handwriting recognition has been studied for more than 60 years, there are still many open issues, especially in the task of unconstrained handwritten sentence recognition. This paper focuses on an automatic system that recognizes continuous English sentences drawn through mouse-based gestures in real time, based on an Artificial Neural Network. The proposed Artificial Neural Network is trained using the traditional backpropagation algorithm for a self-supervised neural network, which provides the system with great learning ability and has proven highly successful for training feed-forward Artificial Neural Networks. The designed algorithm is capable of translating not only discrete gesture moves, but also continuous gestures drawn with the mouse. In this paper, we use an efficient neural network approach for recognizing English sentences drawn with the mouse. This approach extracts the boundary of the English sentence, specifies the region of the image where the sentence has been drawn, and then uses an Artificial Neural Network to recognize it. The proposed English Sentence Recognition (ESR) system is designed and tested successfully. Experimental results show that high speed and accuracy were achieved.
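The training procedure referred to above can be illustrated with a minimal feed-forward network trained by plain backpropagation; the layer sizes, the synthetic "gesture feature" data, and the squared-error loss below are illustrative assumptions, not the actual ESR architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Tiny feed-forward net: gesture features -> hidden layer -> one output unit per character class.
n_in, n_hidden, n_out = 16, 12, 26
W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))

# Synthetic stand-in data: 200 random "gesture feature" vectors with random class labels.
X = rng.normal(size=(200, n_in))
labels = rng.integers(0, n_out, size=200)
Y = np.eye(n_out)[labels]

lr = 0.1
for epoch in range(50):
    for x, y in zip(X, Y):
        h = sigmoid(x @ W1)                   # forward pass
        o = sigmoid(h @ W2)
        d_out = (o - y) * o * (1 - o)         # output-layer delta (squared-error loss)
        d_hid = (d_out @ W2.T) * h * (1 - h)  # backpropagated hidden-layer delta
        W2 -= lr * np.outer(h, d_out)         # gradient-descent weight updates
        W1 -= lr * np.outer(x, d_hid)
```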
Generalising unit-refutation completeness and SLUR via nested input resolution
Gwynne, Matthew, Kullmann, Oliver
We introduce two hierarchies of clause-sets, SLUR_k and UC_k, based on the classes SLUR (Single Lookahead Unit Refutation), introduced in 1995, and UC (Unit refutation Complete), introduced in 1994. The class SLUR, introduced in [Annexstein et al, 1995], is the class of clause-sets for which unit-clause propagation (denoted by r_1) detects unsatisfiability, or where otherwise iterative assignment, avoiding obviously false assignments by look-ahead, always yields a satisfying assignment. It is natural to consider how to form a hierarchy based on SLUR. Such investigations were started in [Cepek et al, 2012] and [Balyo et al, 2012]. We present what we consider the "limit hierarchy" SLUR_k, based on generalising r_1 by r_k, that is, using the generalised unit-clause propagation introduced in [Kullmann, 1999, 2004]. The class UC, studied in [Del Val, 1994], is the class of Unit refutation Complete clause-sets, that is, those clause-sets for which unsatisfiability is decidable by r_1 under any falsifying assignment. For unsatisfiable clause-sets F, the minimum k such that r_k determines unsatisfiability of F is exactly the "hardness" of F, as introduced in [Ku 99, 04]. For satisfiable F we now use an extension mentioned in [Ansotegui et al, 2008]: the hardness is the minimum k such that, after application of any falsifying partial assignment, r_k determines unsatisfiability. The class UC_k is given by the clause-sets which have hardness <= k. We observe that UC_1 is exactly UC. UC_k has a proof-theoretic character, due to the relations between hardness and tree-resolution, while SLUR_k has an algorithmic character. The correspondence between r_k and k-times nested input resolution (or tree resolution using clause-space k+1) means that r_k has a dual nature: both algorithmic and proof-theoretic. This corresponds to a basic result of this paper, namely SLUR_k = UC_k.
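For readers unfamiliar with r_1, the following small Python sketch shows plain unit-clause propagation on a clause-set represented as sets of integer literals; it only illustrates the basic reduction underlying the SLUR_k and UC_k hierarchies, not the generalised reductions r_k.

```python
def unit_clause_propagation(clauses):
    """
    Plain unit-clause propagation (r_1): repeatedly assign the literals forced
    by unit clauses and simplify. Clauses are sets of integer literals, with a
    negative integer standing for a negated variable. Returns the simplified
    clauses and the set of assigned literals; an empty clause in the result
    means unsatisfiability has been detected.
    """
    clauses = [set(c) for c in clauses]
    assigned = set()
    while True:
        units = [next(iter(c)) for c in clauses if len(c) == 1]
        units = [l for l in units if l not in assigned]
        if not units:
            return clauses, assigned
        for lit in units:
            if -lit in assigned:              # complementary unit clauses: contradiction
                return [set()], assigned
            assigned.add(lit)
            clauses = [c - {-lit} for c in clauses if lit not in c]

# {x1} and {-x2} are unit clauses, forcing x1 = true and x2 = false; the
# remaining clause {-x1, x2} then shrinks to the empty clause, so r_1 detects
# that this clause-set is unsatisfiable.
print(unit_clause_propagation([{1}, {-1, 2}, {-2}]))
```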
Efficient Sparse Group Feature Selection via Nonconvex Optimization
Xiang, Shuo, Shen, Xiaotong, Ye, Jieping
Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most of the existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) statistically, we introduce a nonconvex sparse group feature selection model which can reconstruct the oracle estimator, so that consistent feature selection and parameter estimation can be achieved; (2) computationally, we propose an efficient algorithm that is applicable to large-scale problems. Numerical results suggest that the proposed nonconvex method compares favorably against its competitors on synthetic data and real-world applications, thus achieving the desired goal of delivering high performance.
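To make the "select groups and features simultaneously" idea concrete, here is a toy Python sketch of a two-level hard-thresholding step over a grouped coefficient vector; it only illustrates the structure of sparse group selection and is not the nonconvex optimization algorithm proposed in the paper. The group layout and counts are illustrative assumptions.

```python
import numpy as np

def sparse_group_threshold(w, groups, k_groups, k_features):
    """
    Toy two-level hard thresholding: keep the k_groups groups with the largest
    Euclidean norm, then keep the k_features largest-magnitude coefficients
    within them; everything else is set to zero.
    """
    w = np.asarray(w, dtype=float).copy()
    norms = {g: np.linalg.norm(w[idx]) for g, idx in groups.items()}
    kept_groups = sorted(norms, key=norms.get, reverse=True)[:k_groups]
    keep_idx = np.concatenate([groups[g] for g in kept_groups])
    mask = np.zeros_like(w, dtype=bool)
    mask[keep_idx] = True
    w[~mask] = 0.0
    # within the kept groups, keep only the k_features largest-magnitude entries
    order = np.argsort(-np.abs(w))
    w[order[k_features:]] = 0.0
    return w

groups = {"g1": np.array([0, 1, 2]), "g2": np.array([3, 4]), "g3": np.array([5, 6, 7])}
print(sparse_group_threshold([0.9, -0.8, 0.1, 0.05, 0.02, 1.2, -0.3, 0.0],
                             groups, k_groups=2, k_features=3))
```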
MANCaLog: A Logic for Multi-Attribute Network Cascades (Technical Report)
Shakarian, Paulo, Simari, Gerardo I., Schroeder, Robert
The modeling of cascade processes in multi-agent systems in the form of complex networks has in recent years become an important topic of study due to its many applications: the adoption of commercial products, the spread of disease, the diffusion of an idea, etc. In this paper, we begin by identifying seven desiderata that a framework for modeling such processes should satisfy: the ability to represent attributes of both nodes and edges, an explicit representation of time, the ability to represent non-Markovian temporal relationships, representation of uncertain information, the ability to represent competing cascades, allowance of non-monotonic diffusion, and computational tractability. We then present the MANCaLog language, a formalism based on logic programming that satisfies all these desiderata, and focus on algorithms for finding minimal models (from which the outcome of cascades can be obtained) as well as on how this formalism can be applied in real-world scenarios. We are not aware of any other formalism in the literature that meets all of the above requirements.
Applying machine learning techniques to improve user acceptance in ubiquitous environments
Ubiquitous information access is becoming more and more important, and research is aimed at adapting it to users. Our work consists in applying machine learning techniques to adapt the information access provided by ubiquitous systems to users when the system only knows the user's social group, without knowing anything about the user's interests. The adaptation procedures associate actions with perceived situations of the user. Associations are based on feedback given by the user as a reaction to the behavior of the system. Our method addresses some of the problems concerning user acceptance that arise when applying machine learning techniques at the beginning of the interaction between the system and the user.
Follow the Leader If You Can, Hedge If You Must
de Rooij, Steven, van Erven, Tim, Grünwald, Peter D., Koolen, Wouter M.
Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has terrible performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As part of our construction, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour and Stoltz (2007), yielding slightly improved worst-case guarantees. By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.
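As background for the algorithms above, the following Python sketch implements the basic Hedge (exponential weights) update with a fixed learning rate; AdaHedge and FlipFlop tune this learning rate adaptively, which the sketch does not attempt, and the loss matrix is made-up data.

```python
import numpy as np

def hedge_weights(cumulative_losses, eta):
    """Exponential-weights ('Hedge') distribution over experts for learning rate eta."""
    w = np.exp(-eta * (cumulative_losses - cumulative_losses.min()))  # shift for numerical stability
    return w / w.sum()

# Follow-the-Leader is the eta -> infinity limit: all mass on the currently best expert.
losses = np.array([[0.3, 0.7], [0.2, 0.9], [0.4, 0.1]])   # rounds x experts, made-up data
cum = np.zeros(2)
for round_losses in losses:
    p = hedge_weights(cum, eta=1.0)
    print("weights:", p, "expected loss this round:", p @ round_losses)
    cum += round_losses
```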
Non-parametric Bayesian modelling of digital gene expression data
Vavoulis, Dimitrios V., Gough, Julian
Next-generation sequencing technologies provide a revolutionary tool for generating gene expression data. Starting with a fixed RNA sample, they construct a library of millions of differentially abundant short sequence tags or "reads", which constitute a fundamentally discrete measure of the level of gene expression. A common limitation in experiments using these technologies is the low number or even absence of biological replicates, which complicates the statistical analysis of digital gene expression data. Analysis of this type of data has often been based on modified tests originally devised for analysing microarrays; both these and even de novo methods for the analysis of RNA-seq data are plagued by the common problem of low replication. We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. We begin with a hierarchical model for modelling over-dispersed count data and a blocked Gibbs sampling algorithm for inferring the posterior distribution of model parameters conditional on these counts. The algorithm compensates for the problem of low numbers of biological replicates by clustering together genes with tag counts that are likely sampled from a common distribution and using this augmented sample for estimating the parameters of this distribution. The number of clusters is not decided a priori, but it is inferred along with the remaining model parameters. We demonstrate the ability of this approach to model biological data with high fidelity by applying the algorithm on a public dataset obtained from cancerous and non-cancerous neural tissues.
Robust PCA and subspace tracking from incomplete observations using L0-surrogates
Hage, Clemens, Kleinsteuber, Martin
Many applications in data analysis rely on the decomposition of a data matrix into a low-rank and a sparse component. Existing methods that tackle this task use the nuclear norm and L1-cost functions as convex relaxations of the rank constraint and the sparsity measure, respectively, or employ thresholding techniques. We propose a method that allows for reconstructing and tracking a subspace of upper-bounded dimension from incomplete and corrupted observations. It does not require any a priori information about the number of outliers. The core of our algorithm is an intrinsic Conjugate Gradient method on the set of orthogonal projection matrices, the so-called Grassmannian. Non-convex sparsity measures are used for outlier detection, which leads to improved performance in terms of robustly recovering and tracking the low-rank matrix. In particular, our approach can cope with more outliers and with an underlying matrix of higher rank than other state-of-the-art methods.
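As a generic illustration of the low-rank plus sparse decomposition this line of work addresses (not the authors' Conjugate Gradient method on the Grassmannian with nonconvex sparsity measures), the following Python sketch alternates a truncated-SVD fit with keeping the largest-magnitude residual entries; the rank, sparsity fraction, and data are illustrative assumptions.

```python
import numpy as np

def low_rank_plus_sparse(M, rank, sparse_frac, n_iter=50):
    """
    Alternate between a rank-'rank' truncated SVD fit (L) and keeping the
    sparse_frac largest-magnitude residual entries (S), so that M ~ L + S.
    """
    S = np.zeros_like(M)
    k = int(sparse_frac * M.size)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # best rank-'rank' fit of M - S
        R = M - L                                     # residual
        S = np.zeros_like(M)
        if k > 0:
            idx = np.unravel_index(np.argsort(-np.abs(R), axis=None)[:k], M.shape)
            S[idx] = R[idx]                           # keep only the largest residual entries
    return L, S

# Made-up corrupted data: a rank-2 matrix plus a few large outliers.
rng = np.random.default_rng(0)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
M[rng.integers(0, 30, 15), rng.integers(0, 20, 15)] += 10.0
L, S = low_rank_plus_sparse(M, rank=2, sparse_frac=0.03)
```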
Affinity Weighted Embedding
Weston, Jason, Weiss, Ron, Yee, Hector
Supervised (linear) embedding models like Wsabie and PSI have proven successful at ranking, recommendation and annotation tasks. However, despite being scalable to large datasets, they do not take full advantage of the extra data due to their linear nature, and they typically underfit. We propose a new class of models which aims to provide improved performance while retaining many of the benefits of the existing class of embedding models. Our new approach works by iteratively learning a linear embedding model in which the next iteration's features and labels are reweighted as a function of the previous iteration. We describe several variants of the family and give some initial results.