 Performance Analysis


PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets

arXiv.org Machine Learning

Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs that play a significant role in several biological processes. RNA-seq based transcriptome sequencing has been extensively used for identification of lncRNAs. Accurate identification of lncRNAs in RNA-seq datasets is crucial for exploring their characteristic functions in the genome, yet most coding potential computation (CPC) tools fail to identify them accurately in transcriptomic data. Well-known CPC tools such as CPC2, lncScore, and CPAT are primarily designed for prediction of lncRNAs based on the GENCODE, NONCODE and CANTATAdb databases. The prediction accuracy of these tools often drops when they are tested on transcriptomic datasets, which leads to more false positives and inaccuracy in the function annotation process. In this study, we present a novel tool, PLIT, for the identification of lncRNAs in plant RNA-seq datasets. PLIT implements a feature selection method based on L1 regularization and iterative Random Forests (iRF) classification for selection of optimal features. Based on sequence and codon-bias features, it classifies the RNA-seq derived FASTA sequences into coding or long non-coding transcripts. Using L1 regularization, 31 optimal features were obtained based on lncRNA and protein-coding transcripts from 8 plant species. The performance of the tool was evaluated on 7 plant RNA-seq datasets using 10-fold cross-validation. The analysis exhibited superior accuracy when evaluated against currently available state-of-the-art CPC tools.
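To make "sequence and codon-bias features" more concrete, the sketch below computes three simple descriptors from FASTA records: transcript length, GC content, and frame-0 codon frequencies. These particular descriptors and all function names (`parse_fasta`, `gc_content`, `codon_frequencies`, `sequence_features`) are illustrative assumptions; the abstract does not list PLIT's 31 selected features, and this is not PLIT's code.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}; the id is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def gc_content(seq):
    """Fraction of G or C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def codon_frequencies(seq):
    """Relative frequency of each codon read in frame 0; a trailing partial codon is ignored."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    if total == 0:
        return {}
    return {codon: n / total for codon, n in Counter(codons).items()}


def sequence_features(seq):
    """Hypothetical feature set for one transcript (an assumption, not PLIT's feature list)."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "codon_freq": codon_frequencies(seq),
    }


# Toy input: one transcript split over two sequence lines.
example_fasta = ">tx1 example transcript\nATGGCC\nATGTAA\n"
for record_id, sequence in parse_fasta(example_fasta).items():
    print(record_id, sequence_features(sequence))
```

Feature vectors of this kind would then be passed to a feature selection step and a classifier, which the sketch leaves out.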


A Probabilistic Framework to Node-level Anomaly Detection in Communication Networks

arXiv.org Machine Learning

In this paper we consider the task of detecting abnormal communication volume occurring at the node level in communication networks. The signal of the communication activity is modeled by means of a clique stream: each communication event is instantaneous and activates an undirected subgraph spanning a set of equally participating nodes. We present a probabilistic framework to model and assess the communication volume observed at any single node. Specifically, we employ nonparametric regression to learn the probability that a node takes part in a certain event given the set of other nodes that are involved. On top of that, we present a concentration inequality around the estimated volume of events in which a node could participate, which in turn allows us to build an efficient and interpretable anomaly scoring function. Finally, the superior performance of the proposed approach is empirically demonstrated on real-world sensor network data, as well as on synthetic communication activity that is in accordance with that setting.

I. INTRODUCTION

Monitoring the activity in communication networks has become a popular area of research, and particular attention has been paid to detection tasks such as spotting events or anomalies. An effective way to represent the communication activity is via a dynamic graph where the entities are considered to be nodes, and each communication event (or more simply event) is represented by a set of connecting edges that appear at a specific time interval.
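The idea of scoring a node by how far its observed volume exceeds its estimated volume can be sketched with a generic concentration bound. The example below swaps in Hoeffding's inequality, which is an assumption: the abstract does not state which inequality the paper derives. It also assumes the per-event participation probabilities are already estimated and that events are independent. The function names are hypothetical.

```python
import math


def hoeffding_upper_tail(observed, probs):
    """Hoeffding bound on P(S >= observed), where S is a sum of independent
    Bernoulli(p_i) participation indicators, one per event.

    Returns 1.0 (no evidence of excess) when observed does not exceed the
    expected volume sum(probs).
    """
    n = len(probs)
    excess = observed - sum(probs)
    if n == 0 or excess <= 0:
        return 1.0
    return math.exp(-2.0 * excess ** 2 / n)


def anomaly_score(observed, probs):
    """Negative log of the tail bound: 0.0 for unsurprising volume, larger when
    the node takes part in many more events than estimated."""
    n = len(probs)
    excess = observed - sum(probs)
    if n == 0 or excess <= 0:
        return 0.0
    return 2.0 * excess ** 2 / n


# Hypothetical estimates: 20 events, each with a 10% chance of involving the node.
estimated_probs = [0.1] * 20
print(anomaly_score(10, estimated_probs))  # far above the expected 2 events
print(anomaly_score(1, estimated_probs))   # below expectation, score 0.0
```

This version only scores unusually high volume; a two-sided variant would apply the same bound to the absolute deviation.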


A Machine Learning based Robust Prediction Model for Real-life Mobile Phone Data

arXiv.org Machine Learning

Real-life mobile phone data may contain noisy instances, which is a fundamental issue for building a prediction model and has many potential negative consequences. The complexity of the inferred model may increase, overfitting may arise, and the overall prediction accuracy of the model may decrease as a result. In this paper, we address these issues and present a robust prediction model for real-life mobile phone data of individual users, in order to improve the prediction accuracy of the model. In our robust model, we first identify and eliminate the noisy instances from the training dataset by determining a dynamic noise threshold using a naive Bayes classifier and the Laplace estimator; this threshold may differ from user to user according to their unique behavioral patterns. After that, we employ the most popular rule-based machine learning classification technique, i.e., the decision tree, on the noise-free quality dataset to build the prediction model. Experimental results on the real-life mobile phone datasets (e.g., phone call logs) of individual mobile phone users show the effectiveness of our robust model in terms of precision, recall and f-measure.
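The noise-elimination step can be sketched as follows: fit a categorical naive Bayes model with Laplace (add-one) smoothing, then flag every training instance whose own label receives a low posterior probability. The fixed `threshold=0.5` is an assumption that stands in for the dynamic, per-user threshold, whose derivation the abstract does not give. The function names and the toy call log are hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count class and per-feature value frequencies for categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    feature_values = [set() for _ in rows[0]]
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            feature_values[j].add(value)
    return class_counts, value_counts, feature_values


def posterior(row, model):
    """Class posterior for one row, using Laplace (add-one) smoothed estimates."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    joint = {}
    for label, count in class_counts.items():
        p = (count + 1) / (total + len(class_counts))
        for j, value in enumerate(row):
            p *= (value_counts[(label, j)][value] + 1) / (count + len(feature_values[j]))
        joint[label] = p
    norm = sum(joint.values())
    return {label: p / norm for label, p in joint.items()}


def noisy_indices(rows, labels, threshold=0.5):
    """Indices whose given label has posterior below the (assumed, fixed) threshold."""
    model = fit_naive_bayes(rows, labels)
    return [i for i, (row, label) in enumerate(zip(rows, labels))
            if posterior(row, model)[label] < threshold]


# Toy call log: (location, time of day) -> how the user handled the call.
rows = [("work", "morning"), ("work", "morning"), ("work", "afternoon"),
        ("work", "morning"),  # labelled "accept", unlike the other work calls
        ("home", "evening"), ("home", "evening"), ("home", "morning"),
        ("home", "evening")]
labels = ["reject", "reject", "reject", "accept",
          "accept", "accept", "accept", "accept"]

flagged = noisy_indices(rows, labels)
clean_rows = [r for i, r in enumerate(rows) if i not in flagged]
clean_labels = [l for i, l in enumerate(labels) if i not in flagged]
print(flagged)
```

The cleaned rows and labels would then be handed to a decision tree learner, which this sketch omits.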


Yelp Food Identification via Image Feature Extraction and Classification

arXiv.org Machine Learning

Yelp has been one of the most popular local service search engines in the US since 2004. It is powered by crowd-sourced text reviews and photo reviews. Restaurant customers and business owners upload photos to Yelp to review or advertise food, drinks, or interior and exterior decor. Relying on human editors to label food photos is inefficient, an issue that innovative machine learning approaches should address. In this paper, we present a simple but effective approach that can identify up to ten kinds of food from raw photos in the challenge dataset. We use 1) image pre-processing techniques, including filtering and image augmentation, 2) feature extraction via convolutional neural networks (CNN), and 3) three classification algorithms. Then, we illustrate the classification accuracy by tuning parameters for augmentation, the CNN, and classification. Our experimental results show that this simple but effective approach can identify up to ten food types from images.


KTBoost: Combined Kernel and Tree Boosting

arXiv.org Machine Learning

Boosting algorithms [Freund et al., 1996, Friedman et al., 2000, Mason et al., 2000, Friedman, 2001, Bühlmann and Hothorn, 2007] enjoy large popularity in both applied data analysis and machine learning research due to their high predictive accuracy observed on a wide range of data sets [Chen and Guestrin, 2016]. Boosting additively combines base learners by sequentially minimizing a risk functional. To the best of our knowledge, except for one reference [Hothorn et al., 2010], the large majority of boosting algorithms use only one type of functions as base learners. In this article, we relax this assumption by combining trees [Breiman et al., 1984] and reproducing kernel Hilbert space (RKHS) regression functions [Schölkopf and Smola, 2001, Berlinet and Thomas-Agnan, 2011] as base learners, and we empirically show that this combination of different base learners results in increased predictive accuracy compared to both only tree and kernel boosting. To date, regression trees are the most common choice of base learners for boosting in both applied data analysis and machine learning research. In particular, a lot of effort has been made in recent years to develop tree-based boosting methods that scale to large data [Chen and Guestrin, 2016, Ke et al., 2017, Ponomareva et al., 2017, Prokhorenkova et al., 2018]. On the other hand, kernel machines show state-of-the-art predictive accuracy for many data sets as they can, for instance, achieve near-optimal test error [Belkin et al., 2018b,a], and kernel classifiers parallel the behaviors of deep networks as noted in Zhang


Reconstructing dynamical networks via feature ranking

arXiv.org Machine Learning

Empirical data on real complex systems are becoming increasingly available. Parallel to this is the need for new methods of reconstructing (inferring) the topology of networks from time-resolved observations of their node-dynamics. The methods based on physical insights often rely on strong assumptions about the properties and dynamics of the scrutinized network. Here, we use the insights from machine learning to design a new method of network reconstruction that essentially makes no such assumptions. Specifically, we interpret the available trajectories (data) as features, and use two independent feature ranking approaches -- Random forest and RReliefF -- to rank the importance of each node for predicting the value of each other node, which yields the reconstructed adjacency matrix. We show that our method is fairly robust to coupling strength, system size, trajectory length and noise. We also find that the reconstruction quality strongly depends on the dynamical regime.
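The reconstruction recipe (treat each node's trajectory as a feature, rank every node by how well it predicts each other node, read the rankings as an adjacency matrix) can be sketched in a few lines. To stay dependency-free, the sketch swaps in absolute Pearson correlation as the ranking score in place of Random forest or RReliefF importances, and it predicts the next time step of the target node; both choices are assumptions rather than the paper's setup. The function names and toy trajectories are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_influences(trajectories):
    """scores[i][j] = importance of node j's value at time t for predicting
    node i's value at time t + 1; the diagonal is left at 0.0."""
    n = len(trajectories)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        target = trajectories[i][1:]        # node i, shifted one step ahead
        for j in range(n):
            if i != j:
                source = trajectories[j][:-1]
                scores[i][j] = abs(pearson(source, target))
    return scores


# Toy system: node 1 copies node 0 with a one-step delay; node 2 evolves on its own.
x0 = [math.sin(0.7 * t) for t in range(50)]
x1 = [0.0] + x0[:-1]
x2 = [math.cos(1.3 * t + 0.4) for t in range(50)]

scores = rank_influences([x0, x1, x2])
best_source_for_node_1 = max(range(3), key=lambda j: scores[1][j])
print(best_source_for_node_1)
```

Thresholding each row of `scores`, or keeping its top-ranked entries, would turn the rankings into a binary adjacency matrix. A linear score like this one will miss nonlinear couplings that a Random forest ranking could pick up.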


Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops

arXiv.org Machine Learning

Why is humor so difficult for machine learning and AI systems to understand? In light of recent studies in Psychology showing that individual words can be humorous (Engelthaler & Hills, 2017; Westbury et al., 2016), and in light of the fact that Word Embeddings (WEs) have been shown to capture numerous properties of words (e.g., Mikolov et al., 2013), it is natural to study if and how WEs capture humor. First, we find that individual-word humor possesses many aspects of humor that have been discussed in general theories of humor, and that many of these aspects of humor are captured by WEs. To more deeply understand which features of humor WEs capture and to what extent, we draw on existing theories of humor to define a number of candidate features of word humor. Interestingly, many of these theories can be applied to word humor.


Crime Linkage Detection by Spatio-Temporal-Textual Point Processes

arXiv.org Machine Learning

Crimes emerge out of complex interactions of behaviors and situations; thus there are complex linkages between crime incidents. Solving the puzzle of crime linkage is a highly challenging task because we often only have limited information from indirect observations such as records, text descriptions, and associated time and locations. We propose a new modeling and learning framework for detecting linkage between crime events using spatio-temporal-textual data, which are highly prevalent in the form of police reports. We capture the notion of modus operandi (M.O.), by introducing a multivariate marked point process and handling the complex text jointly with the time and location. The model is able to discover the latent space that links the crime series. The model fitting is achieved by a computationally efficient Expectation-Maximization (EM) algorithm. In addition, we explicitly reduce the bias in the text documents in our algorithm. Our numerical results using real data from the Atlanta Police show that our method has competitive performance relative to the state-of-the-art. Our results, including variable selection, are highly interpretable and may bring insights into M.O. extraction.


ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

arXiv.org Machine Learning

This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We will first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.


Identifying Fake News from Twitter Sharing Data: A Large-Scale Study

arXiv.org Machine Learning

Social networks offer a ready channel for fake and misleading news to spread and exert influence. This paper examines the performance of different reputation algorithms when applied to a large and statistically significant portion of the news that is spread via Twitter. Our main result is that simple crowdsourcing-based algorithms are able to identify a large portion of fake or misleading news, while incurring only very low false positive rates for mainstream websites. We believe that these algorithms can be used as the basis of practical, large-scale systems for indicating to consumers which news sites deserve careful scrutiny and skepticism.