 Accuracy


Spatial Semantic Scan: Jointly Detecting Subtle Events and their Spatial Footprint

arXiv.org Machine Learning

Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS), which has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS alternates between two optimization steps: using semantic scan (Liu and Neill (2011)) to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods (Shao et al. (2011)) with a high occurrence of documents containing the foreground topics. We evaluate our method on the Emergency Department chief complaints dataset (ED dataset) to verify its effectiveness in detecting real-world disease outbreaks from free-text ED chief complaint data.


Naïve-Bayes Technique for Machine Learning

#artificialintelligence

"We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." "When you have two competing theories that make exactly the same predictions, the simpler one is the better." One famous example of Occam's Razor in action is found in conspiracy theories surrounding the NASA moon landings. Many conspiracy theorists believe that the first Moon Landing was staged and filmed in a studio, part of an elaborate hoax. Their justification relies upon many twisted and convoluted theories, whereas the NASA argument is fairly straightforward.


Structure Learning of Partitioned Markov Networks

arXiv.org Machine Learning

We learn the structure of a Markov Network (MN) between two groups of random variables from joint observations. Since modelling and learning the full MN structure may be hard, learning the links between two groups directly may be a preferable option. We introduce a novel concept called the \emph{partitioned ratio} whose factorization directly associates with the Markovian properties of random variables across two groups. A simple one-shot convex optimization procedure is proposed for learning the \emph{sparse} factorizations of the partitioned ratio and it is theoretically guaranteed to recover the correct inter-group structure under mild conditions. The performance of the proposed method is experimentally compared with state-of-the-art MN structure learning methods using ROC curves. Real applications on analyzing bipartisanship in the US Congress and pairwise DNA/time-series alignments are also reported.


ProtVec: A Continuous Distributed Representation of Biological Sequences

arXiv.org Artificial Intelligence

We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined.


Evasion and Hardening of Tree Ensemble Classifiers

arXiv.org Machine Learning

Classifier evasion consists in finding for a given instance $x$ the nearest instance $x'$ such that the classifier predictions of $x$ and $x'$ are different. We present two novel algorithms for systematically computing evasions for tree ensembles such as boosted trees and random forests. Our first algorithm uses a Mixed Integer Linear Program solver and finds the optimal evading instance under an expressive set of constraints. Our second algorithm trades off optimality for speed by using symbolic prediction, a novel algorithm for fast finite differences on tree ensembles. On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions. Finally, we harden a boosted tree model without loss of predictive accuracy by augmenting the training set of each boosting round with evading instances, a technique we call adversarial boosting.
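
The evasion problem defined above can be illustrated with a toy example. The sketch below is not the paper's MILP or symbolic prediction algorithm; it is a plain exhaustive search over a small integer grid, and the stump ensemble, the grid, and the L1 distance are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical ensemble of three decision stumps over two features.
# Each stump is (feature index, threshold) and votes +1 if x[feature] > threshold.
STUMPS = [(0, 2.5), (1, 1.5), (0, 4.5)]


def predict(x):
    """Majority vote of the stumps: returns +1 or -1."""
    votes = sum(1 if x[feature] > threshold else -1 for feature, threshold in STUMPS)
    return 1 if votes > 0 else -1


def nearest_evasion(x, grid=range(0, 8)):
    """Brute-force search for the closest grid point (L1 distance) that the
    ensemble classifies differently from x. Returns (distance, point) or None."""
    original = predict(x)
    best = None
    for candidate in product(grid, repeat=len(x)):
        if predict(candidate) == original:
            continue
        distance = sum(abs(a - b) for a, b in zip(x, candidate))
        if best is None or distance < best[0]:
            best = (distance, candidate)
    return best


if __name__ == "__main__":
    x = (1, 1)
    print("prediction for x:", predict(x))
    print("nearest evading instance:", nearest_evasion(x))
```

Exhaustive search only works on tiny discrete domains like this one; the cost grows exponentially with the number of features, which is the kind of scaling problem a dedicated solver is meant to avoid.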


Exact Exponent in Optimal Rates for Crowdsourcing

arXiv.org Machine Learning

In many machine learning applications, crowdsourcing has become the primary means for label collection. In this paper, we study the optimal error rate for aggregating labels provided by a set of non-expert workers. Under the classic Dawid-Skene model, we establish matching upper and lower bounds with an exact exponent $mI(\pi)$ in which $m$ is the number of workers and $I(\pi)$ the average Chernoff information that characterizes the workers' collective ability. Such an exact characterization of the error exponent allows us to state a precise sample size requirement $m>\frac{1}{I(\pi)}\log\frac{1}{\epsilon}$ in order to achieve an $\epsilon$ misclassification error. In addition, our results imply the optimality of various EM algorithms for crowdsourcing initialized by consistent estimators.
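
As a rough worked example of the sample size requirement $m>\frac{1}{I(\pi)}\log\frac{1}{\epsilon}$, the sketch below computes the smallest integer $m$ that satisfies the inequality. The function name and the input values are hypothetical, and the natural logarithm is assumed.

```python
import math


def min_workers(chernoff_info: float, epsilon: float) -> int:
    """Smallest integer m with m > (1 / I) * log(1 / epsilon).

    chernoff_info stands in for I(pi); the natural log is an assumption.
    """
    if chernoff_info <= 0:
        raise ValueError("chernoff_info must be positive")
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    bound = math.log(1.0 / epsilon) / chernoff_info
    # floor + 1 gives a strict inequality even when bound is an integer
    return math.floor(bound) + 1


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    for info in (0.1, 0.5):
        print(f"I = {info}: m = {min_workers(info, 0.01)} workers for 1% error")
```

Under these assumptions, halving $I(\pi)$ roughly doubles the number of workers needed for the same target error.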


A Novel Approach for Stable Selection of Informative Redundant Features from High Dimensional fMRI Data

arXiv.org Machine Learning

Numerous functional imaging studies have reported neural activities during the experience of specific emotions or cognitive activities and demonstrated the potential of functional MRI (fMRI) for the classification of cognitive states or identification of mental disorders. In this paper, we consider learning from fMRI data as a pattern recognition problem and mainly focus on how to accurately and stably identify the relevant features (either voxels or network connections) that participate in a given cognitive task or that are closely related to certain mental disorders. We will mainly consider binary classification problems, such as discriminating patients with a certain mental disorder from normal persons or classifying different cognitive states, though the proposed idea can also be extended to the case of regression. As we know, with the rapid development of data capture and storage technologies, the "curse of dimensionality" has become a common issue in many fields [1], including pattern recognition and machine learning, where the "curse of dimensionality" often refers to an extremely high dimensional feature space. Therefore, feature selection, as a way of dimensional reduction, is critical in many pattern recognition applications such as medical image analysis, computer vision, speech recognition and many more [2]. In this paper, we consider the related challenges in pattern recognition based on neuroimaging data, where, besides the "curse of dimensionality", feature selection has another common difficulty: the small number of training samples, due to varied reasons.


Highly Accurate Prediction of Jobs Runtime Classes

arXiv.org Machine Learning

Supplying job schedulers with information on how long the jobs are expected to run enabled the development of the backfilling algorithms, which leverage this information to pack the jobs more efficiently and improve system utilization [1]. These algorithms, however, were designed for parallel systems, in which the jobs require many processors in order to execute, and processor fragmentation (idleness) is a big concern. In those environments the scheduler needs to know the actual runtimes of the jobs (use numeric predictions) to be able to optimize the schedule and improve performance [10]. Our work targets systems in which most jobs are serial, like server farms that are used for software testing. In those environments sophisticated scheduling algorithms are not required, and in order to improve performance it is enough to simply separate the short jobs from the long and assign them to different queues in the system [12]. This separation reduces the likelihood that short jobs will be delayed after long ones, improves the average turnaround times of the jobs and overall system throughput.
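
The queue separation described above can be sketched in a few lines. This is a minimal illustration, assuming each job already carries a predicted runtime class; the `Job` type, the class labels, and `route_jobs` are hypothetical names, not part of the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    predicted_class: str  # assumed labels: "short" or "long"


def route_jobs(jobs):
    """Place each job in the FIFO queue that matches its predicted class."""
    queues = {"short": deque(), "long": deque()}
    for job in jobs:
        if job.predicted_class not in queues:
            raise ValueError(f"unknown runtime class: {job.predicted_class!r}")
        queues[job.predicted_class].append(job)
    return queues


if __name__ == "__main__":
    incoming = [
        Job("unit-tests", "short"),
        Job("full-regression", "long"),
        Job("lint", "short"),
    ]
    for label, queue in route_jobs(incoming).items():
        print(label, [job.name for job in queue])
```

With separate queues, a short job such as `lint` waits only behind other short jobs rather than behind `full-regression`.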


Multivariate data visualization

@machinelearnbot

As a fraud practitioner using data mining techniques to detect fraud, anomalies, outliers, or other indicators of potential problems, I use a combination of data mining and data matching techniques. The volumes of data in a client assignment can vary: 15 million records of company directors, 60,000 employees, accounts payable data for 900,000 suppliers, and 11 million invoice transactions. I'm not a great fan of predictive technologies, as the disparate data sets don't seem to fit with the techniques, but I'm open to alternative methodologies. I've recently tested a single fraud profile using the "Receiver Operating Characteristic" to evaluate the sensitivity and specificity of the profile. The results fell within the ROC space.
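
For readers unfamiliar with the terms, sensitivity and specificity for a single profile correspond to one point in ROC space. The sketch below shows the arithmetic, assuming the profile's hits have been checked against known outcomes; the function name and the counts are made up for illustration and are not figures from the post.

```python
def roc_point(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, and false positive rate from confusion counts.

    tp/fn: actual fraud cases flagged / missed by the profile.
    tn/fp: legitimate cases passed / wrongly flagged by the profile.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate (ROC y-axis)
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity (ROC x-axis)
    }


if __name__ == "__main__":
    # Hypothetical counts for one fraud profile.
    print(roc_point(tp=80, fp=10, tn=90, fn=20))
```

A profile is more useful the closer its point lies to the top-left corner of ROC space (high sensitivity, low false positive rate); a point on the diagonal is no better than random flagging.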


Predicting winners of the Rugby World Cup

#artificialintelligence

For the sake of brevity, not all the relevant data and code are displayed in this post; they can be found here. And you can visit the final working web application here. The Rugby World Cup (RWC) is here, with many fans around the world excited to see the action unfold over the next month and a half. If you've never heard of the sport, whatisrugby.com