Accuracy
Blockchains for Artificial Intelligence
[It was first published on Dataconomy on Dec 21, 2016; I'm reposting here for ease of access.] In recent years, AI (artificial intelligence) researchers have finally cracked problems that they've worked on for decades, from Go to human-level speech recognition. A key piece was the ability to gather and learn on mountains of data, which pulled error rates past the success line. In short, big data has transformed AI, to an almost unreasonable level. Blockchain technology could transform AI too, in its own particular ways. Some applications of blockchains to AI are mundane, like audit trails on AI models. Some appear almost unreasonable, like AI that can own itself -- AI DAOs. All of them are opportunities. This article will explore these applications. Before we discuss applications, let's first review what's different about blockchains compared to traditional big-data distributed databases like MongoDB. We can think of blockchains as "blue ocean" databases: they escape the "bloody red ocean" of sharks competing in an existing market, opting instead to be in a blue ocean of uncontested market space.
Choosing a Machine Learning Classifier
How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you're simply looking for a "good enough" algorithm for your problem, or a place to start, here are some general guidelines I've found to work well over the years. If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
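The "try a few and pick by cross-validation" advice can be sketched in a few lines. This is a minimal illustration using only the standard library, not the author's code: the helper names (`cross_val_accuracy`, `select_best`), the two toy classifiers, and the synthetic data are all assumptions made for the example.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds.

    `fit(X_train, y_train)` must return a function mapping one sample to a label.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    """Baseline: always predict the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(X_train, y_train):
    """1-nearest-neighbour under squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in X_train]
        return y_train[dists.index(min(dists))]
    return predict

def select_best(candidates, X, y, k=5):
    """Score every candidate with the same folds and return the winner."""
    scores = {name: cross_val_accuracy(fit, X, y, k) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores

# Synthetic, well-separated toy data (hypothetical): two small grids of points.
X = [(0.1 * i, 0.1 * j) for i in range(4) for j in range(5)]
X += [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(4) for j in range(5)]
y = [0] * 20 + [1] * 20

best, scores = select_best({"majority": fit_majority, "1-NN": fit_1nn}, X, y)
print(best, scores)
```

In practice you would also loop over each algorithm's parameters inside `candidates`, and on real data a library routine is usually the better choice; the point here is only that every candidate is scored on the same held-out folds.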
NLP: Classification using a Naive Bayes classifier
Here you can find an application of the Naive Bayes approach to a specific problem: the classification of SMS messages as spam (undesired messages) or ham. The supporting code can be found here. The data used for this playground activity is the SMS Spam Collection v. 1, a public set of SMS messages that have been collected for mobile phone spam research, where each message has been labeled as spam or ham. 'In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).'
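To make the approach concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the supporting code mentioned in the post: the class name `NaiveBayesText`, the tokenizer, and the six training messages are hypothetical and stand in for the real SMS Spam Collection data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayesText:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Unnormalized log posterior, log P(class) + sum of log P(word | class)."""
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][w]
                    score += math.log((count + self.alpha) / (total + self.alpha * v))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Made-up training messages, for illustration only.
train = [
    ("win a free prize now", "spam"),
    ("free entry claim your cash prize", "spam"),
    ("urgent call now to claim free cash", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("call me when you get home", "ham"),
    ("see you at lunch tomorrow", "ham"),
]
model = NaiveBayesText().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("claim your free prize"))
print(model.predict("lunch at home tomorrow"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a single unseen word from zeroing out a class.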
Graph Structure Learning from Unlabeled Data for Event Detection
Somanchi, Sriram, Neill, Daniel B.
Processes such as disease propagation and information diffusion often spread over some latent network structure which must be learned from observation. Given a set of unlabeled training examples representing occurrences of an event type of interest (e.g., a disease outbreak), our goal is to learn a graph structure that can be used to accurately detect future events of that type. Motivated by new theoretical results on the consistency of constrained and unconstrained subset scans, we propose a novel framework for learning graph structure from unlabeled data by comparing the most anomalous subsets detected with and without the graph constraints. Our framework uses the mean normalized log-likelihood ratio score to measure the quality of a graph structure, and efficiently searches for the highest-scoring graph structure. Using simulated disease outbreaks injected into real-world Emergency Department data from Allegheny County, we show that our method learns a structure similar to the true underlying graph, but enables faster and more accurate detection.
Outlier Detection for Text Data: An Extended Version
Kannan, Ramakrishnan, Woo, Hyenkyun, Aggarwal, Charu C., Park, Haesun
The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. In such cases, it often becomes difficult to separate the outliers from the natural variations in the patterns in the underlying data. In this paper, we present a matrix factorization method, which is naturally able to distinguish the anomalies with the use of low rank approximations of the underlying data. Our iterative algorithm TONMF is based on the block coordinate descent (BCD) framework. We define blocks over the term-document matrix such that the function becomes solvable. Given the most recently updated values of the other matrix blocks, we always update one block at a time to its optimum. Our approach has significant advantages over traditional methods for text outlier detection. Finally, we present experimental results illustrating the effectiveness of our method over competing methods.
Machine Learning Walkthrough Part One: Preparing the Data
Cleaning and preparing data is a critical first step in any machine learning project. In this blog post, Dataquest student Daniel Osei takes us through examining a dataset, selecting columns for features, exploring the data visually and then encoding the features for machine learning. This post is based on a Dataquest 'Monthly Challenge', where our students are given a free-form task to complete. After first reading about Machine Learning on Quora in 2015, Daniel became excited at the prospect of an area that could combine his love of Mathematics and Programming. After reading this article on how to learn data science, Daniel started following the steps, eventually joining Dataquest to learn Data Science with us in April 2016. We'd like to thank Daniel for his hard work, and for generously letting us publish this post.
With AI2, Machine Learning and Analysts Come Together to Impress, Part 1: An Introduction
Machine learning is everywhere in the world of cybersecurity these days. It is often thought of as the magic bullet to secure systems and networks -- a tool able to identify previously invisible attacks through a nontransparent set of functions, as in neural nets. Transparency aside, neural nets and other algorithms have indeed proven very effective. Security professionals run into a distinct problem when attempting to do this, however. Machine learning classifiers perform much better in the supervised case, where labeled data is available.
Sparse model selection in the highly under-sampled regime
Bulso, Nicola, Marsili, Matteo, Roudi, Yasser
We propose a method for recovering the structure of a sparse undirected graphical model when very few samples are available. The method decides about the presence or absence of bonds between pairs of variables by considering one pair at a time and using a closed-form formula, analytically derived by calculating the posterior probability for every possible model explaining a two-body system using Jeffreys prior. The approach does not rely on the optimization of any cost functions and consequently is much faster than existing algorithms. Despite this time and computational advantage, numerical results show that for several sparse topologies the algorithm is comparable to the best existing algorithms, and is more accurate in the presence of hidden variables. We apply this approach to the analysis of US stock market data and to neural data, in order to show its efficiency in recovering robust statistical dependencies in real data with non-stationary correlations in time and/or space.
Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure
In co-expression analysis, the correlation between pairs of genes is typically combined into a network model of the correlation structure, which facilitates secondary network analysis such as community structure or centrality [1]. However, the correlation between pairs of genes in a co-expression network typically is assumed to be uniform across all samples (e.g., tissue types, treatment conditions, disease status, etc.). Yet it is often inter-group differences in correlated data that are of biological or clinical interest. For example, a gene co-expression network in microarray data for chronic lymphocytic leukemia using known biomarkers was able to predict treatment outcomes in an independent sample [2]. A differential co-expression network approach that leverages the genetic network information may yield novel biomarkers and improved prediction. Differential expression methods compute the mean difference between groups for each gene but typically do not incorporate conditional variation from other genes in the data that may help explain the between-group variation.
Stochastic Online AUC Maximization
Ying, Yiming, Wen, Longyin, Lyu, Siwei
Area under the ROC curve (AUC) is a metric which is widely used for measuring the classification performance for imbalanced data. It is of theoretical and practical interest to develop online learning algorithms that maximize AUC for large-scale data. A specific challenge in developing an online AUC maximization algorithm is that the learning objective function is usually defined over a pair of training examples of opposite classes, and existing methods achieve online processing with higher space and time complexity. In this work, we propose a new stochastic online algorithm for AUC maximization. In particular, we show that AUC optimization can be equivalently formulated as a convex-concave saddle point problem. From this saddle representation, a stochastic online algorithm (SOLAM) is proposed which has time and space complexity of one datum. We establish theoretical convergence of SOLAM with high probability and demonstrate its effectiveness and efficiency on standard benchmark datasets.
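The pairwise nature of the objective is easiest to see in the empirical AUC itself, which is the fraction of (positive, negative) pairs that the scores rank correctly. The sketch below computes that quantity directly; it is not SOLAM, and the function name `auc_pairwise` and the example scores are assumptions made for illustration.

```python
def auc_pairwise(scores, labels):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 for positive and 0 for negative; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:           # every positive is compared with every negative,
        for n in neg:       # so the cost grows with len(pos) * len(neg)
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for six examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc_pairwise(scores, labels))  # 8 of the 9 pairs are ordered correctly
```

Because each new example has to be compared against stored examples of the opposite class, a naive online version must keep past data around, which is the space and time cost the abstract refers to.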