Accuracy
Probabilistic Blocking with An Application to the Syrian Conflict
Steorts, Rebecca C., Shrivastava, Anshumali
Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method.
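As a rough illustration of LSH-based blocking, the sketch below uses plain MinHash with banding over character shingles. This is neither KLSH nor DOPH from the abstract; it is a simpler, well-known LSH scheme swapped in for illustration. The function names, the shingle size, the number of hashes and bands, and the sample records are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of lowercase character k-grams of a record string."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(tokens, num_hashes=8):
    """One minimum per seeded hash function; md5 keeps the result deterministic."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest(), 16)
            for token in tokens
        ))
    return tuple(signature)


def lsh_blocks(records, num_hashes=8, bands=4):
    """Group record ids that agree on at least one band of their signature."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for record_id, text in records.items():
        signature = minhash_signature(shingles(text), num_hashes)
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(record_id)
    # Only buckets holding two or more records are useful as blocks.
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


# Hypothetical records: ids 1 and 2 differ only in letter case.
records = {1: "Ahmad Khalil", 2: "ahmad khalil", 3: "Zeinab Haddad"}
blocks = lsh_blocks(records)
```

Records that land in the same block become candidate pairs for the more expensive pairwise comparison, so most dissimilar pairs are never compared.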
Applications of PageRank to Function Comparison and Malware Classification
Slawinski, Michael A., Wortman, Andy
We classify .NET files as either benign or malicious by examining certain directed graphs extracted from the files via decompilation. Each graph is viewed probabilistically as a Markov chain where each node heuristically represents the possible state of the running file, and by computing the PageRank vector (Perron vector with transport) we can assign a probability measure over the nodes of the given graph. We train a random forest with features derived from computing Lebesgue antiderivatives of functions defined over the vertex sets of the graphs listed above against the PageRank measure. The model was trained on 2.5 million samples of .NET and has an accuracy of 98.3\% on test data. The median time needed for decompilation and scoring was 24ms.
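The PageRank step can be sketched with a short power iteration. This is a minimal, generic version under stated assumptions: the damping factor of 0.85, the uniform handling of dangling nodes, and the toy graph with its node names are illustrative choices, not details taken from the paper.

```python
def pagerank(adjacency, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a directed graph.

    `adjacency` maps every node to the list of its successors; each
    successor must itself appear as a key.
    """
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Probability mass sitting on nodes without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            successors = adjacency[v]
            for w in successors:
                new_rank[w] += damping * rank[v] / len(successors)
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph standing in for one extracted from a decompiled file.
graph = {"entry": ["a", "b"], "a": ["b", "exit"], "b": ["entry"], "exit": []}
measure = pagerank(graph)
```

The returned values are non-negative and sum to one, so they can be used as a probability measure over the nodes when computing graph-level features.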
All about Naive Bayes - Towards Data Science
Naive Bayes is one of the simplest algorithms that you can apply to your data. As the name suggests, the algorithm makes the "naive" assumption that all the variables in the dataset are independent of each other given the class, i.e. not correlated. Naive Bayes is a very popular classification algorithm that is mostly used to get a baseline accuracy for a dataset. Let's assume that you are walking on the playground. Now you see some red object in front of you.
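To make the independence assumption concrete, here is a minimal categorical Naive Bayes sketch with add-one (Laplace) smoothing. The function names and the toy fruit data are assumptions made for this example and do not come from the article.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    feature_values = defaultdict(set)      # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[(label, index)][value] += 1
            feature_values[index].add(value)
    return class_counts, feature_counts, feature_values


def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for index, value in enumerate(row):
            # Features are treated as independent given the class, so their
            # smoothed log-probabilities simply add up.
            numerator = feature_counts[(label, index)][value] + 1
            denominator = count + len(feature_values[index])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (colour, shape) -> fruit.
rows = [("red", "round"), ("red", "round"), ("green", "round"),
        ("yellow", "long"), ("yellow", "long"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = train_naive_bayes(rows, labels)
```

Calling `predict(model, ("red", "round"))` scores each class by multiplying its prior with one likelihood term per feature, which is exactly where the "naive" assumption enters.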
A Unified Dynamic Approach to Sparse Model Selection
Sparse model selection is ubiquitous from linear regression to graphical models, where regularization paths, i.e. families of estimators indexed by a varying regularization parameter, are computed when the regularization parameter is unknown or decided data-adaptively. Traditional computational methods rely on solving a set of optimization problems where the regularization parameters are fixed on a grid, which might be inefficient. In this paper, we introduce a simple iterative regularization path, which follows the dynamics of a sparse Mirror Descent algorithm or a generalization of Linearized Bregman Iterations with nonlinear loss. Its performance is competitive with \texttt{glmnet}, with a further bias reduction. A path consistency theory is presented showing that, under the Restricted Strong Convexity (RSC) and the Irrepresentable Condition (IRR), the path will first evolve in a subspace with no false positives and reach an estimator that is sign-consistent or of minimax optimal $\ell_2$ error rate. Early stopping regularization is required to prevent overfitting. Application examples are given in sparse logistic regression and Ising models for NIPS coauthorship.
How Alexa Is Learning to Converse More Naturally : Alexa Blogs
To handle more-natural spoken interactions, Alexa must track references through several rounds of conversation. If, for instance, a customer says, "How far is it to Redmond?" and after the answer follows up by saying, "Find good Indian restaurants there", Alexa should be able to infer that "there" refers to Redmond. We call the task of reference tracking "context carryover," and it's a capability that is currently being phased in to the Alexa experience. At this year's Interspeech, the largest conference on spoken-language understanding, my colleagues and I will present a paper titled "Contextual Slot Carryover for Disparate Schemas," which describes our solution to the problem of slot carryover, a crucial aspect of context carryover. "Domain" describes the type of application -- or "skill" -- that the utterance should invoke; for instance, mapping skills should answer questions about geographic distance.
Stanford AI detects even the smallest earthquakes from seismic data
Microearthquakes -- low-intensity earthquakes that register 2.0 or less magnitude on the moment magnitude scale -- rarely cause property damage. And as a result of background noise, small events, and false positives, they're not always picked up by seismic monitoring systems. A possible solution is described in a new paper from the Department of Geophysics at Stanford University, where scientists have developed an AI system -- dubbed Cnn-Rnn Earthquake Detector, or CRED -- that can isolate and identify a range of seismic signals from historical and continuous data. It builds on the work of Harvard and Google, which in August created an AI model capable of predicting the location of aftershocks up to one year after a major earthquake. The researchers' system consists of neural network layers -- interconnected processing nodes that loosely mimic the function of neurons in the brain -- of two types: convolutional neural networks and recurrent neural networks.
Fighting breast cancer with AI early detection - Hack and Craft
Breast cancer awareness month is here and, with it, the latest statistics send a stark reminder of just how important early detection is in combating this brutal disease. One of the leading causes of death for cancer patients is a late diagnosis, too often brought about by inferior testing facilities, by human factors such as fatigue and loss of concentration, or by the patients themselves, who put off seeing a specialist for fear of what they might discover. But now, thanks to nothing short of revolutionary strides forward in Artificial Intelligence (AI), all that looks set to change for the better. AI is capable of advanced learning using large complex datasets and has the potential to perform tasks such as image interpretation.
Text Classification of the Precursory Accelerating Seismicity Corpus: Inference on some Theoretical Trends in Earthquake Predictability Research from 1988 to 2018
Text analytics based on supervised machine learning classifiers has shown great promise in a multitude of domains, but has yet to be applied to Seismology. We test various standard models (Naive Bayes, k-Nearest Neighbors, Support Vector Machines, and Random Forests) on a seismological corpus of 100 articles related to the topic of precursory accelerating seismicity, spanning from 1988 to 2010. This corpus was labelled in Mignan (2011) according to whether the precursor is explained by critical processes (i.e., cascade triggering) or by other processes (such as signature of main fault loading). We investigate whether the classification process can be automated to help analyze larger corpora in order to better understand trends in earthquake predictability research. We find that the Naive Bayes model performs best, in agreement with the machine learning literature for the case of small datasets, with cross-validation accuracies of 86% for binary classification. For a refined multiclass classification ('non-critical process' < 'agnostic' < 'critical process assumed' < 'critical process demonstrated'), we obtain up to 78% accuracy. Prediction on a dozen articles published since 2011, however, shows weak generalization, with an F1-score of 60%, only slightly better than a random classifier, which can be explained by a change of authorship and the use of different terminologies. Yet the model shows F1-scores greater than 80% for the two multiclass extremes ('non-critical process' versus 'critical process demonstrated'), while it falls to random classifier results (around 25%) for papers labelled 'agnostic' or 'critical process assumed'. Those results are encouraging in view of the small size of the corpus and of the high degree of abstraction of the labelling. Domain knowledge engineering remains essential but can be made transparent by an investigation of Naive Bayes keyword posterior probabilities.
Reinforcement Learning with Perturbed Rewards
Wang, Jingkang, Liu, Yang, Li, Bo
Recent studies have shown the vulnerability of reinforcement learning (RL) models in noisy settings. The sources of noise differ across scenarios. For instance, in practice, the observed reward channel is often subject to noise (e.g., when observed rewards are collected through sensors), and thus observed rewards may not be credible. Also, in applications such as robotics, a deep reinforcement learning (DRL) algorithm can be manipulated to produce arbitrary errors. In this paper, we consider noisy RL problems where the rewards observed by RL agents are generated with a reward confusion matrix. We call such observed rewards perturbed rewards. We develop a robust RL framework, aided by an unbiased reward estimator, that enables RL agents to learn in noisy environments while observing only perturbed rewards. Our framework draws upon approaches for supervised learning with noisy data. The core ideas of our solution include estimating a reward confusion matrix and defining a set of unbiased surrogate rewards. We prove the convergence and sample complexity of our approach. Extensive experiments on different DRL platforms show that policies based on our estimated surrogate reward can achieve higher expected rewards, and converge faster than existing baselines. For instance, the state-of-the-art PPO algorithm is able to obtain 67.5% and 46.7% improvements on average on five Atari games, when the error rates are 10% and 30% respectively.
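The idea of an unbiased surrogate reward can be illustrated for the simplest case of two reward levels and a known confusion matrix. The sketch below assumes that the flip probabilities are given rather than estimated, and the function name, parameter names, and numeric values are illustrative assumptions, not the authors' implementation. The surrogate values are chosen so that their expectation under the noise equals the true reward.

```python
def surrogate_rewards(r_minus, r_plus, e_minus, e_plus):
    """Solve the 2x2 system that makes the surrogate rewards unbiased.

    e_minus: probability that a true r_minus is observed as r_plus.
    e_plus:  probability that a true r_plus is observed as r_minus.
    Returns (hat_minus, hat_plus), the values to use in place of an
    observed r_minus or r_plus respectively.
    """
    det = 1.0 - e_minus - e_plus
    if abs(det) < 1e-12:
        raise ValueError("noise rates make the confusion matrix singular")
    hat_minus = ((1.0 - e_plus) * r_minus - e_minus * r_plus) / det
    hat_plus = ((1.0 - e_minus) * r_plus - e_plus * r_minus) / det
    return hat_minus, hat_plus


# Illustrative values only: rewards of -1 and +1 with asymmetric noise.
hat_minus, hat_plus = surrogate_rewards(-1.0, 1.0, e_minus=0.1, e_plus=0.3)

# Expected surrogate reward when the true reward is r_plus or r_minus.
expected_given_plus = 0.3 * hat_minus + 0.7 * hat_plus
expected_given_minus = 0.9 * hat_minus + 0.1 * hat_plus
```

With more than two reward levels, the same condition becomes a linear system in the confusion matrix, which can be solved by inverting the matrix when it is invertible.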
Graphical Lasso and Thresholding: Equivalence and Closed-form Solutions
Fattahi, Salar, Sojoudi, Somayeh
Graphical Lasso (GL) is a popular method for learning the structure of an undirected graphical model, which is based on an $\ell_1$ regularization technique. The objective of this paper is to compare the computationally-heavy GL technique with a numerically-cheap heuristic method that is based on simply thresholding the sample covariance matrix. To this end, the notions of sign-consistent and inverse-consistent matrices are developed, and then it is shown that the thresholding and GL methods are equivalent if: (i) the thresholded sample covariance matrix is both sign-consistent and inverse-consistent, and (ii) the gap between the largest thresholded and the smallest un-thresholded entries of the sample covariance matrix is not too small. By building upon this result, it is proved that the GL method--as a conic optimization problem--has an explicit closed-form solution if the thresholded sample covariance matrix has an acyclic structure. This result is then generalized to arbitrary sparse support graphs, where a formula is found to obtain an approximate solution of GL. Furthermore, it is shown that the approximation error of the derived explicit formula decreases exponentially fast with respect to the length of the minimum-length cycle of the sparsity graph. The developed results are demonstrated on synthetic data, functional MRI data, traffic flows for transportation networks, and massive randomly generated datasets. We show that the proposed method can obtain an accurate approximation of the GL for instances with sizes as large as $80,000\times 80,000$ (more than 3.2 billion variables) in less than 30 minutes on a standard laptop computer running MATLAB, while other state-of-the-art methods do not converge within 4 hours.
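The thresholding heuristic itself is cheap to state. The sketch below soft-thresholds the off-diagonal entries of a sample covariance matrix and reads off the resulting support graph. The abstract does not spell out the exact thresholding rule, so the soft-thresholding form, the function names, and the small example matrix are assumptions made for illustration; the closed-form GL solution from the paper is not reproduced here.

```python
def soft_threshold_covariance(cov, lam):
    """Shrink off-diagonal entries toward zero by lam; keep the diagonal."""
    n = len(cov)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                result[i][j] = cov[i][j]
            else:
                magnitude = max(abs(cov[i][j]) - lam, 0.0)
                result[i][j] = magnitude if cov[i][j] > 0 else -magnitude
    return result


def support_edges(matrix):
    """Edges (i, j) with i < j whose off-diagonal entry is nonzero."""
    n = len(matrix)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] != 0.0}


# Hypothetical 3x3 sample covariance matrix and threshold.
sample_cov = [[1.0, 0.5, 0.05],
              [0.5, 1.0, -0.3],
              [0.05, -0.3, 1.0]]
thresholded = soft_threshold_covariance(sample_cov, lam=0.1)
edges = support_edges(thresholded)
```

In this toy case the surviving edges form a chain over the three variables, which is an acyclic support graph of the kind for which the abstract reports an explicit closed-form solution.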