 Regression


High-dimensional Index Volatility Models via Stein's Identity

arXiv.org Machine Learning

In this paper, we consider estimating the parametric components of index volatility models, whose variance function has a semiparametric form with one of two common index structures: single index and multiple index. Our approach applies the first- and second-order Stein's identities to the empirical mean squared error (MSE) to extract the direction of the true signals. We study both the low-dimensional and the high-dimensional settings under a finite moment condition, which is weaker than the conditions in the existing literature and makes our estimators applicable even to some heavy-tailed data. From our theoretical analysis, we prove that the statistical rate of convergence has two components: a parametric rate and a nonparametric rate. For the parametric rate, we achieve $\sqrt{n}$-consistency in the low-dimensional setting and an optimal/sub-optimal rate in the high-dimensional setting. For the nonparametric rate, we show that it is asymptotically bounded by $n^{-4/5}$ under both settings when the mean function has a bounded second derivative, so it only contributes higher-order terms. Simulation results also support our theoretical conclusions.
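As a rough illustration of how a first-order Stein's identity can extract a signal direction, the sketch below uses the standard Gaussian special case: if $x \sim N(0, I_d)$ and $y = f(\langle \beta, x \rangle) + \varepsilon$, then $E[y\,x] = E[f'(\langle \beta, x \rangle)]\,\beta$, so the sample average of $y\,x$ points along $\beta$. This is a simplified mean-function example, not the paper's volatility estimator; the link function `tanh`, the noise level, and the helper names `simulate` and `stein_direction` are assumptions made for illustration.

```python
import math
import random

def simulate(n, beta, seed=0):
    """Draw Gaussian covariates and responses from a toy single-index model (assumed)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        z = sum(b * xi for b, xi in zip(beta, x))
        xs.append(x)
        ys.append(math.tanh(z) + 0.1 * rng.gauss(0.0, 1.0))
    return xs, ys

def stein_direction(xs, ys):
    """Average y * x over the sample and normalize it to unit length."""
    n, d = len(xs), len(xs[0])
    moment = [0.0] * d
    for x, y in zip(xs, ys):
        for j in range(d):
            moment[j] += y * x[j]
    moment = [m / n for m in moment]
    norm = math.sqrt(sum(m * m for m in moment))
    return [m / norm for m in moment]

beta = [0.6, 0.8, 0.0]  # unit-norm true direction (assumed)
xs, ys = simulate(20000, beta)
beta_hat = stein_direction(xs, ys)
cosine = sum(b * bh for b, bh in zip(beta, beta_hat))
print("estimated direction:", [round(v, 3) for v in beta_hat])
print("cosine with true direction:", round(cosine, 4))
```

Only the direction of $\beta$ is recovered here, because the scale is absorbed into the unknown factor $E[f'(\langle \beta, x \rangle)]$.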


Using machine learning for phishing domain detection [Tutorial] Packt Hub

#artificialintelligence

Social engineering is one of the most dangerous threats facing every individual and modern organization. Phishing is a well-known, computer-based, social engineering technique. Attackers use disguised email addresses as a weapon to target large companies. With the huge number of phishing emails received every day, companies are not able to detect all of them. That is why new techniques and safeguards are needed to defend against phishing.


Recovery guarantees for polynomial approximation from dependent data with outliers

arXiv.org Machine Learning

Learning non-linear systems from noisy, limited, and/or dependent data is an important task across various scientific fields including statistics, engineering, computer science, mathematics, and many more. In general, this learning task is ill-posed; however, additional information about the data's structure or on the behavior of the unknown function can make the task well-posed. In this work, we study the problem of learning nonlinear functions from corrupted and dependent data. The learning problem is recast as a sparse robust linear regression problem where we incorporate both the unknown coefficients and the corruptions in a basis pursuit framework. The main contribution of our paper is to provide a reconstruction guarantee for the associated $\ell_1$-optimization problem where the sampling matrix is formed from dependent data. Specifically, we prove that the sampling matrix satisfies the null space property and the stable null space property, provided that the data is compact and satisfies a suitable concentration inequality. We show that our recovery results are applicable to various types of dependent data such as exponentially strongly $\alpha$-mixing data, geometrically $\mathcal{C}$-mixing data, and uniformly ergodic Markov chains. Our theoretical results are verified via several numerical simulations.
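One common way to write a basis pursuit problem that treats both the coefficients and the corruptions as sparse unknowns is sketched below. The symbols are assumptions for illustration: $A$ is the sampling matrix (for example, polynomial basis functions evaluated at the data), $c$ the coefficient vector, $e$ the corruption vector, and $y$ the observations. The paper's exact objective may differ, for instance through weights or a noise tolerance.

```latex
\min_{c,\, e} \; \|c\|_1 + \|e\|_1
\quad \text{subject to} \quad A c + e = y
```

In this form the stacked unknown $(c, e)$ is recovered from the augmented matrix $[A \;\; I]$, which is why null space properties of the sampling matrix govern when recovery succeeds.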


Privacy-preserving Transfer Learning for Knowledge Sharing

arXiv.org Artificial Intelligence

In many practical machine-learning applications, it is critical to allow knowledge to be transferred from external domains while preserving user privacy. Unfortunately, existing transfer-learning works do not have a privacy guarantee. In this paper, for the first time, we propose a method that can simultaneously transfer knowledge from external datasets while offering an $\epsilon$-differential privacy guarantee. First, we show that a simple combination of hypothesis transfer learning and privacy-preserving logistic regression can address the problem. However, the performance of this approach can be poor as the sample size in the target domain may be small. To address this problem, we propose a new method which splits the feature set in source and target data into several subsets, and trains models on these subsets before finally aggregating the predictions by stacked generalization. Feature importance can also be incorporated into the proposed method to further improve performance. We prove that the proposed method has an $\epsilon$-differential privacy guarantee, and further analysis shows that its performance is better than that of the above simple combination given the same privacy budget. Finally, experiments on MNIST and real-world RUIJIN datasets show that our proposed method achieves state-of-the-art performance.


Machine learning enables polymer cloud-point engineering via inverse design

arXiv.org Machine Learning

Inverse design is an outstanding challenge in disordered systems with multiple length scales such as polymers, particularly when designing polymers with desired phase behavior. We demonstrate high-accuracy tuning of poly(2-oxazoline) cloud point via machine learning. With a design space of four repeating units and a range of molecular masses, we achieve an accuracy of 4 °C root mean squared error (RMSE) in a temperature range of 24-90 °C, employing gradient boosting with decision trees. The RMSE is >3x better than linear and polynomial regression. We perform inverse design via particle-swarm optimization, predicting and synthesizing 17 polymers with constrained design at 4 target cloud points from 37 to 80 °C. Our approach challenges the status quo in polymer design with a machine learning algorithm that is capable of fast and systematic discovery of new polymers.
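The inverse-design step can be pictured as a search over design variables for a point whose predicted cloud point matches a target. The sketch below is a minimal textbook particle-swarm optimizer, not the authors' implementation. The function `predict_cloud_point` is a hypothetical linear stand-in for the trained gradient-boosting model, and the two-variable design space, swarm size, and coefficients `w`, `c1`, `c2` are all assumptions.

```python
import random

def predict_cloud_point(x):
    """Hypothetical surrogate: maps a design in [0, 1]^2 to a temperature in deg C."""
    return 24.0 + 66.0 * (0.5 * x[0] + 0.5 * x[1])

def pso_inverse_design(predict, target, dim=2, n_particles=30, n_iter=100,
                       w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize (predict(x) - target)^2 over the unit box with a basic PSO."""
    rng = random.Random(seed)

    def loss(x):
        return (predict(x) - target) ** 2

    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_loss = [loss(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_loss[i])
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - pos[i][j])
                             + c2 * r2 * (gbest[j] - pos[i][j]))
                # Clamp to the design bounds after the position update.
                pos[i][j] = min(1.0, max(0.0, pos[i][j] + vel[i][j]))
            cur = loss(pos[i])
            if cur < pbest_loss[i]:
                pbest[i], pbest_loss[i] = pos[i][:], cur
                if cur < gbest_loss:
                    gbest, gbest_loss = pos[i][:], cur
    return gbest

design = pso_inverse_design(predict_cloud_point, target=60.0)
predicted = predict_cloud_point(design)
print("design:", [round(v, 3) for v in design], "predicted cloud point:", round(predicted, 2))
```

In practice the surrogate would be replaced by the fitted regression model, and any synthesis constraints would enter through the bounds or a penalty term in the loss.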


Improving Grey-Box Fuzzing by Modeling Program Behavior

arXiv.org Artificial Intelligence

Grey-box fuzzers such as American Fuzzy Lop (AFL) are popular tools for finding bugs and potential vulnerabilities in programs. While these fuzzers have been able to find vulnerabilities in many widely used programs, they are not efficient; of the millions of inputs executed by AFL in a typical fuzzing run, only a handful discover unseen behavior or trigger a crash. The remaining inputs are redundant, exhibiting behavior that has already been observed. Here, we present an approach to increase the efficiency of fuzzers like AFL by applying machine learning to directly model how programs behave. We learn a forward prediction model that maps program inputs to execution traces, training on the thousands of inputs collected during standard fuzzing. This learned model guides exploration by focusing on fuzzing inputs on which our model is the most uncertain (measured via the entropy of the predicted execution trace distribution). By focusing on executing inputs our learned model is unsure about, and ignoring any input whose behavior our model is certain about, we show that we can significantly limit wasteful execution. Through testing our approach on a set of binaries released as part of the DARPA Cyber Grand Challenge, we show that our approach is able to find a set of inputs that result in more code coverage and discovered crashes than baseline fuzzers with significantly fewer executions.
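The entropy-based prioritization can be sketched as follows. This is an illustrative simplification, not the paper's model: it assumes the learned model outputs, for each candidate input, an independent probability that each program edge is hit, and scores an input by the sum of the per-edge binary entropies. The names `candidates`, `trace_entropy`, and `select_most_uncertain` are hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def trace_entropy(edge_probs):
    """Total uncertainty of a predicted trace, assuming independent edges."""
    return sum(binary_entropy(p) for p in edge_probs)

def select_most_uncertain(candidates, k):
    """Return the k input names whose predicted traces have the highest entropy."""
    ranked = sorted(candidates, key=lambda name: trace_entropy(candidates[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical predicted edge-hit probabilities for three candidate inputs.
candidates = {
    "input_a": [0.99, 0.01, 0.98],  # model is confident: low priority
    "input_b": [0.50, 0.45, 0.60],  # model is unsure: high priority
    "input_c": [0.90, 0.10, 0.50],
}
to_execute = select_most_uncertain(candidates, k=2)
print(to_execute)
```

Inputs such as `input_a`, whose behavior the model already predicts with high confidence, are skipped, which is where the savings in executions would come from.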


Steerable Wavelet Scattering for 3D Atomic Systems with Application to Li-Si Energy Prediction

arXiv.org Machine Learning

A general machine learning architecture is introduced that uses wavelet scattering coefficients of an input three-dimensional signal as features. Solid harmonic wavelet scattering transforms of three-dimensional signals were previously introduced in a machine learning framework for the regression of properties of small organic molecules. Here this approach is extended to general steerable wavelets which are equivariant to translations and rotations, resulting in a sparse model of the target function. The scattering coefficients inherit from the wavelets invariance to translations and rotations. As an illustration of this approach, a linear regression model is learned for the formation energy of amorphous lithium-silicon material states, trained over a database generated using plane-wave Density Functional Theory methods. State-of-the-art results are produced as compared to other machine learning approaches over similarly generated databases.


Joint association and classification analysis of multi-view data

arXiv.org Machine Learning

Multi-view data, that is, matched sets of measurements on the same subjects, have become increasingly common with technological advances in genomics and other fields. Often, the subjects are separated into known classes, and it is of interest to find associations between the views that are related to the class membership. Existing classification methods can either be applied to each view separately, or to the concatenated matrix of all views without taking into account between-view associations. On the other hand, existing association methods cannot directly incorporate class information. In this work we propose a framework for Joint Association and Classification Analysis of multi-view data (JACA). We support the methodology with theoretical guarantees for estimation consistency in high-dimensional settings, and numerical comparisons with existing methods. In addition to the joint learning framework, a distinct advantage of our approach is its ability to use partial information: it can be applied both in settings with missing class labels, and in settings with missing subsets of views. We apply JACA to colorectal cancer data from The Cancer Genome Atlas project, and quantify the association between RNAseq and miRNA views with respect to consensus molecular subtypes of colorectal cancer.


Model change detection with application to machine learning

arXiv.org Machine Learning

Throughout this paper, we use lower case letters to denote scalars and vectors, and use upper case letters to denote random variables and matrices. We consider the model change detection problem in the following setting. Model change detection is studied, in which there are two sets of samples that are independently and identically distributed (i.i.d.) according to a pre-change probabilistic model with parameter θ, and a post-change model with parameter θ′. The goal is to detect whether the change in the model is significant, i.e., whether the difference between the pre-change parameter and the post-change parameter, ‖θ − θ′‖, is sufficiently large. The problem is considered in a Neyman-Pearson setting, where the goal is to maximize the probability of detection under a false alarm constraint. Since the generalized likelihood ratio test (GLRT) is difficult to compute in this problem, we construct an empirical difference test (EDT), which approximates the GLRT and has low computational complexity. Moreover, we provide an approximation method to set the threshold of the EDT to meet the false alarm constraint.


Machine Learning Algorithms

#artificialintelligence

While Deeplearning4j and its suite of open-source libraries - ND4J, DataVec, Arbiter, etc. - primarily implement scalable, deep artificial neural networks, developers can also work with more traditional machine-learning algorithms using our framework. ND4J is a generic tensor library, so the sky's the limit on what can be implemented. We are integrating with Haifeng Li's SMILE, or Statistical Machine Intelligence and Learning Engine, which implements more than one hundred different statistical and machine-learning algorithms, including random forests and GBMs. SMILE shows the best performance of any open-source JVM-based machine-learning library we've seen. Below you'll find explanations of popular machine-learning algorithms and examples of how they are applied.