 Accuracy


Determining Secondary Attributes for Credit Evaluation in P2P Lending

arXiv.org Machine Learning

There has been an increased need for secondary means of credit evaluation by both traditional banking organizations and peer-to-peer lending entities. This is especially important in the present technological era, where relying strictly on primary credit histories does not help distinguish between a 'good' and a 'bad' borrower and ends up hurting both individual borrowers and investors as a whole. We used machine learning classification and clustering algorithms to accurately predict a borrower's creditworthiness while identifying the specific secondary attributes that contribute to this score. While extensive research has been done on predicting whether a loan will be fully paid, feature selection for lending is a relatively new area. We achieved 65% F1 and 73% AUC on the LendingClub data while identifying key secondary attributes.


A Modified AUC for Training Convolutional Neural Networks: Taking Confidence into Account

arXiv.org Machine Learning

The receiver operating characteristic (ROC) curve is an informative tool in binary classification, and the area under the ROC curve (AUC) is a popular metric for reporting the performance of binary classifiers. In this paper, we first present a comprehensive review of the ROC curve and the AUC metric. Next, we propose a modified version of AUC that takes the confidence of the model into account and, at the same time, incorporates AUC into the binary cross-entropy (BCE) loss used for training a convolutional neural network for classification tasks. We demonstrate this on two datasets: MNIST and prostate MRI. Furthermore, we have published GenuineAI, a new Python library, which provides functions for the conventional AUC and the proposed modified AUC, along with metrics including sensitivity, specificity, recall, precision, and F1 for each point of the ROC curve.
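As a point of reference for the conventional metrics named in this abstract, the following is a minimal sketch in plain Python. It computes the standard AUC as the fraction of positive/negative pairs ranked correctly, plus the per-threshold metrics. It does not implement the paper's modified AUC and does not use the GenuineAI library; the function names `roc_auc` and `point_metrics` are hypothetical.

```python
def roc_auc(labels, scores):
    """Conventional AUC: probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def point_metrics(labels, scores, threshold):
    """Metrics for a single ROC operating point (predict 1 when score >= threshold)."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # same quantity as recall
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
print(point_metrics(labels, scores, threshold=0.5))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based rank formulation would be the usual choice for large datasets.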


Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability

arXiv.org Machine Learning

In this paper, we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $\eta$. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $\eta + \epsilon$ for any $\epsilon > 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/\epsilon)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves $\eta + \epsilon$. We also give improved bounds on the sample complexity achievable by polynomial-time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show that any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. Moreover, we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = \sigma(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $\sigma$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally, we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.
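For readers unfamiliar with the noise model, the standard definition of Massart noise with rate $\eta$ can be written as below. This is background stated from the usual definition, not quoted from the abstract; labels are assumed to take values in $\{-1, +1\}$ and $\eta(\mathbf{x})$ denotes the (unknown) flipping probability at the point $\mathbf{x}$.

```latex
\Pr\bigl[\, Y \neq \operatorname{sign}(\langle \mathbf{w}^*, \mathbf{X} \rangle) \;\big|\; \mathbf{X} = \mathbf{x} \,\bigr]
  \;=\; \eta(\mathbf{x}) \;\le\; \eta \;<\; \tfrac{1}{2}
```

Because the flipping probability may vary with $\mathbf{x}$ up to the bound $\eta$, the best halfspace can have error below $\eta$, which is why the guarantees $\eta + \epsilon$ and $\mathsf{OPT} + \epsilon$ in the abstract are distinct targets.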


SEFR: A Fast Linear-Time Classifier for Ultra-Low Power Devices

arXiv.org Machine Learning

One of the fundamental challenges for running machine learning algorithms on battery-powered devices is the time and energy needed for computation, as these devices have constrained resources. There are energy-efficient classifier algorithms, but their accuracy is often sacrificed for resource efficiency. Here, we propose an ultra-low power binary classifier, SEFR, with linear time complexity in both the training and the testing phases. The SEFR method works by creating a hyperplane to separate two classes. The weights of this hyperplane are calculated using normalization, and then the bias is computed based on the weights. SEFR is comparable to state-of-the-art classifiers in terms of classification accuracy, but its execution time and energy consumption are 11.02% and 8.67%, respectively, of the average of state-of-the-art and baseline classifiers. The energy and memory consumption of SEFR is negligible, and it can even perform both the training and testing phases on microcontrollers. We have implemented SEFR on an Arduino Uno; on a dataset with 100 records and 100 features, the training time is 195 milliseconds, and testing 100 records with 100 features takes 0.73 milliseconds. To the best of our knowledge, this is the first multipurpose algorithm specifically devised for learning on ultra-low power devices.
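The two steps described above (normalized weights, then a bias derived from them) can be sketched in plain Python as follows. The abstract does not give the formulas, so the specific choices here are assumptions: each weight is the normalized difference of the per-class feature means, and the bias is a class-size-weighted midpoint of the two classes' average scores. Features are assumed non-negative, and the names `sefr_fit` and `sefr_predict` are hypothetical.

```python
def sefr_fit(X, y):
    """Fit a linear separator in one pass over the data (linear time).

    X: list of rows with non-negative feature values (assumption).
    y: list of 0/1 labels with at least one example of each class.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    eps = 1e-7  # guards against division by zero for all-zero features

    # Assumed weight rule: normalized difference of class means per feature.
    weights = []
    for j in range(len(X[0])):
        avg_pos = sum(row[j] for row in pos) / len(pos)
        avg_neg = sum(row[j] for row in neg) / len(neg)
        weights.append((avg_pos - avg_neg) / (avg_pos + avg_neg + eps))

    def score(row):
        return sum(w * v for w, v in zip(weights, row))

    # Assumed bias rule: weighted midpoint between the average class scores.
    avg_pos_score = sum(score(row) for row in pos) / len(pos)
    avg_neg_score = sum(score(row) for row in neg) / len(neg)
    bias = -(len(neg) * avg_pos_score + len(pos) * avg_neg_score) / (len(pos) + len(neg))
    return weights, bias


def sefr_predict(weights, bias, X):
    """Predict 1 when a row falls on the positive side of the hyperplane."""
    return [1 if sum(w * v for w, v in zip(weights, row)) + bias >= 0 else 0 for row in X]


X = [[1, 5], [2, 6], [8, 1], [9, 2]]
y = [0, 0, 1, 1]
weights, bias = sefr_fit(X, y)
print(sefr_predict(weights, bias, X))
```

Training touches each feature value a constant number of times and prediction is a single dot product per record, which is consistent with the linear-time claim and with running on a microcontroller.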


Efficient AutoML Pipeline Search with Matrix and Tensor Factorization

arXiv.org Artificial Intelligence

Chengrun Yang, Jicong Fan, Ziyang Wu, and Madeleine Udell

This is an extended version of AutoML Pipeline Selection: Efficiently Navigating the Combinatorial Space (DOI: 10.1145/3394486.3403197) at the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020.

Abstract: Data scientists seeking a good supervised learning model on a new dataset have many choices to make: they must preprocess the data, select features, possibly reduce the dimension, select an estimation algorithm, and choose hyperparameters for each of these pipeline components. With new pipeline components comes a combinatorial explosion in the number of choices! In this work, we design a new AutoML system to address this challenge: an automated system to design a supervised learning pipeline. Our system uses matrix and tensor factorization as surrogate models to model the combinatorial pipeline search space.


What needles do sparse neural networks find in nonlinear haystacks

arXiv.org Machine Learning

Using a sparsity-inducing penalty in artificial neural networks (ANNs) avoids over-fitting, especially in situations where noise is high and the training set is small in comparison to the number of features. For linear models, such an approach provably also recovers the important features with high probability in certain regimes when the penalty parameter is well chosen. The typical way of setting the penalty parameter is to split the data set and perform cross-validation, which is (1) computationally expensive and (2) undesirable when the data set is already too small to be split further (for example, whole-genome sequence data). In this study, we establish the theoretical foundation for selecting the penalty parameter without cross-validation, based on bounding with high probability the infinity norm of the gradient of the loss function at zero under the zero-feature assumption. Our approach is a generalization of the universal threshold of Donoho and Johnstone (1994) to nonlinear ANN learning. We perform a set of comprehensive Monte Carlo simulations on a simple model, and the numerical results show the effectiveness of the proposed approach.
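For context, the classical universal threshold that this work generalizes is $\lambda = \sigma\sqrt{2\log n}$, applied with soft thresholding. The sketch below covers only that classical linear-denoising case, not the paper's rule for neural networks; the function names are hypothetical and the noise level `sigma` is assumed known.

```python
import math


def universal_threshold(sigma, n):
    """Classical universal threshold: sigma * sqrt(2 * log(n)) for n observations."""
    if n < 1 or sigma < 0:
        raise ValueError("need n >= 1 and sigma >= 0")
    return sigma * math.sqrt(2.0 * math.log(n))


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become exactly zero."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


# Hypothetical noisy coefficients: only the large entries survive thresholding.
coefficients = [0.3, -0.8, 5.2, 0.1, -4.7, 0.6]
lam = universal_threshold(sigma=1.0, n=len(coefficients))
print(lam)
print([soft_threshold(c, lam) for c in coefficients])
```

The appeal of such a rule, as in the abstract, is that the threshold comes from the noise level and sample size alone, so no data needs to be held out for cross-validation.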


EPARS: Early Prediction of At-risk Students with Online and Offline Learning Behaviors

arXiv.org Artificial Intelligence

Early prediction of students at risk (STAR) is an effective and significant means of providing timely intervention for dropout and suicide. Existing works mostly rely on either online or offline learning behaviors, which are not comprehensive enough to capture the whole learning process and lead to unsatisfactory prediction performance. We propose a novel algorithm (EPARS) that can predict STAR early in a semester by modeling online and offline learning behaviors. The online behaviors come from activity logs recorded when students use the online learning management system. The offline behaviors are derived from library check-in records. Our main observations are twofold. First, in marked contrast to good students, STAR rarely have regular and clear study routines. We devised a multi-scale bag-of-regularity method to extract the regularity of learning behaviors in a way that is robust to sparse data. Second, friends of STAR are more likely to be at risk. We constructed a co-occurrence network to approximate the underlying social network and encoded social homophily as features through network embedding. To validate the proposed algorithm, extensive experiments were conducted at an Asian university with 15,503 undergraduate students. The results indicate that EPARS outperforms baselines by 14.62% to 38.22% in predicting STAR.
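One way to picture the co-occurrence network is to count how often two students check in close together in time. The sketch below is only an illustration under assumptions the abstract does not state: check-ins are `(student_id, timestamp_in_seconds)` pairs, two check-ins co-occur when they fall within a fixed window, and the function name and the 60-second default are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(checkins, window_seconds=60):
    """Count pairs of distinct students whose check-ins fall within the window.

    checkins: iterable of (student_id, timestamp_in_seconds) pairs.
    Returns a Counter keyed by sorted (student_a, student_b) tuples; each
    count can serve as an edge weight in a co-occurrence network.
    """
    ordered = sorted(checkins, key=lambda c: c[1])
    counts = Counter()
    for i, (student_a, time_a) in enumerate(ordered):
        for student_b, time_b in ordered[i + 1:]:
            if time_b - time_a > window_seconds:
                break  # later check-ins are even further away in time
            if student_a != student_b:
                counts[tuple(sorted((student_a, student_b)))] += 1
    return counts


checkins = [("s1", 0), ("s2", 30), ("s3", 500), ("s1", 510), ("s2", 1000)]
print(cooccurrence_counts(checkins))
```

The resulting weighted edges could then be fed to a network embedding method to produce per-student features, which is the role the abstract assigns to the co-occurrence network.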


Neural Networks Out-of-Distribution Detection: Hyperparameter-Free Isotropic Maximization Loss, The Principle of Maximum Entropy, Cold Training, and Branched Inferences

arXiv.org Machine Learning

Current out-of-distribution detection (ODD) approaches present severe drawbacks that make their large-scale adoption in real-world applications impracticable. In this paper, we propose a novel loss called Hyperparameter-Free IsoMax that overcomes these limitations. We modified the original IsoMax loss to improve ODD performance while maintaining benefits such as high classification accuracy, fast and energy-efficient inference, and scalability. The global hyperparameter is replaced by learnable parameters to increase performance. Additionally, a theoretical motivation to explain the high ODD performance of the proposed loss is presented. Finally, to keep classification performance high, slightly different inference expressions are developed for classification and ODD. No access to out-of-distribution samples is required, as there is no hyperparameter to tune. Our solution works as a straightforward SoftMax loss drop-in replacement that can be incorporated without relying on adversarial training or validation, model structure changes, ensemble methods, or generative approaches. The experiments showed that our approach is competitive with state-of-the-art solutions while avoiding their additional requirements and undesired side effects.


A coronavirus mystery: How many people in L.A. actually have COVID-19?

Los Angeles Times

One of the most pressing questions public health officials are trying to answer about the coronavirus is how many people actually have been infected by it. Have a relatively significant portion of Californians been infected with the virus but survived without much problem? Or has the virus touched only a tiny sliver of California, suggesting the chances of serious illness are greater if you're infected? In April, controversial studies out of Stanford University and USC suggested the coronavirus has circulated much more widely than previously thought in Silicon Valley and Los Angeles County. Almost immediately, other epidemiologists around the country questioned whether those estimates were too high.


Serology assays to manage COVID-19

Science

In late 2019, China reported a cluster of atypical pneumonia cases of unknown etiology in Wuhan. The causative agent was identified as a new betacoronavirus, called severe acute respiratory syndrome–coronavirus 2 (SARS-CoV-2), that causes coronavirus disease 2019 (COVID-19) (1). The virus rapidly spread across the globe and caused a pandemic. Sequencing of the viral genome allowed for the development of nucleic acid–based tests that have since been widely used for the diagnosis of acute (current) SARS-CoV-2 infections (2). Development of serological assays, which measure the antibody responses induced by SARS-CoV-2 infection (past but not current infections), took longer.