Goto

Collaborating Authors

 Regression


Identifying Genetic Risk Factors via Sparse Group Lasso with Group Graph Structure

arXiv.org Machine Learning

Genome-wide association studies (GWA studies or GWAS) investigate the relationships between genetic variants such as single-nucleotide polymorphisms (SNPs) and individual traits. Recently, incorporating biological priors together with machine learning methods in GWA studies has attracted increasing attention. However, in real-world, nucleotide-level bio-priors have not been well-studied to date. Alternatively, studies at gene-level, for example, protein--protein interactions and pathways, are more rigorous and legitimate, and it is potentially beneficial to utilize such gene-level priors in GWAS. In this paper, we proposed a novel two-level structured sparse model, called Sparse Group Lasso with Group-level Graph structure (SGLGG), for GWAS. It can be considered as a sparse group Lasso along with a group-level graph Lasso. Essentially, SGLGG penalizes the nucleotide-level sparsity as well as takes advantages of gene-level priors (both gene groups and networks), to identifying phenotype-associated risk SNPs. We employ the alternating direction method of multipliers algorithm to optimize the proposed model. Our experiments on the Alzheimer's Disease Neuroimaging Initiative whole genome sequence data and neuroimage data demonstrate the effectiveness of SGLGG. As a regression model, it is competitive to the state-of-the-arts sparse models; as a variable selection method, SGLGG is promising for identifying Alzheimer's disease-related risk SNPs.


The math of neural networks

#artificialintelligence

Building neural networks is at the heart of any deep learning technique. Neural networks is a series of forward and backward propagations to train paramters in the model, and it is built on the unit of logistic regression classifiers. This post will expand based on the math of logistic regression to build more advanced neural networks in mathematical terms. A neural network is composed of layers, and there are three types of layers in a neural network: one input layer, one output layer, and one or many hidden layers. Each layer is built based on the same structure of logistic regression classifier, with a linear transformation and an activation function. Given a fixed set of input layer and output layer, we can build more complex neural network by adding more hidden layers.


Python Overtaking R?

@machinelearnbot

I just read two articles that claim that Python is overtaking R for data science and machine learning. From user comments, I learned that R is still strong in certain tasks. I will survey what these tasks are. The first article by Vincent Granville from DSC uses proxy metrics (as opposed to asking the users). He uses statistics from Google Trends, Indeed job search terms, and Analytic Talent (DSC job database) to conclude that Python has overtaken R. One is led to ask if one group of users (say Python's) is a more active googler.


Building Machine Learning Model is fun using Orange - Analytics Vidhya

@machinelearnbot

In the growing market of Data Science, there are quite some details that people miss out on. These are tools or techniques that can make you a better performer in the field and also ease your efforts and help you focus on the analytics rather than the trivialities. Here, I will introduce you to another GUI based tool โ€“ Orange. This tool is great for beginners who wish to visualize patterns and understand their data without really knowing how to code. In my previous article, I presented you another GUI based tool KNIME, follow this link to learn about it further.


Crowdsourcing Predictors of Residential Electric Energy Usage

arXiv.org Machine Learning

Crowdsourcing has been successfully applied in many domains including astronomy, cryptography and biology. In order to test its potential for useful application in a Smart Grid context, this paper investigates the extent to which a crowd can contribute predictive hypotheses to a model of residential electric energy consumption. In this experiment, the crowd generated hypotheses about factors that make one home different from another in terms of monthly energy usage. To implement this concept, we deployed a web-based system within which 627 residential electricity customers posed 632 questions that they thought predictive of energy usage. While this occurred, the same group provided 110,573 answers to these questions as they accumulated. Thus users both suggested the hypotheses that drive a predictive model and provided the data upon which the model is built. We used the resulting question and answer data to build a predictive model of monthly electric energy consumption, using random forest regression. Because of the sparse nature of the answer data, careful statistical work was needed to ensure that these models are valid. The results indicate that the crowd can generate useful hypotheses, despite the sparse nature of the dataset.


Feature selection in high-dimensional dataset using MapReduce

arXiv.org Machine Learning

The exponential growth of data generation, measurements and collection in scientific and engineering disciplines leads to the availability of huge and high-dimensional datasets, in domains as varied as text mining, social network, astronomy or bioinformatics to name a few. The only viable path to the analysis of such datasets is to rely on data-intensive distributed computing frameworks [1]. MapReduce has in the last decade established itself as a reference programming model for distributed computing. The model is articulated around two main classes of functions, mappers and reducers, which greatly decrease the complexity of a distributed program while allowing to express a wide range of computing tasks. MapReduce was popularised by Google research in 2008 [2], and may be executed on parallel computing platforms ranging from specialised hardware units such as parallel field programmable gate arrays (FPGAs) and graphics processing units, to large clusters of commodity machine using for example the Hadoop or Spark frameworks [2]-[4]. In particular, the expressiveness of the MapReduce programming model has led to the design of advanced distributed data processing libraries for machine learning and data mining, such as Hadoop Mahout and Spark MLlib. Many of the standard supervised and unsupervised learning techniques (linear and logistic regression, naive Bayes, SVM, random forest, PCA) are now available from these libraries [5]-[7]. Little attention has however yet been given to feature selection algorithms (FSA), which form an essential component of machine learning and data mining workflows. Besides reducing a dataset size, FSA also generally allow to improve the performance of classification and regression models by selecting the most relevant features and reducing the noise in a dataset [8].


Time Series Forecasting With Prophet

@machinelearnbot

Prophet is an open source forecasting tool built by Facebook. It can be used for time series modeling and forecasting trends into the future. Prophet is interesting because it's both sophisticated and quite easy to use, so it's possible to generate very good forecasts with relatively little effort or domain knowledge in time series analysis. There are a few requirements you'll need to meet in order to use the library. It uses PyStan to do all of its inference, so PyStan has to be installed.


Optimal Sub-sampling with Influence Functions

arXiv.org Machine Learning

Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the concept of an asymptotically linear estimator and the associated influence function leads to optimal sampling procedures for a wide class of popular models. Furthermore, for linear regression models which have well-studied procedures for non-uniform sub-sampling, we show our optimal influence function based method outperforms previous approaches. We empirically show the improved performance of our method on real datasets.


Visualizing Cross-validation Code

@machinelearnbot

Let's visualize to improve your prediction... Let us say, you are writing a nice and clean Machine Learning code (e.g. You code is OK, first you divided your dataset into two parts, "Training Set and Testing Set" as usual with the function like train_test_split and with some random factor. Your prediction could be slightly under or overfit, like the figures below. As the name of the suggests, cross-validation is the next fun thing after learning Linear Regression because it helps to improve your prediction using the K-Fold strategy. What is K-Fold you asked?


Ford Motor Credit tests AI's ability to spot overlooked borrowers

#artificialintelligence

Jim Moynes, vice president of risk management at Ford Motor Credit in Dearborn, Mich., first became interested in using machine learning to improve car loan underwriting several years ago. "We were watching what others were working on," he said. "We like to be innovative and try to stay up with what's going on." The company recently ran an experiment to see if machine learning could help its underwriters better understand the loan applications it receives. It was a champion vs. challenger test: Moynes' team took several years of loan data, removed all personally identifiable information from it, and gave it to ZestFinance, a provider of machine-learning-based online lending software, and its own modeling team, which creates logistic regression models to predict potential borrowers' creditworthiness. Each team ran the loan application data through its models and predicted the future performance of the loans.