 Regression


Identifying tumor cells at the single-cell level using machine learning - Genome Biology

#artificialintelligence

Cancer is a disease that stems from the disruption of cellular state. Through genetic perturbations, tumor cells attain cellular states that give them a proliferative advantage over the surrounding normal tissue [1]. The inherent variability of this process has hampered efforts to find highly effective common therapies, thereby ushering in the need for precision medicine [2]. The scale of single-cell experiments is poised to revolutionize personalized medicine by effectively characterizing the complete heterogeneity within a tumor for each individual patient [3, 4]. The recent expansion of single-cell sequencing technologies has exponentially increased the scale of knowledge attainable through a single biological experiment [5].


System Norm Regularization Methods for Koopman Operator Approximation

arXiv.org Artificial Intelligence

Approximating the Koopman operator from data is numerically challenging when many lifting functions are considered. Even low-dimensional systems can yield unstable or ill-conditioned results in a high-dimensional lifted space. In this paper, Extended Dynamic Mode Decomposition (DMD) and DMD with control, two methods for approximating the Koopman operator, are reformulated as convex optimization problems with linear matrix inequality constraints. Asymptotic stability constraints and system norm regularizers are then incorporated as methods to improve the numerical conditioning of the Koopman operator. Specifically, the H-infinity norm is used to penalize the input-output gain of the Koopman system. Weighting functions are then applied to penalize the system gain at specific frequencies. These constraints and regularizers introduce bilinear matrix inequality constraints to the regression problem, which are handled by solving a sequence of convex optimization problems. Experimental results using data from an aircraft fatigue structural test rig and a soft robot arm highlight the advantages of the proposed regression methods.
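For orientation, the unconstrained least-squares regression that Extended DMD is usually built on can be sketched as below. This is only a baseline illustration, not the paper's convex reformulation: the lifting functions, the toy system, and the names `lift` and `edmd` are assumptions made here, and the spectral-radius check is a post-hoc diagnostic rather than the stability constraint the abstract describes.

```python
import numpy as np

def lift(x):
    """Hypothetical lifting functions: the state and its elementwise square."""
    x = np.atleast_2d(x)
    return np.vstack([x, x**2])

def edmd(X, Y):
    """Least-squares Koopman matrix U such that lift(Y) ~= U @ lift(X).

    X and Y have shape (n_states, n_snapshots); column k of Y is the
    successor of column k of X.
    """
    Psi_X = lift(X)
    Psi_Y = lift(Y)
    # Solve min_U ||Psi_Y - U Psi_X||_F by transposing into standard lstsq form.
    U_T, *_ = np.linalg.lstsq(Psi_X.T, Psi_Y.T, rcond=None)
    return U_T.T

# Toy data from the stable scalar system x_{k+1} = 0.9 x_k (illustrative only).
x = np.empty(50)
x[0] = 1.0
for k in range(49):
    x[k + 1] = 0.9 * x[k]
X = x[:-1][None, :]
Y = x[1:][None, :]

U = edmd(X, Y)
# Discrete-time asymptotic stability corresponds to a spectral radius below 1.
spectral_radius = max(abs(np.linalg.eigvals(U)))
```

With many lifting functions the matrix `Psi_X` tends to become ill-conditioned, which is the situation that motivates the constraints and regularizers discussed in the abstract.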


Copulaboost: additive modeling with copula-based model components

arXiv.org Machine Learning

We propose a type of generalised additive model with model components based on pair-copula constructions, with prediction as the main aim. The model components are designed such that our model may capture potentially complex interaction effects in the relationship between the response and the covariates. In addition, our model does not require discretisation of continuous covariates, and is therefore suitable for problems with many such covariates. Further, we have designed a fitting algorithm inspired by gradient boosting, as well as efficient procedures for model selection and evaluation of the model components, through constraints on the model space and approximations that speed up time-costly computations. In addition to being absolutely necessary for our model to be a realistic alternative in higher dimensions, these techniques may also be useful as a basis for designing efficient model selection algorithms for other types of copula regression models. We have explored the characteristics of our method in a simulation study, in particular comparing it to natural alternatives, such as logic regression, classic boosting models and penalised logistic regression. We have also illustrated our approach on the Wisconsin breast cancer dataset and on the Boston housing dataset. The results show that our method has a prediction performance that is either better than or comparable to the other methods, even when the proportion of discrete covariates is high.


A review on longitudinal data analysis with random forest in precision medicine

arXiv.org Artificial Intelligence

Precision medicine provides customized treatments to patients based on their characteristics and is a promising approach to improving treatment efficiency. Large-scale omics data are useful for patient characterization, but their measurements often change over time, leading to longitudinal data. Random forest is one of the state-of-the-art machine learning methods for building prediction models, and can play a crucial role in precision medicine. In this paper, we review extensions of the standard random forest method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate responses and further categorize the repeated measurements according to whether the time effect is relevant. Information on available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.


EFI: A Toolbox for Feature Importance Fusion and Interpretation in Python

arXiv.org Artificial Intelligence

This paper presents an open-source Python toolbox called Ensemble Feature Importance (EFI) to provide machine learning (ML) researchers, domain experts, and decision makers with robust and accurate feature importance quantification and more reliable mechanistic interpretation of feature importance for prediction problems using fuzzy sets. The toolkit was developed to address uncertainties in feature importance quantification and lack of trustworthy feature importance interpretation due to the diverse availability of machine learning algorithms, feature importance calculation methods, and dataset dependencies. EFI merges results from multiple machine learning models with different feature importance calculation approaches using data bootstrapping and decision fusion techniques, such as mean, majority voting and fuzzy logic. The main attributes of the EFI toolbox are: (i) automatic optimisation of ML algorithms, (ii) automatic computation of a set of feature importance coefficients from optimised ML algorithms and feature importance calculation techniques, (iii) automatic aggregation of importance coefficients using multiple decision fusion techniques, and (iv) fuzzy membership functions that show the importance of each feature to the prediction task. The key modules and functions of the toolbox are described, and a simple example of their application is presented using the popular Iris dataset.
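The idea of fusing importance coefficients from several models can be illustrated with a small sketch. This is not the EFI toolbox API: the function names, the `top_k` parameter, and the importance values are hypothetical, and the vote-counting rule shown is a simple stand-in that may differ from the toolbox's own majority voting.

```python
import numpy as np

def normalise(importances):
    """Scale each model's absolute importances so that they sum to 1."""
    imp = np.abs(np.asarray(importances, dtype=float))
    return imp / imp.sum(axis=1, keepdims=True)

def mean_fusion(importances):
    """Average the normalised importances across models."""
    return normalise(importances).mean(axis=0)

def vote_fusion(importances, top_k=2):
    """Count how many models rank each feature among their top_k."""
    imp = normalise(importances)
    top = np.argsort(-imp, axis=1)[:, :top_k]
    votes = np.zeros(imp.shape[1], dtype=int)
    for row in top:
        votes[row] += 1
    return votes

# Made-up importances: one row per model, one column per feature.
importances = [
    [0.5, 0.3, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]
fused = mean_fusion(importances)
votes = vote_fusion(importances)
```

Normalising before fusing keeps a model with large raw coefficients from dominating the average, which is one reason to combine rankings as well as magnitudes.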


Learn Linear Regression For Machine Learning

#artificialintelligence

Machine learning allows an algorithm to become more accurate at predicting outcomes without being explicitly programmed to do so. Prediction is only one of the things ML can do; you can do much more cool stuff with it, and you'll learn all about it once you go deeper. So far, we've done a lot of things with data. We've handled missing values and string data, and we'll learn to do much more cool stuff in the future.


An Empirical Analysis of the Laplace and Neural Tangent Kernels

arXiv.org Artificial Intelligence

The neural tangent kernel is a kernel function defined over the parameter distribution of an infinite-width neural network. Despite the impracticality of this limit, the neural tangent kernel has allowed for a more direct study of neural networks and a gaze through the veil of their black box. More recently, it has been shown theoretically that the Laplace kernel and the neural tangent kernel share the same reproducing kernel Hilbert space in the space of $\mathbb{S}^{d-1}$, alluding to their equivalence. In this work, we analyze the practical equivalence of the two kernels. We first do so by matching the kernels exactly and then by matching posteriors of a Gaussian process. Moreover, we analyze the kernels in $\mathbb{R}^d$ and experiment with them in the task of regression.
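As a concrete reference point for regression with the Laplace kernel, the sketch below fits kernel ridge regression in $\mathbb{R}^d$, whose predictor has the same form as a Gaussian process posterior mean. It does not reproduce the paper's experiments; the kernel form $k(x, y) = \exp(-\lVert x - y \rVert / \sigma)$, the `bandwidth` and `ridge` values, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def laplace_kernel(A, B, bandwidth=1.0):
    """Laplace kernel exp(-||a - b|| / bandwidth) for all row pairs of A and B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)

def fit_kernel_ridge(X, y, ridge=1e-6, bandwidth=1.0):
    """Solve (K + ridge * I) alpha = y for the dual coefficients."""
    K = laplace_kernel(X, X, bandwidth)
    return np.linalg.solve(K + ridge * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate the fitted function at new points."""
    return laplace_kernel(X_new, X_train, bandwidth) @ alpha

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(40, 2))
y_train = np.sin(3.0 * X_train[:, 0]) + X_train[:, 1]

alpha = fit_kernel_ridge(X_train, y_train)
train_pred = predict(X_train, alpha, X_train)
K_train = laplace_kernel(X_train, X_train)
```

Swapping `laplace_kernel` for another kernel function with the same signature is all that is needed to compare kernels on the same data.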


A machine learning approach to predict the structural and magnetic properties of Heusler alloy families

arXiv.org Artificial Intelligence

A random forest (RF) regression model is used to predict the lattice constant, magnetic moment and formation energies of full Heusler alloys, half Heusler alloys, inverse Heusler alloys and quaternary Heusler alloys based on existing as well as indigenously prepared databases. A prior analysis of the distribution of the response variables found that, in most cases, the data are not normally distributed. The RF model is sufficiently accurate in predicting the response variables on the test data and is robust against overfitting, outliers, multicollinearity and the distribution of data points. The parity plots of the machine learning predicted values against the values computed using density functional theory (DFT) show linear behavior, with adjusted $R^2$ values in the range of 0.80 to 0.94 for all the predicted properties for different types of Heusler alloys. Feature importance analysis shows that the number of valence electrons plays an important role in the prediction of most of the predicted outcomes. Case studies with one full Heusler alloy and one quaternary Heusler alloy are also presented, comparing the machine learning predictions with our earlier theoretically calculated values and experimentally measured results, suggesting high accuracy of the model predictions.


One Week of Data Science in Python - New 2022!

#artificialintelligence

Perform statistical analysis on real world datasets
Understand feature engineering strategies and tools
Perform one hot encoding and normalization
Understand the difference between normalization and standardization
Deal with missing data using pandas
Change pandas DataFrame datatypes
Define a function and apply it to a pandas DataFrame column
Perform pandas operations and filtering
Calculate and display a correlation matrix heatmap
Perform data visualization using the Seaborn and Matplotlib libraries
Plot single line plots, pie charts and multiple subplots using Matplotlib
Plot pairplots, countplots, and correlation heatmaps using Seaborn
Plot distribution plots (distplot), histograms and scatterplots
Understand machine learning regression fundamentals
Learn how to optimize model parameters using least sum of squares
Split the data into training and testing sets using the SK Learn library
Perform data visualization and basic exploratory data analysis
Build, train and test our first regression model in Scikit-Learn
Assess trained machine learning regression model performance
Understand the theory and intuition behind boosting
Train an XG-boost algorithm in Scikit-Learn to solve regression type problems
Train several machine learning classifier models such as Logistic Regression, Support Vector Machine, K-Nearest Neighbors, and Random Forest Classifier
Assess trained model performance using various KPIs such as accuracy, precision, recall, F1-score, AUC and ROC
Compare the performance of the classification model using various KPIs
Apply AutoGluon to solve regression and classification type problems
Use the AutoGluon library to prototype AI/ML models using a few lines of code
Plot various models' performance on a model leaderboard
Optimize regression and classification model hyperparameters using SK-Learn
Learn the difference between various hyperparameter optimization strategies such as grid search, randomized search, and Bayesian optimization


Regression Analysis Is Exceedingly Difficult: How to Master It Without Coding

#artificialintelligence

Regression analysis is a technique for predicting future outcomes from historical data. In machine learning, it is particularly useful when training models on large data sets, where the goal is a measurable, numeric output. Regression analysis is a complex technique, and there are many ways to perform it. Here, I will go over the basics of regression analysis using a simple example.
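As a minimal illustration of those basics, ordinary least squares fits a straight line to past observations and uses it to predict a new value. The sketch below is not the article's own example: the function names and the data (a hypothetical series that lies exactly on the line y = 3 + 2x) are made up here.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x

# Hypothetical historical data lying exactly on y = 3 + 2x.
history_x = [1, 2, 3, 4, 5]
history_y = [5, 7, 9, 11, 13]

slope, intercept = fit_line(history_x, history_y)
forecast = predict(slope, intercept, 6)
```

Real data will not lie exactly on a line, so in practice the fitted slope and intercept only minimise the squared error rather than reproduce every point.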