Regression
Development of Machine learning algorithms to identify the Cobb angle in adolescents with idiopathic scoliosis based on lumbosacral joint efforts during gait (Case study)
Samadi, Bahare, Raison, Maxime, Mahaudens, Philippe, Detrembleur, Christine, Achiche, Sofiane
Objectives: To quantify the magnitude of spinal deformity in adolescent idiopathic scoliosis (AIS), the Cobb angle is measured on X-ray images of the spine. Continuous exposure to X-ray radiation to follow up the progression of scoliosis may lead to negative side effects for patients. Furthermore, manual measurement of the Cobb angle can differ by 10° or more due to intra/inter-observer variation. Therefore, the objective of this study is to identify the Cobb angle by developing an automated, radiation-free model using machine learning algorithms. Methods: Thirty participants with lumbar/thoracolumbar AIS (15° < Cobb angle < 66°) performed gait cycles. The lumbosacral (L5-S1) joint efforts during six gait cycles of each participant were used as features to train the algorithms. Various regression algorithms were implemented and run. Results: The decision tree regression algorithm achieved the best result, with a mean absolute error of 4.6° averaged over 10-fold cross-validation. Conclusions: This study shows that lumbosacral joint efforts during gait, as radiation-free data, can be used to identify the Cobb angle with machine learning algorithms. The proposed model can be considered an alternative, radiation-free method to X-ray radiography to assist clinicians in following up the progression of AIS.
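The evaluation protocol named in the Results (mean absolute error averaged over 10-fold cross-validation) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic and are not the study's gait data, there is a single made-up feature, and `fit_stump` is a hypothetical depth-1 regression tree standing in for the full decision tree regressor, whose settings the abstract does not give.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """Fit a depth-1 regression tree on one feature (hypothetical stand-in model)."""
    best = None
    for t in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= t]
        right = [yi for xi, yi in zip(X, y) if xi > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all feature values identical: predict the mean
        m = mean(y)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def kfold_mae(X, y, k=10, seed=0):
    """Average the held-out mean absolute error over k folds (assumes len(X) >= k)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    fold_maes = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_stump([X[i] for i in train], [y[i] for i in train])
        fold_maes.append(mean([abs(y[i] - model(X[i])) for i in held_out]))
    return mean(fold_maes)

# Synthetic data only: 30 samples, one feature, a noisy linear target.
rng = random.Random(42)
X = [rng.uniform(0, 1) for _ in range(30)]
y = [15 + 50 * x + rng.gauss(0, 3) for x in X]
cv_mae = kfold_mae(X, y)
print(f"10-fold CV MAE on synthetic data: {cv_mae:.2f}")
```

With 30 samples and 10 folds, each fold holds out three samples, so the individual fold errors are noisy and only their average is reported.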
Imbalanced Mixed Linear Regression
We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.
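For concreteness, the MLR setting described above is commonly written as below. The symbols $x_i$, $y_i$, $\beta_k$, $c_i$, $p_k$ and $\varepsilon_i$ are notation assumed here for illustration; the abstract itself only introduces $K$.

```latex
y_i = x_i^\top \beta_{c_i} + \varepsilon_i, \qquad
c_i \in \{1, \dots, K\}, \qquad
\Pr(c_i = k) = p_k, \qquad
\sum_{k=1}^{K} p_k = 1
```

In this notation, an imbalanced mixture is one in which the proportions $p_k$ differ substantially, so that some components contribute only a few of the observed samples.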
Machine Learning
The Machine Learning Specialization is a foundational online program created in collaboration between DeepLearning.AI and Stanford Online. This beginner-friendly program will teach you the fundamentals of machine learning and how to use these techniques to build real-world AI applications. This Specialization is taught by Andrew Ng, an AI visionary who has led critical research at Stanford University and groundbreaking work at Google Brain, Baidu, and Landing.AI to advance the AI field. This 3-course Specialization is an updated version of Andrew's pioneering Machine Learning course, rated 4.9 out of 5 and taken by over 4.8 million learners since it launched in 2012. It provides a broad introduction to modern machine learning, including supervised learning (multiple linear regression, logistic regression, neural networks, and decision trees), unsupervised learning (clustering, dimensionality reduction, recommender systems), and some of the best practices used in Silicon Valley for artificial intelligence and machine learning innovation (evaluating and tuning models, taking a data-centric approach to improving performance, and more).
Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic
Lu, Xuan, Ai, Wei, Wang, Yixin, Mei, Qiaozhu
While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and remote teams are influenced by, and would respond to, this and future shocks remains largely unknown. Software developers relied on remote collaboration long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated with certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock.
Overparameterized Linear Regression under Adversarial Attacks
Ribeiro, Antônio H., Schön, Thomas B.
We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.
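The role of the parameter norm can be made concrete with a standard identity for linear predictors: if the adversary may perturb the input by at most `delta` in the $\ell_p$-norm, the worst-case squared error equals (|residual| + delta times the dual norm of the parameter vector) squared. The sketch below assumes that identity and handles only $p = \infty$ (dual norm $\ell_1$) and $p = 2$ (dual norm $\ell_2$); the function names and numbers are illustrative and are not taken from the paper.

```python
import math

def dual_norm(beta, p):
    """Norm dual to the attack norm: l_inf attack -> l_1, l_2 attack -> l_2."""
    if p == math.inf:
        return sum(abs(b) for b in beta)
    if p == 2:
        return math.sqrt(sum(b * b for b in beta))
    raise ValueError("only p = 2 and p = inf are handled in this sketch")

def predict(x, beta):
    return sum(xi * bi for xi, bi in zip(x, beta))

def adversarial_sq_error(x, y, beta, delta, p):
    """Closed-form worst-case squared error under an l_p attack of radius delta."""
    residual = abs(y - predict(x, beta))
    return (residual + delta * dual_norm(beta, p)) ** 2

def worst_case_linf_input(x, y, beta, delta):
    """Explicit l_inf perturbation that attains the closed-form worst case."""
    s = 1.0 if y - predict(x, beta) >= 0 else -1.0
    return [xi - s * delta * math.copysign(1.0, bi) for xi, bi in zip(x, beta)]

# Illustrative numbers only.
x, y, beta, delta = [1.0, 2.0], 0.0, [0.5, -1.0], 0.1
closed_form = adversarial_sq_error(x, y, beta, delta, math.inf)
x_adv = worst_case_linf_input(x, y, beta, delta)
attained = (y - predict(x_adv, beta)) ** 2
print(closed_form, attained)
```

Because the $\ell_\infty$ attack is penalized through the $\ell_1$-norm of the parameters and the $\ell_2$ attack through their $\ell_2$-norm, a parameter vector whose norm grows as features are added makes the adversarial error grow even when the ordinary residual shrinks.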
An Analysis of Loss Functions for Binary Classification and Regression
This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based losses can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.
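The stated link between exponential loss and standardized residuals can be checked numerically. The sketch below assumes labels $y \in \{-1, +1\}$, a score `eta` interpreted as the log-odds, and the Pearson form of the standardized residual, $(y_{01} - p) / \sqrt{p(1 - p)}$; the paper's own parametrization may differ, for example by a constant factor in the score.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def exp_loss(y, eta):
    """Exponential loss as a function of the margin y * eta."""
    return math.exp(-y * eta)

def logistic_loss(y, eta):
    """Logistic loss as a function of the margin y * eta."""
    return math.log1p(math.exp(-y * eta))

def squared_pearson_residual(y, eta):
    """Squared standardized residual of a logistic model with log-odds eta."""
    p = sigmoid(eta)
    y01 = (y + 1) / 2  # map {-1, +1} to {0, 1}
    return (y01 - p) ** 2 / (p * (1 - p))

for y in (-1, 1):
    for eta in (-2.0, 0.5, 3.0):
        print(y, eta, exp_loss(y, eta), squared_pearson_residual(y, eta), logistic_loss(y, eta))
```

Under these assumptions the two middle columns coincide, because $(1 - p)/p = e^{-\eta}$ for $y = +1$ and $p/(1 - p) = e^{\eta}$ for $y = -1$. The same view suggests why a badly misclassified point dominates the exponential loss: its squared residual grows exponentially in the negative margin, while the logistic loss grows only linearly.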
Probabilistic Logistic Regression and Deep Learning
This article belongs to the series "Probabilistic Deep Learning". This weekly series covers probabilistic approaches to deep learning. The main goal is to extend deep learning models to quantify uncertainty, i.e., know what they do not know. In this article, we will introduce the concept of probabilistic logistic regression, a powerful technique that allows for the inclusion of uncertainty in the prediction process. We will explore how this approach can lead to more robust and accurate predictions, especially in cases where the data is noisy, or the model is overfitting.
A Benchmark Study by using various Machine Learning Models for Predicting Covid-19 trends
Kamelesun, D., Saranya, R., Kathiravan, P.
Machine learning and deep learning play vital roles in predicting diseases in the medical field. Machine learning algorithms are broadly classified as supervised, unsupervised, and reinforcement learning. This paper gives a detailed description of our experimental work, in which we used supervised machine learning algorithms to model outbreaks of the novel coronavirus, which has spread over the whole world and caused many deaths in one of the most disastrous pandemics in history. People suffered physically and economically during the lockdown. This work aims to better understand how machine learning, ensemble, and deep learning models work and how they are implemented on a real dataset. We analyze the current trend or pattern of the coronavirus and then predict future Covid-19 confirmed or new cases by training on past Covid-19 data with machine learning algorithms: Linear Regression, Polynomial Regression, K-nearest neighbor, Decision Tree, Support Vector Machine, and Random Forest. The Decision Tree and Random Forest algorithms perform better than SVR in this work. The performance of SVR and lasso regression is low in all prediction areas, because it is difficult for SVR to separate the data with a hyperplane for this type of problem, so SVR mostly gives lower performance here. Ensemble (Voting, Bagging, and Stacking) and deep learning (ANN) models also predict well. After prediction, we evaluated the models using MAE, MSE, RMSE, and MAPE. This work aims to find the trend/pattern of Covid-19.
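The four evaluation metrics named above have standard definitions, sketched below in plain Python. The function name `regression_metrics` and the example numbers are illustrative and are not taken from the paper; MAPE is reported here as a percentage and is undefined when a true value is zero.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) for paired true/predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the true value, so it assumes no true value is zero.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Made-up case counts, for illustration only.
metrics = regression_metrics([100, 200, 300], [110, 190, 330])
print(metrics)
```

MSE and RMSE weight large errors more heavily than MAE, while MAPE is scale-free, which is one reason to report several of them when case counts vary widely across regions or time.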
Regression - Shrijayan Rajendran - Medium
Regression is a statistical method used to analyze the relationship between one or more independent variables and a continuous dependent variable. It can be used to predict the value of the dependent variable based on the values of the independent variables. Linear regression is the most common type of regression and is used when the relationship between the variables is linear. Non-linear regression is used when the relationship between the variables is non-linear. Other types of regression include logistic regression, which is used when the dependent variable is binary, and polynomial regression, which is used when the relationship between the variables is non-linear but can be modeled by a polynomial equation.
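As a minimal sketch of the linear case, the least-squares line for one independent variable has a closed form. The helper name `fit_line` and the data are illustrative assumptions. The second call shows one simple way a polynomial relationship can be fitted with the same routine, by regressing on the transformed feature x squared; a full polynomial regression would include several powers of x.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Linear relationship: y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Quadratic relationship y = x^2 + 1, fitted as a line in the feature z = x^2.
x = [0, 1, 2, 3]
quad_slope, quad_intercept = fit_line([xi ** 2 for xi in x], [1, 2, 5, 10])

print(slope, intercept, quad_slope, quad_intercept)
```

The model stays linear in its coefficients even when the feature is a power of x, which is why polynomial regression can be solved with the same least-squares machinery.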
Is a Small Dataset Risky? Some reflections and tests on the use…
I recently wrote an article about the risks of using the train_test_split() function provided by the scikit-learn Python package. That article drew a lot of comments, some positive and others raising concerns. My thesis was: be careful when you use the train_test_split() function, because different seeds may produce very different models. The main concern raised was that the train_test_split() function does not behave strangely; the problem is that I used a small dataset to demonstrate my thesis. In this article, I investigate how the performance of a Linear Regression model changes as the dataset size varies.
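The effect under discussion can be sketched without scikit-learn. The code below is an illustration under stated assumptions, not the article's experiment: the data are synthetic, `split_mse` uses a hand-rolled seeded shuffle as a stand-in for `train_test_split(..., random_state=seed)`, and the model is a one-feature least-squares line. It measures how much the test error moves from seed to seed for a small and a large dataset.

```python
import random
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def split_mse(xs, ys, seed, test_fraction=0.2):
    """Seeded random split (hand-rolled stand-in), then test MSE of the fitted line."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mean([(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test])

def mse_spread(n, seeds=range(30)):
    """Standard deviation of the test MSE across split seeds, for a dataset of size n."""
    rng = random.Random(0)  # synthetic data: y = 2x + 1 plus unit Gaussian noise
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return stdev([split_mse(xs, ys, s) for s in seeds])

spread_small = mse_spread(20)
spread_large = mse_spread(2000)
print(f"seed-to-seed spread of test MSE: n=20 -> {spread_small:.3f}, n=2000 -> {spread_large:.3f}")
```

With 20 samples the test set holds only four points, so its error estimate depends heavily on which points the seed happens to select; with 2000 samples the test set is large enough that the seed matters far less.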