Regression
Distributed Sparse Linear Regression under Communication Constraints
In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model, under severe communication constraints. We propose several two-round distributed schemes, whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators, but send to the fusion center only very few values. On the theoretical front, we analyze one of these schemes and prove that with high probability it achieves exact support recovery at low signal-to-noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better than, more communication-intensive approaches.
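As a rough illustration of the communication pattern described above, the sketch below lets each machine send only a few coordinate indices and lets the fusion center pick the support by majority vote. This is not the authors' scheme: the voting rule, the function names `local_top_indices` and `fuse_by_votes`, and all data settings are assumptions, and ordinary least squares stands in for the debiased lasso to keep the example short.

```python
import numpy as np

def local_top_indices(X, y, k):
    # Stand-in local estimator (assumption): ordinary least squares replaces
    # the debiased lasso that the abstract describes.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Send only the k indices with the largest absolute estimates.
    return np.argsort(-np.abs(beta_hat))[:k]

def fuse_by_votes(messages, d, k):
    # Fusion center (assumption): count how often each index was reported
    # and keep the k most frequently reported indices.
    votes = np.zeros(d, dtype=int)
    for idx in messages:
        votes[idx] += 1
    top = np.argsort(-votes, kind="stable")[:k]
    return np.sort(top)

# Synthetic data (assumption): 20 machines, 200 samples each, 50 features,
# 3 of which carry signal.
rng = np.random.default_rng(0)
d, k, n_local, n_machines = 50, 3, 200, 20
true_support = np.array([4, 17, 31])
beta = np.zeros(d)
beta[true_support] = 0.3

messages = []
for _ in range(n_machines):
    X = rng.standard_normal((n_local, d))
    y = X @ beta + rng.standard_normal(n_local)
    messages.append(local_top_indices(X, y, k))

estimated_support = fuse_by_votes(messages, d, k)
print(estimated_support)
```

Each machine here transmits `k` integers instead of a length-`d` vector, which is the sense in which the per-machine communication is sublinear in the dimension.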
Machine learning modeling for the prediction of plastic properties in metallic glasses
Metallic glasses are among the most interesting mechanical materials studied in recent years, but as amorphous solids, they differ strongly from their crystalline counterparts. This matter can be addressed with the development and application of predictive techniques capable of describing the plastic regime. Here, machine learning models were employed for the prediction of plastic properties in CuZr metallic glasses. To this aim, 100 different samples were subjected to tensile tests by means of molecular dynamics simulations. A total of 17 material properties were calculated and explored using statistical analysis. Plastic properties were found to correlate strongly with stoichiometry, temperature, and structural and elastic properties. Three regression models were employed for the prediction of six plastic properties. Linear and Ridge regressions delivered the best prediction capability, with coefficients of determination above $\sim$80% for three plastic properties, whereas Lasso regression rendered lower performance, with coefficients of determination above $\sim$60% for two plastic properties. Overall, our work shows that molecular dynamics simulations together with machine learning models can provide a framework for the prediction of plastic behavior of complex materials.
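The kind of comparison reported above (fitting linear and ridge models and scoring them by the coefficient of determination) can be sketched as follows. The data are synthetic and the helper names `fit_ridge` and `r2_score` are assumptions; nothing here reproduces the paper's data set or results, and Lasso is omitted for brevity.

```python
import numpy as np

def fit_ridge(X, y, alpha=0.0):
    # Closed-form ridge fit with an unpenalized intercept; alpha=0 gives
    # ordinary least squares.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (assumption): 100 samples, 5 input properties.
rng = np.random.default_rng(0)
n_samples, n_features = 100, 5
X = rng.standard_normal((n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + 0.3 * rng.standard_normal(n_samples)

# Simple hold-out split: first 80 samples for training, last 20 for testing.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

scores = {}
for name, alpha in [("linear", 0.0), ("ridge", 1.0)]:
    coef, intercept = fit_ridge(X_train, y_train, alpha)
    scores[name] = r2_score(y_test, X_test @ coef + intercept)
print(scores)
```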
GSR: A Generalized Symbolic Regression Approach
Tohme, Tony, Liu, Dehong, Youcef-Toumi, Kamal
Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method is competitive with strong SR benchmark methods, achieving promising experimental performance on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
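To make the idea of relating a transformation of the target to a weighted sum of basis functions concrete, here is a minimal sketch. The basis functions and the log transformation are fixed by hand, which is an assumption made for illustration; GSR as described searches over such choices with genetic programming, and that search is not shown.

```python
import numpy as np

# Synthetic relationship (assumption): y = exp(2x + 1), so that
# log(y) = 1 + 2x is a weighted sum of simple basis functions of x.
x = np.linspace(0.0, 1.0, 50)
y = np.exp(2.0 * x + 1.0)

# Hand-picked basis functions of the independent variable.
basis_names = ["1", "x", "x^2"]
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

# Hand-picked transformation of the target variable.
target = np.log(y)

# With the basis and transformation fixed, the weights are a least-squares fit.
weights, *_ = np.linalg.lstsq(Phi, target, rcond=None)
for name, weight in zip(basis_names, weights):
    print(f"{name}: {weight:.6f}")
```

The recovered weights are close to 1, 2, and 0, matching log(y) = 1 + 2x.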
Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation
Stochastic proximal point (SPP) methods have recently gained attention for stochastic optimization, offering strong convergence guarantees and greater robustness than classic stochastic gradient descent (SGD) methods at little to no additional computational cost. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. In particular, under smoothness and quadratic growth conditions, we show that M-SPP with minibatch size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias-decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance-decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of the noise level of the model on the convergence rate. In the complementary small-$T$-large-$n$ regime, we provide a two-phase extension of M-SPP to achieve comparable convergence rates. Moreover, we derive a near-tight high-probability (over the randomness of data) bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP. Numerical evidence is provided to support our theoretical predictions when instantiated for Lasso and logistic regression models.
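A minibatch proximal point step minimizes the minibatch loss plus a quadratic penalty on the distance from the current iterate. The sketch below shows this for plain least squares, where the step has a closed form. It is an illustration under assumptions, not the paper's M-SPP: there is no composite (e.g. Lasso) term, the step size `gamma` is held constant, and the function name and data are hypothetical.

```python
import numpy as np

def minibatch_prox_point_ls(X, y, batch_size, n_iters, gamma, rng):
    # Each iteration solves
    #   min_w  (1 / (2 n)) * ||X_b w - y_b||^2 + (1 / (2 gamma)) * ||w - w_prev||^2
    # on a freshly drawn minibatch (X_b, y_b) of size n.
    n_total, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        batch = rng.choice(n_total, size=batch_size, replace=False)
        Xb, yb = X[batch], y[batch]
        # Setting the gradient to zero gives a linear system in w.
        A = Xb.T @ Xb / batch_size + np.eye(d) / gamma
        b = Xb.T @ yb / batch_size + w / gamma
        w = np.linalg.solve(A, b)
    return w

# Synthetic linear model (assumption).
rng = np.random.default_rng(0)
n_total, d = 1000, 5
w_true = np.array([1.0, -1.0, 0.5, 2.0, 0.0])
X = rng.standard_normal((n_total, d))
y = X @ w_true + 0.1 * rng.standard_normal(n_total)

w_hat = minibatch_prox_point_ls(X, y, batch_size=20, n_iters=100, gamma=1.0, rng=rng)
print(np.linalg.norm(w_hat - w_true))
```

Unlike an SGD step, the update solves a small regularized problem exactly on each minibatch, which is what makes it less sensitive to the choice of `gamma`.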
A Bayesian Robust Regression Method for Corrupted Data Reconstruction
Fan, Zheyi, Li, Zhaohui, Wang, Jingyan, Lin, Dennis K. J., Xiong, Xiao, Hu, Qingpei
Because of the widespread existence of noise and data corruption, recovering the true regression parameters with a certain proportion of corrupted response variables is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few methods perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is often available from historical data or engineering experience, and by incorporating prior information into a robust regression method, we develop an effective robust regression method that can resist adaptive adversarial attacks. First, we propose the novel TRIP (hard Thresholding approach to Robust regression with sImple Prior) algorithm, which improves the breakdown point when facing adaptive adversarial attacks. Then, to improve the robustness and reduce the estimation error caused by the inclusion of priors, we use the idea of Bayesian reweighting to construct the more robust BRHT (robust Bayesian Reweighting regression via Hard Thresholding) algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions, and extensive experiments show that under different types of dataset attacks, our algorithms outperform other benchmark ones. Finally, we apply our methods to a data-recovery problem in a real-world application involving a space solar array, demonstrating their good applicability.
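For orientation, the sketch below shows a generic alternating hard-thresholding scheme for robust least squares: fit on the points currently believed clean, then keep the points with the smallest residuals. It is not TRIP or BRHT, since it uses no prior and no Bayesian reweighting, and the corruption in the example is random rather than adaptive. The function name and the data are assumptions.

```python
import numpy as np

def robust_ls_hard_threshold(X, y, n_keep, n_iter=20):
    # Start by trusting every point, then alternate between
    # (1) least squares on the trusted set and
    # (2) hard thresholding: trust the n_keep points with smallest residuals.
    keep = np.arange(len(y))
    for _ in range(n_iter):
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        residuals = np.abs(y - X @ w)
        keep = np.argsort(residuals)[:n_keep]
    return w, keep

# Synthetic data (assumption): 20% of the responses get a large positive shift.
rng = np.random.default_rng(0)
n, d, n_corrupt = 200, 3, 40
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)
corrupted = rng.choice(n, size=n_corrupt, replace=False)
y[corrupted] += rng.uniform(5.0, 10.0, size=n_corrupt)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
w_robust, kept = robust_ls_hard_threshold(X, y, n_keep=n - n_corrupt)

print("OLS error:   ", np.linalg.norm(w_ols - w_true))
print("robust error:", np.linalg.norm(w_robust - w_true))
```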
All Machine Learning Algorithms You Should Know for 2023
Linear/Logistic Regression: a statistical method for modeling the linear relationship between a dependent variable and one or more independent variables. It can be used to understand the relationships between variables based on the t-tests and coefficients.
Decision Trees: a type of machine learning algorithm that creates a tree-like model of decisions and their possible consequences. They are useful for understanding the relationships between variables by looking at the rules that split the branches.
Principal Component Analysis (PCA): a dimensionality reduction technique that projects the data onto a lower-dimensional space while retaining as much variance as possible. PCA can be used to simplify the data or to determine feature importance.
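As a small illustration of the PCA description, the sketch below centers the data, takes a singular value decomposition, and projects onto the leading directions. The helper name `pca` and the synthetic two-dimensional data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then use the SVD: the rows of Vt are the principal
    # directions, ordered by decreasing variance.
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    projected = X_centered @ components.T
    return projected, components, explained_variance_ratio

# Synthetic data (assumption): most variance lies along the (1, 1) diagonal.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 2)) * np.array([3.0, 0.3])
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
rotation = np.array([[direction[0], direction[1]],
                     [-direction[1], direction[0]]])
X = latent @ rotation

projected, components, ratio = pca(X, n_components=1)
print(components[0], ratio[0])
```

The first component comes out (up to sign) along the diagonal, and a single dimension retains nearly all of the variance in this example.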
Machine Learning to Estimate Gross Loss of Jewelry for Wax Patterns
Jain, Mihir, Jain, Kashish, Mane, Sandip
In mass manufacturing of jewellery, the gross loss is estimated before manufacturing to calculate the wax weight of the pattern that would be investment cast to make multiple identical pieces of jewellery. Machine learning is a branch of AI that helps create a model with decision-making capabilities based on a large set of user-defined data. In this paper, the authors found a way to use machine learning in the jewellery industry to estimate this crucial gross loss. Using a small data set of manufactured rings and regression analysis, they found that ML algorithms trained on historic data and attributes collected from the CAD file during the design phase have the potential to reduce the estimation error from ±2-3 to ±0.5. To evaluate the approach's viability, additional study must be undertaken with a larger data set.
How Bayesian additive regression trees (BART) are used, part 2 (Machine Learning)
Abstract: Methods utilizing instrumental variables have been a fundamental statistical approach to estimation in the presence of unmeasured confounding, usually occurring in non-randomized observational data common to fields such as economics and public health. However, such methods usually make constricting linearity and additivity assumptions that are inapplicable to the complex modeling challenges of today. The growing body of observational data being collected will necessitate flexible regression modeling while also being able to control for confounding using instrumental variables. Therefore, this article presents a nonlinear instrumental variable regression model based on Bayesian regression tree ensembles to estimate such relationships, including interactions, in the presence of confounding. One exciting application of this method is to use genetic variants as instruments, known as Mendelian randomization.
How Bayesian additive regression trees (BART) are used, part 3 (Machine Learning)
Abstract: Ensemble methods for regression have been very successful in obtaining high-accuracy predictions. Examples are Bagging, Random forest, Boosting, BART (Bayesian additive regression trees), and their variants. In this paper, we propose a new perspective named variable grouping to enhance the predictive performance. The main idea is to seek a potential grouping of variables such that there is no nonlinear interaction term between variables of different groups. Given a sum-of-learner model, each learner will only be responsible for one group of variables, which would be more efficient in modeling nonlinear interactions.
Isotonic Recalibration under a Low Signal-to-Noise Ratio
Wüthrich, Mario V., Ziegel, Johanna
There are two seemingly unrelated problems in insurance pricing that we are going to tackle in this paper. First, an insurance pricing system should not have any systematic cross-financing between different price cohorts. Systematic cross-financing implicitly means that some parts of the portfolio are under-priced, and this is compensated by other parts of the portfolio that are over-priced. We can prevent systematic cross-financing between price cohorts by ensuring that the pricing system is auto-calibrated. We propose to apply isotonic recalibration which turns any regression function into an auto-calibrated pricing system.
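The recalibration step can be sketched with the pool-adjacent-violators algorithm: order the observations by the first-stage prediction, then fit a nondecreasing step function to the observed responses. The function names and the toy numbers below are assumptions, and ties among predictions are not treated specially.

```python
def pool_adjacent_violators(values):
    # Nondecreasing least-squares fit to a sequence: whenever a block mean
    # drops below the previous block mean, merge the two blocks.
    blocks = []  # each block is [mean, size]
    for value in values:
        blocks.append([float(value), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, size_b = blocks.pop()
            mean_a, size_a = blocks.pop()
            size = size_a + size_b
            blocks.append([(mean_a * size_a + mean_b * size_b) / size, size])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted

def isotonic_recalibrate(predictions, responses):
    # Rank observations by the first-stage prediction, fit the responses
    # monotonically in that order, and map the fit back to the input order.
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    fitted_sorted = pool_adjacent_violators([responses[i] for i in order])
    recalibrated = [0.0] * len(predictions)
    for rank, i in enumerate(order):
        recalibrated[i] = fitted_sorted[rank]
    return recalibrated

# Toy example (assumption): first-stage prices and observed claims.
predictions = [10.0, 30.0, 20.0, 40.0]
responses = [12.0, 18.0, 25.0, 41.0]
recalibrated = isotonic_recalibrate(predictions, responses)
print(recalibrated)  # [12.0, 21.5, 21.5, 41.0]
```

Within each group of policies that ends up sharing a recalibrated price, that price equals the average observed response of the group, which is the in-sample version of the auto-calibration property.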