Goto

Collaborating Authors

 Regression


TSVR+: Twin support vector regression with privileged information

arXiv.org Artificial Intelligence

In the realm of machine learning, the data may contain additional attributes, known as privileged information (PI). The main purpose of PI is to assist in the training of the model and then utilize the acquired knowledge to make predictions for unseen samples. Support vector regression (SVR) is an effective regression model, however, it has a low learning speed due to solving a convex quadratic problem (QP) subject to a pair of constraints. In contrast, twin support vector regression (TSVR) is more efficient than SVR as it solves two QPs each subject to one set of constraints. However, TSVR and its variants are trained only on regular features and do not use privileged features for training. To fill this gap, we introduce a fusion of TSVR with learning using privileged information (LUPI) and propose a novel approach called twin support vector regression with privileged information (TSVR+). The regularization terms in the proposed TSVR+ capture the essence of statistical learning theory and implement the structural risk minimization principle. We use the successive overrelaxation (SOR) technique to solve the optimization problem of the proposed TSVR+, which enhances the training efficiency. As far as our knowledge extends, the integration of the LUPI concept into twin variants of regression models is a novel advancement. The numerical experiments conducted on UCI, stock and time series data collectively demonstrate the superiority of the proposed model.


Estimation of Physical Parameters of Waveforms With Neural Networks

arXiv.org Artificial Intelligence

Light Detection and Ranging (LiDAR) are fast emerging sensors in the field of Earth Observation. It is a remote sensing technology that utilizes laser beams to measure distances and create detailed three-dimensional representations of objects and environments. The potential of Full Waveform LiDAR is much greater than just height estimation and 3D reconstruction only. Overall shape of signal provides important information about properties of water body. However, the shape of FWL is unexplored as most LiDAR software work on point cloud by utilizing the maximum value within the waveform. Existing techniques in the field of LiDAR data analysis include depth estimation through inverse modeling and regression of logarithmic intensity and depth for approximating the attenuation coefficient. However, these methods suffer from limitations in accuracy. Depth estimation through inverse modeling provides only approximate values and does not account for variations in surface properties, while the regression approach for the attenuation coefficient is only able to generalize a value through several data points which lacks precision and may lead to significant errors in estimation. Additionally, there is currently no established modeling method available for predicting bottom reflectance. This research proposed a novel solution based on neural networks for parameter estimation in LIDAR data analysis. By leveraging the power of neural networks, the proposed solution successfully learned the inversion model, was able to do prediction of parameters such as depth, attenuation coefficient, and bottom reflectance. Performance of model was validated by testing it on real LiDAR data. In future, more data availability would enable more accuracy and reliability of such models.


On minimizing the training set fill distance in machine learning regression

arXiv.org Artificial Intelligence

However, using large datasets may not be feasible due to computational limitations or high data labelling costs. Therefore, suitably selecting small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional to the location of the unlabelled data points, that linearly depends on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with the FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.


Bayesian Nonlinear Regression using Sums of Simple Functions

arXiv.org Machine Learning

This paper proposes a new Bayesian machine learning model that can be applied to large datasets arising in macroeconomics. Our framework sums over many simple two-component location mixtures. The transition between components is determined by a logistic function that depends on a single threshold variable and two hyperparameters. Each of these individual models only accounts for a minor portion of the variation in the endogenous variables. But many of them are capable of capturing arbitrary nonlinear conditional mean relations. Conjugate priors enable fast and efficient inference. In simulations, we show that our approach produces accurate point and density forecasts. In a real-data exercise, we forecast US macroeconomic aggregates and consider the nonlinear effects of financial shocks in a large-scale nonlinear VAR.


Intrusion Detection System with Machine Learning and Multiple Datasets

arXiv.org Artificial Intelligence

As Artificial Intelligence (AI) technologies continue to gain traction in the modern-day world, they ultimately pose an immediate threat to current cybersecurity systems via exploitative methods. Prompt engineering is a relatively new field that explores various prompt designs that can hijack large language models (LLMs). If used by an unethical attacker, it can enable an AI system to offer malicious insights and code to them. In this paper, an enhanced intrusion detection system (IDS) that utilizes machine learning (ML) and hyperparameter tuning is explored, which can improve a model's performance in terms of accuracy and efficacy. Ultimately, this improved system can be used to combat the attacks made by unethical hackers. A standard IDS is solely configured with pre-configured rules and patterns; however, with the utilization of machine learning, implicit and different patterns can be generated through the models' hyperparameter settings and parameters. In addition, the IDS will be equipped with multiple datasets so that the accuracy of the models improves. We evaluate the performance of multiple ML models and their respective hyperparameter settings through various metrics to compare their results to other models and past research work. The results of the proposed multi-dataset integration method yielded an accuracy score of 99.9% when equipped with the XGBoost and random forest classifiers and RandomizedSearchCV hyperparameter technique.


Scaling Laws in Jet Classification

arXiv.org Artificial Intelligence

We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.


Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction

arXiv.org Artificial Intelligence

The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K- Nearest Neighbors, and Random Forest) to predict crop yields in 196 countries. Given 4 key training parameters, pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r^2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.


Distributed Continual Learning with CoCoA in High-dimensional Linear Regression

arXiv.org Artificial Intelligence

We consider estimation under scenarios where the signals of interest exhibit change of characteristics over time. In particular, we consider the continual learning problem where different tasks, e.g., data with different distributions, arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the problem from a distributed estimation perspective. We consider the well-established distributed learning algorithm COCOA, which distributes the model parameters and the corresponding features over the network. We provide exact analytical characterization for the generalization error of COCOA under continual learning for linear regression in a range of scenarios, where overparameterization is of particular interest. These analytical results characterize how the generalization error depends on the network structure, the task similarity and the number of tasks, and show how these dependencies are intertwined. In particular, our results show that the generalization error can be significantly reduced by adjusting the network size, where the most favorable network size depends on task similarity and the number of tasks. We present numerical results verifying the theoretical analysis and illustrate the continual learning performance of COCOA with a digit classification task.


BatteryML:An Open-source platform for Machine Learning on Battery Degradation

arXiv.org Artificial Intelligence

Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompass, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research.The code for our project is publicly available on GitHub at https://github.com/microsoft/BatteryML.


Accuracy Improvement in Differentially Private Logistic Regression: A Pre-training Approach

arXiv.org Artificial Intelligence

Machine learning (ML) models can memorize training datasets. As a result, training ML models over private datasets can lead to the violation of individuals' privacy. Differential privacy (DP) is a rigorous privacy notion to preserve the privacy of underlying training datasets. Yet, training ML models in a DP framework usually degrades the accuracy of ML models. This paper aims to boost the accuracy of a DP logistic regression (LR) via a pre-training module. In more detail, we initially pre-train our LR model on a public training dataset that there is no privacy concern about it. Then, we fine-tune our DP-LR model with the private dataset. In the numerical results, we show that adding a pre-training module significantly improves the accuracy of the DP-LR model.