Regression
A Meta-Analysis of Solar Forecasting Based on Skill Score
Nguyen, Thi Ngoc, Müsgens, Felix
We conduct the first comprehensive meta-analysis of deterministic solar forecasting based on skill score, screening 1,447 papers from Google Scholar and reviewing the full texts of 320 papers for data extraction. A database of 4,687 points was built and analyzed with multivariate adaptive regression spline modelling, partial dependence plots, and linear regression. The marginal impacts on skill score of ten factors were quantified. The analysis shows the non-linearity and complex interaction between variables in the database. Forecast horizon has a central impact and dominates other factors' impacts. Therefore, the analysis of solar forecasts should be done separately for each horizon. Climate zone variables have statistically significant correlation with skill score. Regarding inputs, historical data and spatial temporal information are highly helpful. For intra-day, sky and satellite images show the most importance. For day-ahead, numerical weather predictions and locally measured meteorological data are very efficient. All forecast models were compared. Ensemble-hybrid models achieve the most accurate forecasts for all horizons. Hybrid models show superiority for intra-hour while image-based methods are the most efficient for intra-day forecasts. More training data can enhance skill score. However, over-fitting is observed when there is too much training data (longer than 2000 days). There has been a substantial improvement in solar forecast accuracy, especially in recent years. More improvement is observed for intra-hour and intra-day than day-ahead forecasts. By controlling for the key differences between forecasts, including location variables, our findings can be applied globally.
On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions
Beaglehole, Daniel, Belkin, Mikhail, Pandit, Parthe
``Benign overfitting'', the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest. We show, using a fixed design setup, that an important class of predictors, kernel machines with translation-invariant kernels, does not exhibit benign overfitting in fixed dimensions. In particular, the estimated predictor does not converge to the ground truth with increasing sample size, for any non-zero regression function and any (even adaptive) bandwidth selection. To prove these results, we give exact expressions for the generalization error, and its decomposition in terms of an approximation error and an estimation error that elicits a trade-off based on the selection of the kernel bandwidth. Our results apply to commonly used translation-invariant kernels such as Gaussian, Laplace, and Cauchy.
Generative modeling for time series via Schr{\"o}dinger bridge
Hamdouche, Mohamed, Henry-Labordere, Pierre, Pham, Huyên
We propose a novel generative model for time series based on Schr{\"o}dinger bridge (SB) approach. This consists in the entropic interpolation via optimal transport between a reference probability measure on path space and a target measure consistent with the joint data distribution of the time series. The solution is characterized by a stochastic differential equation on finite horizon with a path-dependent drift function, hence respecting the temporal dynamics of the time series distribution. We can estimate the drift function from data samples either by kernel regression methods or with LSTM neural networks, and the simulation of the SB diffusion yields new synthetic data samples of the time series. The performance of our generative model is evaluated through a series of numerical experiments. First, we test with a toy autoregressive model, a GARCH Model, and the example of fractional Brownian motion, and measure the accuracy of our algorithm with marginal and temporal dependencies metrics. Next, we use our SB generated synthetic samples for the application to deep hedging on real-data sets. Finally, we illustrate the SB approach for generating sequence of images.
Electricity Demand Forecasting with Hybrid Statistical and Machine Learning Algorithms: Case Study of Ukraine
Grandon, Tatiana Gonzalez, Schwenzer, Johannes, Steens, Thomas, Breuing, Julia
This article presents a novel hybrid approach using statistics and machine learning to forecast the national demand of electricity. As investment and operation of future energy systems require long-term electricity demand forecasts with hourly resolution, our mathematical model fills a gap in energy forecasting. The proposed methodology was constructed using hourly data from Ukraine's electricity consumption ranging from 2013 to 2020. To this end, we analysed the underlying structure of the hourly, daily and yearly time series of electricity consumption. The long-term yearly trend is evaluated using macroeconomic regression analysis. The mid-term model integrates temperature and calendar regressors to describe the underlying structure, and combines ARIMA and LSTM ``black-box'' pattern-based approaches to describe the error term. The short-term model captures the hourly seasonality through calendar regressors and multiple ARMA models for the residual. Results show that the best forecasting model is composed by combining multiple regression models and a LSTM hybrid model for residual prediction. Our hybrid model is very effective at forecasting long-term electricity consumption on an hourly resolution. In two years of out-of-sample forecasts with 17520 timesteps, it is shown to be within 96.83 \% accuracy.
A Predictive Model using Machine Learning Algorithm in Identifying Students Probability on Passing Semestral Course
This study aims to determine a predictive model to learn students probability to pass their courses taken at the earliest stage of the semester. To successfully discover a good predictive model with high acceptability, accurate, and precision rate which delivers a useful outcome for decision making in education systems, in improving the processes of conveying knowledge and uplifting students academic performance, the proponent applies and strictly followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. This study employs classification for data mining techniques, and decision tree for algorithm. With the utilization of the newly discovered predictive model, the prediction of students probabilities to pass the current courses they take gives 0.7619 accuracy, 0.8333 precision, 0.8823 recall, and 0.8571 f1 score, which shows that the model used in the prediction is reliable, accurate, and recommendable. Considering the indicators and the results, it can be noted that the prediction model used in this study is highly acceptable. The data mining techniques provides effective and efficient innovative tools in analyzing and predicting student performances. The model used in this study will greatly affect the way educators understand and identify the weakness of their students in the class, the way they improved the effectiveness of their learning processes gearing to their students, bring down academic failure rates, and help institution administrators modify their learning system outcomes. Further study for the inclusion of some students demographic information, vast amount of data within the dataset, automated and manual process of predictive criteria indicators where the students can regulate to which criteria, they must improve more for them to pass their courses taken at the end of the semester as early as midterm period are highly needed.
Regression-based Deep-Learning predicts molecular biomarkers from pathology slides
Nahhas, Omar S. M. El, Loeffler, Chiara M. L., Carrero, Zunamys I., van Treeck, Marko, Kolbinger, Fiona R., Hewitt, Katherine J., Muti, Hannah S., Graziani, Mara, Zeng, Qinghe, Calderaro, Julien, Ortiz-Brüchle, Nadina, Yuan, Tanwei, Hoffmeister, Michael, Brenner, Hermann, Brobeil, Alexander, Reis-Filho, Jorge S., Kather, Jakob Nikolas
Deep Learning (DL) can predict biomarkers from cancer histopathology. Several clinically approved applications use this technology. Most approaches, however, predict categorical labels, whereas biomarkers are often continuous measurements. We hypothesized that regression-based DL outperforms classification-based DL. Therefore, we developed and evaluated a new self-supervised attention-based weakly supervised regression method that predicts continuous biomarkers directly from images in 11,671 patients across nine cancer types. We tested our method for multiple clinically and biologically relevant biomarkers: homologous repair deficiency (HRD) score, a clinically used pan-cancer biomarker, as well as markers of key biological processes in the tumor microenvironment. Using regression significantly enhances the accuracy of biomarker prediction, while also improving the interpretability of the results over classification. In a large cohort of colorectal cancer patients, regression-based prediction scores provide a higher prognostic value than classification-based scores. Our open-source regression approach offers a promising alternative for continuous biomarker analysis in computational pathology.
XGBoost in R: A Step-by-Step Example
Boosting is a technique in machine learning that has been shown to produce models with high predictive accuracy. One of the most common ways to implement boosting in practice is to use XGBoost, short for "extreme gradient boosting." This tutorial provides a step-by-step example of how to use XGBoost to fit a boosted model in R. For this example we'll fit a boosted regression model to the Boston dataset from the MASS package. This dataset contains 13 predictor variables that we'll use to predict one response variable called mdev, which represents the median value of homes in different census tracts around Boston. We can see that the dataset contains 506 observations and 14 total variables.
LADER: Log-Augmented DEnse Retrieval for Biomedical Literature Search
Jin, Qiao, Shin, Andrew, Lu, Zhiyong
Queries with similar information needs tend to have similar document clicks, especially in biomedical literature search engines where queries are generally short and top documents account for most of the total clicks. Motivated by this, we present a novel architecture for biomedical literature search, namely Log-Augmented DEnse Retrieval (LADER), which is a simple plug-in module that augments a dense retriever with the click logs retrieved from similar training queries. Specifically, LADER finds both similar documents and queries to the given query by a dense retriever. Then, LADER scores relevant (clicked) documents of similar queries weighted by their similarity to the input query. The final document scores by LADER are the average of (1) the document similarity scores from the dense retriever and (2) the aggregated document scores from the click logs of similar queries. Despite its simplicity, LADER achieves new state-of-the-art (SOTA) performance on TripClick, a recently released benchmark for biomedical literature retrieval. On the frequent (HEAD) queries, LADER largely outperforms the best retrieval model by 39% relative NDCG@10 (0.338 v.s. 0.243). LADER also achieves better performance on the less frequent (TORSO) queries with 11% relative NDCG@10 improvement over the previous SOTA (0.303 v.s. 0.272). On the rare (TAIL) queries where similar queries are scarce, LADER still compares favorably to the previous SOTA method (NDCG@10: 0.310 v.s. 0.295). On all queries, LADER can improve the performance of a dense retriever by 24%-37% relative NDCG@10 while not requiring additional training, and further performance improvement is expected from more logs. Our regression analysis has shown that queries that are more frequent, have higher entropy of query similarity and lower entropy of document similarity, tend to benefit more from log augmentation.
Machine Learning in Three Steps: How to Efficiently Learn It
I have observed two extreme approaches when it comes to aspiring data scientists attempting to learn machine learning algorithms. The first approach involves learning all the intricacies of the algorithms and implementing them from scratch to gain true mastery. The second approach, on the other hand, assumes that the computer will "learn" on its own, rendering the need for the individual to learn the algorithms unnecessary. This leads some to only rely on tools such as the package lazypredict. It is realistic to take an approach between the two extremes when learning machine learning algorithms. However, the question remains, where to start? In this article, I will categorize machine learning algorithms into three categories and provide my humble opinion on what to begin with and what can be skipped. Starting out in machine learning can be overwhelming due to the multitude of available algorithms. Linear regression, support vector machines (SVM), gradient descent, gradient boosting, decision trees, LASSO, ridge, grid search, and many more are some of the algorithms that come to mind when posed with the question.