rforest
Time-Series Forecasting: Unleashing Long-Term Dependencies with Fractionally Differenced Data
Maitra, Sarit, Mishra, Vivek, Dwivedi, Srashti, Kundu, Sukanya, Kundu, Goutam Kumar
This study introduces a novel forecasting strategy that leverages fractional differencing (FD) to capture both short- and long-term dependencies in time series data. Unlike traditional integer differencing, FD preserves memory in a series while stabilizing it for modeling purposes. By applying FD to financial data from the SPY index and incorporating sentiment analysis from news reports, this empirical analysis explores the effectiveness of FD in conjunction with binary classification of target variables. Supervised classification algorithms were employed to validate the performance of the FD series. The results demonstrate the superiority of FD over integer differencing, as confirmed by Receiver Operating Characteristic/Area Under the Curve (ROC AUC) and Matthews Correlation Coefficient (MCC) evaluations.
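As a rough illustration of the fractional differencing idea described in this abstract, the sketch below computes the binomial-expansion weights of (1 - B)^d and applies them over a fixed window. It is a minimal sketch, not the authors' implementation: the function names `frac_diff_weights` and `frac_diff`, the fixed-window truncation, and the toy input are all assumptions made for illustration.

```python
def frac_diff_weights(d, n_weights):
    """Return the first n_weights coefficients of (1 - B)^d.

    Uses the standard recursion w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    (Hypothetical helper name; not taken from the paper.)
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, window):
    """Fixed-window fractional difference of order d (illustrative assumption).

    Each output value is a weighted sum of the current observation and the
    previous window - 1 observations, so the first window - 1 points are dropped.
    """
    w = frac_diff_weights(d, window)
    out = []
    for t in range(window - 1, len(series)):
        # w[0] multiplies the current value, w[k] the value k steps back
        out.append(sum(w[k] * series[t - k] for k in range(window)))
    return out


if __name__ == "__main__":
    # With d = 1 the weights collapse to ordinary first differencing.
    print(frac_diff_weights(1, 4))
    # With 0 < d < 1 the weights decay slowly, so older values still contribute.
    print(frac_diff_weights(0.5, 4))
    print(frac_diff([1.0, 2.0, 4.0, 7.0], 1, 2))
```

With d = 1 only the first two weights are non-zero, which is the integer differencing the abstract contrasts against; for fractional d the slowly decaying tail of weights is what retains memory of older observations.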
Use H2O and data.table to build models on large data sets in R
Last week, I wrote an introductory article on the package data.table. It was intended to give you a head start and help you become familiar with its unique and short syntax. The next obvious step is to focus on modeling, which we will do in this post today. At least, I used to think of myself as a crippled R user when faced with large data sets. I would like to thank Matt Dowle again for this accomplishment. Algorithms like random forest (ntrees = 1000) take forever to run on my data set with 800,000 rows. I'm sure there are many R users who are trapped in a similar situation. To overcome this painstaking hurdle, I decided to write this post which demonstrates using the two most powerful packages i.e. For practical understanding, I've taken the data set from a practice problem and tried to improve the score using 4 different machine learning algorithms (with H2O) and feature engineering (with data.table).
Transformation Forests
Hothorn, Torsten, Zeileis, Achim
Regression models for supervised learning problems with a continuous target are commonly understood as models for the conditional mean of the target given predictors. This notion is simple and therefore appealing for interpretation and visualisation. Information about the whole underlying conditional distribution is, however, not available from these models. A more general understanding of regression models as models for conditional distributions allows much broader inference from such models, for example the computation of prediction intervals. Several random forest-type algorithms aim at estimating conditional distributions, most prominently quantile regression forests (Meinshausen, 2006, JMLR). We propose a novel approach based on a parametric family of distributions characterised by their transformation function. A dedicated novel "transformation tree" algorithm able to detect distributional changes is developed. Based on these transformation trees, we introduce "transformation forests" as an adaptive local likelihood estimator of conditional distribution functions. The resulting models are fully parametric yet very general and allow broad inference procedures, such as the model-based bootstrap, to be applied in a straightforward way.
Use H2O and data.table to build models on large data sets in R
Last week, I wrote an introductory article on the package data.table. It was intended to give you a head start and help you become familiar with its unique and short syntax. The next obvious step is to focus on modeling, which we will do in this post today. At least, I used to think of myself as a crippled R user when faced with large data sets. I would like to thank Matt Dowle again for this accomplishment. Algorithms like random forest (ntrees = 1000) take forever to run on my data set with 800,000 rows. I'm sure there are many R users who are trapped in a similar situation. To overcome this painstaking hurdle, I decided to write this post which demonstrates using the two most powerful packages i.e. For practical understanding, I've taken the data set from a previously held competition and tried to improve the score using 4 different machine learning algorithms (with H2O) and feature engineering (with data.table).