Explanatory Variable


A Granular Framework for Construction Material Price Forecasting: Econometric and Machine-Learning Approaches

Lyu, Boge, Yin, Qianye, Tommelein, Iris Denise, Liu, Hanyang, Ranka, Karnamohit, Yeluripati, Karthik, Shi, Junzhe

arXiv.org Artificial Intelligence

This study develops a forecasting framework that leverages the Construction Specifications Institute (CSI) MasterFormat as the target data structure, enabling predictions at the six-digit section level and supporting detailed cost projections across a wide spectrum of building materials. To enhance predictive accuracy, the framework integrates explanatory variables such as raw material prices, commodity indexes, and macroeconomic indicators. Four time-series models, Long Short-Term Memory (LSTM), Autoregressive Integrated Moving Average (ARIMA), Vector Error Correction Model (VECM), and Chronos-Bolt, were evaluated under both baseline configurations (using CSI data only) and extended versions with explanatory variables. Results demonstrate that incorporating explanatory variables significantly improves predictive performance across all models. Among the tested approaches, the LSTM model consistently achieved the highest accuracy, with RMSE values as low as 1.390 and MAPE values of 0.957, representing improvements of up to 59% over the traditional statistical time-series model, ARIMA. Validation across multiple CSI divisions confirmed the framework's scalability, while Division 06 (Wood, Plastics, and Composites) is presented in detail as a demonstration case. This research offers a robust methodology that enables owners and contractors to improve budgeting practices and achieve more reliable cost estimation at the Definitive level.

INTRODUCTION

1.1 Motivation

The construction industry continues to demonstrate steady long-term growth, with global activity projected to reach US$9.8 trillion by 2026 [1]. Major upcoming programs in the United States, such as the Los Angeles 2028 Olympics and TSMC's fabrication facility in Arizona [2][3], highlight the scale of high-value projects in the near future. However, volatility in construction material prices has emerged as a critical challenge, creating significant uncertainty for contractors in project planning, budgeting, and cost management. Price fluctuations, driven by raw material costs, macroeconomic conditions such as inflation and interest rates, and supply-demand imbalances, have amplified the risks of cost overruns and delays [4][5][6][7][8]. Traditional econometric methods (i.e., multiple regression analysis) and modern econometric methods (i.e., univariate and multivariate time-series methods) have faced limitations in effectively capturing the high-frequency volatility observed in construction material prices [9]. These models often struggle to handle the complexity of input data and exhibit limited predictive accuracy in real-world applications.
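To make the baseline-versus-extended comparison concrete, the sketch below fits a univariate ARIMA and an ARIMAX with one explanatory variable, using synthetic stand-ins for a raw-material series and a CSI section index; it is not the authors' dataset, nor their LSTM, VECM, or Chronos-Bolt models.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n = 120                                                          # e.g. ten years of monthly indexes
lumber = pd.Series(np.cumsum(rng.normal(0.2, 1.0, n)))           # raw-material price proxy (synthetic)
csi_06 = 100 + 0.8 * lumber + pd.Series(rng.normal(0, 0.5, n))   # target section index (synthetic)

y_train, y_test = csi_06.iloc[:108], csi_06.iloc[108:]
x_train, x_test = lumber.iloc[:108], lumber.iloc[108:]

# Baseline configuration: the CSI series alone.
base = ARIMA(y_train, order=(1, 1, 1)).fit()
base_fc = base.forecast(steps=12)

# Extended configuration: the same model plus an explanatory variable (ARIMAX).
ext = ARIMA(y_train, exog=x_train, order=(1, 1, 1)).fit()
ext_fc = ext.forecast(steps=12, exog=x_test)

rmse = lambda y, yhat: float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))
print("RMSE, CSI data only:        ", rmse(y_test, base_fc))
print("RMSE, with explanatory var.:", rmse(y_test, ext_fc))
```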


An Infinite BART model

Battiston, Marco, Luo, Yu

arXiv.org Machine Learning

Bayesian additive regression trees (BART) are popular Bayesian ensemble models used in regression and classification analysis. Under this modeling framework, the regression function is approximated by an ensemble of decision trees, interpreted as weak learners that capture different features of the data. In this work, we propose a generalization of the BART model that has two main features: first, it automatically selects the number of decision trees using the given data; second, the model allows clusters of observations to have different regression functions since each data point can only use a selection of weak learners, instead of all of them. This model generalization is accomplished by including a binary weight matrix in the conditional distribution of the response variable, which activates only a specific subset of decision trees for each observation. Such a matrix is endowed with an Indian Buffet process prior, and sampled within the MCMC sampler, together with the other BART parameters. We then compare the Infinite BART model with the classic one on simulated and real datasets. Specifically, we provide examples illustrating variable importance, partial dependence and causal estimation.
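As a concrete illustration of the activation mechanism, the sketch below applies a binary weight matrix Z, drawn here from a finite approximation to the Indian Buffet process, to a toy ensemble of weak learners. The actual model samples Z and the trees jointly within the MCMC sampler, which is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 4                        # observations, currently active trees

# Stand-ins for fitted weak learners: each "tree" maps x to a scalar.
trees = [lambda x, a=a: a * np.sin(x) for a in rng.normal(0, 1, T)]

# Finite approximation to the IBP prior: tree t is active for an
# observation with probability pi_t.
alpha = 1.0
pi = rng.beta(alpha / T, 1.0, T)
Z = (rng.uniform(size=(n, T)) < pi).astype(int)

x = rng.uniform(0, 3, n)
# Classic BART sums over all trees; Infinite BART sums only the trees that
# Z activates for each observation, so clusters of observations can follow
# different regression functions.
f_hat = np.array([sum(Z[i, t] * trees[t](x[i]) for t in range(T)) for i in range(n)])
print(f_hat)
```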



Feature-free regression kriging

Luo, Peng, Wu, Yilong, Song, Yongze

arXiv.org Machine Learning

Spatial interpolation is a crucial task in geography. As perhaps the most widely used interpolation methods, geostatistical models -- such as Ordinary Kriging (OK) -- assume spatial stationarity, which makes it difficult to capture the nonstationary characteristics of geographic variables. A common solution is trend surface modeling (e.g., Regression Kriging, RK), which relies on external explanatory variables to model the trend and then applies geostatistical interpolation to the residuals. However, this approach requires high-quality and readily available explanatory variables, which are often lacking in many spatial interpolation scenarios -- such as estimating heavy metal concentrations underground. This study proposes a Feature-Free Regression Kriging (FFRK) method, which automatically extracts geospatial features -- including local dependence, local heterogeneity, and geosimilarity -- to construct a regression-based trend surface without requiring external explanatory variables. We conducted experiments on the spatial distribution prediction of three heavy metals in a mining area in Australia. In comparison with 17 classical interpolation methods, the results indicate that FFRK, which does not incorporate any explanatory variables and relies solely on extracted geospatial features, consistently outperforms both conventional Kriging techniques and machine learning models that depend on explanatory variables. This approach effectively addresses spatial nonstationarity while reducing the cost of acquiring explanatory variables, improving both prediction accuracy and generalization ability. This finding suggests that an accurate characterization of geospatial features based on domain knowledge can significantly enhance spatial prediction performance -- potentially yielding greater improvements than merely adopting more advanced statistical models.
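The decomposition FFRK builds on, a trend surface from extracted geospatial features plus interpolation of the residuals, can be sketched as follows. The "local dependence" feature and the use of inverse-distance weighting in place of a fitted variogram and kriging system are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.uniform(0, 10, (50, 2))                         # sampled locations
z = np.sin(pts[:, 0]) + 0.1 * pts[:, 1] + rng.normal(0, 0.05, 50)

def idw(query, pts, vals, p=2):
    # Inverse-distance weighted average, standing in for the kriging step.
    d = np.linalg.norm(pts - query, axis=1) + 1e-9
    w = d ** -p
    return float(w @ vals / w.sum())

def local_dependence(i):
    # Extracted feature: distance-weighted mean of the *other* samples,
    # a stand-in for the local-dependence feature described in the paper.
    mask = np.arange(len(pts)) != i
    return idw(pts[i], pts[mask], z[mask])

# Regression-based trend surface from the extracted feature alone,
# with no external explanatory variables.
X = np.array([[1.0, local_dependence(i)] for i in range(len(pts))])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ beta

query = np.array([5.0, 5.0])
trend = beta[0] + beta[1] * idw(query, pts, z)
pred = trend + idw(query, pts, resid)                     # trend + interpolated residual
print(pred)
```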


Explainable Multimodal Machine Learning for Revealing Structure-Property Relationships in Carbon Nanotube Fibers

Kimura, Daisuke, Tajima, Naoko, Okazaki, Toshiya, Muroga, Shun

arXiv.org Artificial Intelligence

In this study, we propose Explainable Multimodal Machine Learning (EMML), which integrates the analysis of diverse data types (multimodal data) using factor analysis for feature extraction with Explainable AI (XAI), for carbon nanotube (CNT) fibers prepared from aqueous dispersions. This method is a powerful approach to elucidate the mechanisms governing material properties, where multi-stage fabrication conditions and multiscale structures have complex influences. Thus, in our case, this approach helps us understand how different processing steps and structures at various scales impact the final properties of CNT fibers. The analysis targeted structures ranging from the nanoscale to the macroscale, including aggregation size distributions of CNT dispersions and the effective length of CNTs. Furthermore, distribution data that were difficult to interpret using standard methods were analyzed using Non-negative Matrix Factorization (NMF) to extract the key features that determine the outcome. Contribution analysis with SHapley Additive exPlanations (SHAP) demonstrated that small, uniformly distributed aggregates are crucial for improving fracture strength, while CNTs with long effective lengths are significant factors for enhancing electrical conductivity. The analysis also identified thresholds and trends for these key factors to assist in defining the conditions needed to optimize CNT fiber properties. EMML is not limited to CNT fibers but can be applied to the design of other materials derived from nanomaterials, making it a useful tool for developing a wide range of advanced materials. This approach provides a foundation for advancing data-driven materials research.
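A minimal version of the NMF-plus-SHAP portion of the pipeline, run on synthetic histogram data rather than the CNT-fiber measurements, might look like this (assuming the shap and scikit-learn packages):

```python
import numpy as np
import shap
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
hist = rng.random((80, 40))                  # 80 samples x 40 histogram bins (synthetic)
strength = hist[:, :10].sum(axis=1) + rng.normal(0, 0.1, 80)  # toy target property

# Step 1: compress the hard-to-interpret distribution into a few NMF components.
nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(hist)                  # per-sample component weights

# Step 2: predict the property from the components and attribute it with SHAP.
model = RandomForestRegressor(random_state=0).fit(W, strength)
shap_values = shap.TreeExplainer(model).shap_values(W)
print("mean |SHAP| per NMF component:", np.abs(shap_values).mean(axis=0))
```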


Functional relevance based on the continuous Shapley value

Delicado, Pedro, Pachón-García, Cristian

arXiv.org Machine Learning

The presence of Artificial Intelligence (AI) in our society is increasing, which brings with it the need to understand the behaviour of AI mechanisms, including machine learning predictive algorithms fed with tabular data, text, or images, among other types of data. This work focuses on the interpretability of predictive models based on functional data. Designing interpretability methods for functional data models implies working with a set of features whose size is infinite. In the context of scalar-on-function regression, we propose an interpretability method based on the Shapley value for continuous games, a mathematical formulation that allows a global payoff to be fairly distributed among a continuous set of players. The method is illustrated through a set of experiments with simulated and real data sets. The open-source Python package ShapleyFDA is also presented.
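This is not the ShapleyFDA implementation, but the underlying idea can be approximated by discretizing the functional domain into a few intervals and treating each interval as a player in a finite game:

```python
import math
import numpy as np
from itertools import permutations

grid = np.linspace(0, 1, 20)
x = np.sin(2 * np.pi * grid)                    # one functional observation
baseline = np.zeros_like(x)                     # reference curve
predict = lambda f: float(np.mean(f * grid))    # toy scalar-on-function model

k = 4                                           # interval "players"
blocks = np.array_split(np.arange(len(grid)), k)

def value(coalition):
    # Payoff of a coalition: reveal its intervals of x, keep the baseline elsewhere.
    f = baseline.copy()
    for j in coalition:
        f[blocks[j]] = x[blocks[j]]
    return predict(f)

phi = np.zeros(k)
for perm in permutations(range(k)):             # exact: average over all k! orderings
    seen = []
    for j in perm:
        phi[j] += value(seen + [j]) - value(seen)
        seen.append(j)
phi /= math.factorial(k)
print("interval attributions:", phi.round(4))
print("efficiency check:", phi.sum(), "=", predict(x) - predict(baseline))
```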


A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data

Chakraborty, Saptarshi, Bartlett, Peter L.

arXiv.org Machine Learning

Federated Learning (FL) has emerged as a groundbreaking paradigm in collaborative machine learning, emphasizing decentralized model training to address data privacy concerns. While significant progress has been made in optimizing federated learning, the exploration of generalization error, particularly in heterogeneous settings, has been limited, focusing mainly on parametric cases. This paper investigates the generalization properties of deep federated regression within a two-stage sampling model. Our findings highlight that the intrinsic dimension, defined by the entropic dimension, is crucial for determining convergence rates when appropriate network sizes are used. Specifically, if the true relationship between response and explanatory variables is characterized by a $\beta$-H\"older function and there are $n$ independent and identically distributed (i.i.d.) samples from $m$ participating clients, the error rate for participating clients scales at most as $\tilde{O}\left((mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$, and for non-participating clients, it scales as $\tilde{O}\left(\Delta \cdot m^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))} + (mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))}\right)$. Here, $\bar{d}_{2\beta}(\lambda)$ represents the $2\beta$-entropic dimension of $\lambda$, the marginal distribution of the explanatory variables, and $\Delta$ characterizes the dependence between the sampling stages. Our results explicitly account for the "closeness" of clients, demonstrating that the convergence rates of deep federated learners depend on intrinsic rather than nominal high-dimensionality.
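The two-stage sampling model can be made concrete with a schematic federated round, here with linear least squares standing in for the deep regressors analyzed in the paper; the spread of the client parameters plays the role of the closeness term $\Delta$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d = 10, 50, 3                       # clients, samples per client, features

# Stage 1: draw client-specific parameters around a common truth
# (their spread is the analogue of the closeness term Delta).
theta_star = rng.normal(0, 1, d)
client_thetas = theta_star + rng.normal(0, 0.1, (m, d))

local_fits = []
for theta in client_thetas:
    X = rng.normal(0, 1, (n, d))          # Stage 2: n i.i.d. samples per client
    y = X @ theta + rng.normal(0, 0.1, n)
    local_fits.append(np.linalg.lstsq(X, y, rcond=None)[0])  # local training

global_theta = np.mean(local_fits, axis=0)    # one FedAvg-style aggregation step
print("error of federated estimate:", np.linalg.norm(global_theta - theta_star))
```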


Identifying Privacy Personas

Hrynenko, Olena, Cavallaro, Andrea

arXiv.org Artificial Intelligence

Privacy personas capture the differences in user segments with respect to one's knowledge, behavioural patterns, level of self-efficacy, and perception of the importance of privacy protection. Modelling these differences is essential for appropriately choosing personalised communication about privacy (e.g. to increase literacy) and for defining suitable choices for privacy enhancing technologies (PETs). While various privacy personas have been derived in the literature, they group together people who differ from each other in terms of important attributes such as perceived or desired level of control, and motivation to use PETs. To address this lack of granularity and comprehensiveness in describing personas, we propose eight personas that we derive by combining qualitative and quantitative analysis of the responses to an interactive educational questionnaire. We design an analysis pipeline that uses divisive hierarchical clustering and Boschloo's statistical test of homogeneity of proportions to ensure that the elicited clusters differ from each other based on a statistical measure. Additionally, we propose a new measure for calculating distances between questionnaire responses that accounts for the type of question (closed- vs open-ended) used to derive traits. We show that the proposed privacy personas statistically differ from each other. We statistically validate the proposed personas and also compare them with personas in the literature, showing that they provide a more granular and comprehensive understanding of user segments, which will make it possible to better assist users with their privacy needs.
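The homogeneity check at the core of the pipeline can be reproduced with SciPy's implementation of Boschloo's exact test; the 2x2 counts below are invented for illustration.

```python
import numpy as np
from scipy.stats import boschloo_exact

# Rows: cluster A, cluster B; columns: answered yes / no on one trait item.
table = np.array([[14, 3],
                  [6, 11]])
res = boschloo_exact(table, alternative="two-sided")
# A small p-value indicates the two clusters differ on this trait,
# i.e. the proportions are not homogeneous across clusters.
print(f"Boschloo p-value: {res.pvalue:.4f}")
```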


Reviews: Variable Importance Using Decision Trees

Neural Information Processing Systems

The article tackles the problem of variable importance in regression trees. The strategy is to select variables based on the impurity reduction they induce on the label Y. The main feature of this strategy is that the impurity reduction measure is based on the ordering of Y according to the ranking of the X variable under consideration; it therefore measures the relationship between Y and any variable more robustly than simple correlation would. The authors prove that this strategy is consistent (i.e. the true explanatory variables are selected) in a range of settings. This is then illustrated on a simulated example where the displayed results are essentially the ones one could have expected: the proposed procedure is able to account for monotone but nonlinear relationships between X and Y, so it yields better results than simple correlations.
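A minimal version of the rank-based score the review describes follows; the paper's exact estimator may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x_signal = rng.uniform(-2, 2, n)
x_noise = rng.uniform(-2, 2, n)
y = np.exp(x_signal) + rng.normal(0, 0.3, n)   # monotone but nonlinear in x_signal

def importance(x, y):
    # Sort Y by the ranking of X (only the ordering of X matters), then
    # take the best single-split variance (impurity) reduction.
    y_sorted = y[np.argsort(x)]
    best = 0.0
    for s in range(1, n):
        left, right = y_sorted[:s], y_sorted[s:]
        reduction = y.var() - (s * left.var() + (n - s) * right.var()) / n
        best = max(best, reduction)
    return best

print("signal:", importance(x_signal, y), "noise:", importance(x_noise, y))
```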


Branch and Bound to Assess Stability of Regression Coefficients in Uncertain Models

Knaeble, Brian, Hughes, R. Mitchell, Rudolph, George, Abramson, Mark A., Razo, Daniel

arXiv.org Artificial Intelligence

It can be difficult to interpret a coefficient of an uncertain model. A slope coefficient of a regression model may change as covariates are added or removed from the model. In the context of high-dimensional data, there are too many model extensions to check. However, as we show here, it is possible to efficiently search, with a branch and bound algorithm, for maximum and minimum values of that adjusted slope coefficient over a discrete space of regularized regression models. Here we introduce our algorithm, along with supporting mathematical results, an example application, and a link to our computer code, to help researchers summarize high-dimensional data and assess the stability of regression coefficients in uncertain models.
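On a problem small enough to enumerate, the quantity being bounded can be computed by brute force, as below; the paper's contribution is a branch and bound algorithm that prunes this search over a space of regularized regression models in high dimensions. The data here are synthetic.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, p = 100, 4
Z = rng.normal(0, 1, (n, p))                  # candidate covariates
x1 = Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(0, 1, n)
y = 1.0 * x1 + Z @ np.array([0.0, 0.8, -0.5, 0.3]) + rng.normal(0, 1, n)

def slope_on_x1(cols):
    # Adjusted slope of x1 when the covariates in `cols` are in the model.
    X = np.column_stack([np.ones(n), x1] + [Z[:, j] for j in cols])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

# Exhaustive search over all 2^p model extensions (feasible only for small p).
slopes = [slope_on_x1(cols)
          for r in range(p + 1)
          for cols in combinations(range(p), r)]
print(f"slope range over {len(slopes)} models: [{min(slopes):.3f}, {max(slopes):.3f}]")
```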