Regression
Gradient Coding through Iterative Block Leverage Score Sampling
Charalambides, Neophytos, Pilanci, Mert, Hero, Alfred
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods, known as gradient coding, to accelerate linear regression in the presence of failures in distributed computational networks, i.e., stragglers. We replicate the data across the distributed network to attain the approximation guarantees through the induced sampling distribution. The main contribution of this work is that it unifies randomized numerical linear algebra with approximate coded computing, while attaining an induced $\ell_2$-subspace embedding through uniform sampling. The transition to uniform sampling is done without applying a random projection, as in the case of the subsampled randomized Hadamard transform. Furthermore, by incorporating this technique into coded computing, our scheme is an iterative sketching approach to approximately solving linear regression. We also propose weighting when sketching takes place through sampling with replacement, for further compression.
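For orientation, the classical row-wise leverage score sampling sketch that this abstract generalizes can be written in a few lines. This is a minimal sketch of the standard technique only, not the authors' block or coded scheme; the function names, problem sizes, and the sample count `q` are illustrative assumptions.

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores: squared row norms of an orthonormal basis of range(A)."""
    Q, _ = np.linalg.qr(A)  # reduced QR; assumes A has full column rank
    return np.sum(Q**2, axis=1)

def leverage_sample_sketch(A, b, q, rng):
    """Sample q rows with replacement, proportionally to leverage, then rescale."""
    scores = leverage_scores(A)
    probs = scores / scores.sum()
    idx = rng.choice(A.shape[0], size=q, replace=True, p=probs)
    # Rescaling by 1/sqrt(q * p_i) makes (SA)^T (SA) an unbiased estimate of A^T A.
    scale = 1.0 / np.sqrt(q * probs[idx])
    return A[idx] * scale[:, None], b[idx] * scale

# Hypothetical synthetic regression problem (sizes chosen for illustration).
rng = np.random.default_rng(0)
N, d = 2000, 5
A = rng.standard_normal((N, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(N)

SA, Sb = leverage_sample_sketch(A, b, q=400, rng=rng)
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
x_sketch = np.linalg.lstsq(SA, Sb, rcond=None)[0]
rel_err = np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full)
```

Solving the smaller sketched problem in place of the full one is the basic trade-off: fewer rows to process, at the cost of an approximation error that shrinks as `q` grows.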
Data-Based Design of Multi-Model Inferential Sensors
Mojto, Martin, Lubušký, Karol, Fikar, Miroslav, Paulen, Radoslav
The nonlinear character of industrial processes is usually the main limitation to designing simple linear inferential sensors with sufficient accuracy. In order to increase the inferential sensor predictive performance and yet to maintain its linear structure, multi-model inferential sensors represent a straightforward option. In this contribution, we propose two novel approaches for the design of multi-model inferential sensors aiming to mitigate some drawbacks of the state-of-the-art approaches. For a demonstration of the developed techniques, we design inferential sensors for a Vacuum Gasoil Hydrogenation unit, which is a real-world petrochemical refinery unit. The performance of the multi-model inferential sensor is compared against various single-model inferential sensors and the current (referential) inferential sensor used in the refinery. The results show substantial improvements over the state-of-the-art design techniques for single-/multi-model inferential sensors.
Structured Low-Rank Tensors for Generalized Linear Models
Taki, Batoul, Sarwate, Anand D., Bajwa, Waheed U.
Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vector-based methods. This work investigates a new low-rank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model -- which generalizes the well-known Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model -- is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSR-structured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tucker-structured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples.
OBESEYE: Interpretable Diet Recommender for Obesity Management using Machine Learning and Explainable AI
Roy, Mrinmoy, Das, Srabonti, Protity, Anica Tasnim
Obesity, the leading cause of many non-communicable diseases, occurs mainly from eating more than the body requires and from a lack of proper activity. Being healthy therefore requires healthy diet plans, especially for patients with comorbidities. However, it is difficult to determine the exact quantity of each nutrient because nutrient requirements vary with physical and disease conditions. In our study, we propose a novel machine-learning-based system to predict the amount of nutrients an individual requires to be healthy. We applied different machine learning algorithms (linear regression, support vector machine (SVM), decision tree, random forest, XGBoost, and LightGBM) to predict consumption of fluid and three major macronutrients: carbohydrate, protein, and fat. We achieved high accuracy with low root mean square error (RMSE) by using linear regression for fluid prediction, random forest for carbohydrate prediction, and LightGBM for protein and fat prediction. We believe our diet recommender system, OBESEYE, is the only one of its kind that recommends diets with consideration of comorbidities and physical conditions and encourages people to get rid of obesity.
Adapting to Change: Robust Counterfactual Explanations in Dynamic Data Landscapes
Prenkaj, Bardh, Villaizan-Vallelado, Mario, Leemann, Tobias, Kasneci, Gjergji
We introduce a novel semi-supervised Graph Counterfactual Explainer (GCE) methodology, Dynamic GRAph Counterfactual Explainer (DyGRACE). It leverages initial knowledge about the data distribution to search for valid counterfactuals while avoiding using information from potentially outdated decision functions in subsequent time steps. Employing two graph autoencoders (GAEs), DyGRACE learns the representation of each class in a binary classification scenario. The GAEs minimise the reconstruction error between the original graph and its learned representation during training. The method involves (i) optimising a parametric density function (implemented as a logistic regression function) to identify counterfactuals by maximising the factual autoencoder's reconstruction error, (ii) minimising the counterfactual autoencoder's error, and (iii) maximising the similarity between the factual and counterfactual graphs. This semi-supervised approach is independent of an underlying black-box oracle. A logistic regression model is trained on a set of graph pairs to learn weights that aid in finding counterfactuals. At inference, for each unseen graph, the logistic regressor identifies the best counterfactual candidate using these learned weights, while the GAEs can be iteratively updated to represent the continual adaptation of the learned graph representation over iterations. DyGRACE is quite effective and can act as a drift detector, identifying distributional drift based on differences in reconstruction errors between iterations. It avoids reliance on the oracle's predictions in successive iterations, thereby increasing the efficiency of counterfactual discovery. DyGRACE, with its capacity for contrastive learning and drift detection, will offer new avenues for semi-supervised learning and explanation generation.
Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities
Yang, Hanchen, Li, Wengen, Wang, Shuyu, Li, Hui, Guan, Jihong, Zhou, Shuigeng, Cao, Jiannong
With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated and has unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists from applying advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.
Bringing Chemistry to Scale: Loss Weight Adjustment for Multivariate Regression in Deep Learning of Thermochemical Processes
Rohrhofer, Franz M., Posch, Stefan, Gößnitzer, Clemens, García-Oliver, José M., Geiger, Bernhard C.
Flamelet models are widely used in computational fluid dynamics to simulate thermochemical processes in turbulent combustion. These models typically employ memory-expensive lookup tables that are predetermined and represent the combustion process to be simulated. Artificial neural networks (ANNs) offer a deep learning approach that can store this tabular data using a small number of network weights, potentially reducing the memory demands of complex simulations by orders of magnitude. However, ANNs with standard training losses often struggle with underrepresented targets in multivariate regression tasks, e.g., when learning minor species mass fractions as part of lookup tables. This paper seeks to improve the accuracy of an ANN when learning multiple species mass fractions of a hydrogen (H$_2$) combustion lookup table. We assess a simple, yet effective loss weight adjustment that outperforms the standard mean-squared error optimization and enables accurate learning of all species mass fractions, even of minor species where the standard optimization completely fails. Furthermore, we find that the loss weight adjustment leads to more balanced gradients in the network training, which explains its effectiveness.
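To illustrate why a standard mean-squared error can neglect small-magnitude targets, the following sketch compares each target's share of the loss under uniform weights and under inverse-variance weights. The abstract does not specify the adjustment used in the paper, so the inverse-variance weighting, the target scales, and all names here are assumptions chosen purely for illustration.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Return the weighted multi-target MSE and the unweighted per-target MSEs."""
    per_target = np.mean((y_true - y_pred) ** 2, axis=0)
    return float(np.sum(weights * per_target)), per_target

rng = np.random.default_rng(0)
n = 1000
scales = np.array([1.0, 1e-3])  # hypothetical "major" and "minor" target scales
y_true = rng.standard_normal((n, 2)) * scales
y_pred = 0.9 * y_true  # a predictor with the same 10% relative error on both targets

uniform = np.ones(2)
inv_var = 1.0 / np.var(y_true, axis=0)    # assumed weighting, not the paper's scheme
inv_var = inv_var / inv_var.sum() * 2.0   # normalize so the weights sum to the target count

loss_u, per_target = weighted_mse(y_true, y_pred, uniform)
loss_w, _ = weighted_mse(y_true, y_pred, inv_var)

# Fraction of the total loss contributed by each target.
share_uniform = uniform * per_target / loss_u
share_weighted = inv_var * per_target / loss_w
```

With uniform weights the minor target contributes almost nothing to the loss even though its relative error is identical, whereas the reweighted loss splits roughly evenly between the two targets.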
Exact identification of nonlinear dynamical systems by Trimmed Lasso
Kiser, Shawn L., Guskov, Mikhail, Rébillat, Marc, Ranc, Nicolas
Identification of nonlinear dynamical systems has been popularized by sparse identification of the nonlinear dynamics (SINDy) via the sequentially thresholded least squares (STLS) algorithm. Many extensions of SINDy have emerged in the literature to deal with experimental data which are finite in length and noisy. Recently, the computationally intensive method of ensembling bootstrapped SINDy models (E-SINDy) was proposed for model identification, handling finite, highly noisy data. While the extensions of SINDy are numerous, their sparsity-promoting estimators occasionally provide sparse approximations of the dynamics as opposed to exact recovery. Furthermore, these estimators suffer under multicollinearity, e.g., the irrepresentable condition for the Lasso. In this paper, we demonstrate that the Trimmed Lasso for robust identification of models (TRIM) can provide exact recovery under more severe noise, finite data, and multicollinearity than E-SINDy. Additionally, the computational cost of TRIM is asymptotically equal to that of STLS since the sparsity parameter of TRIM can be solved efficiently by convex solvers. We compare these methodologies on challenging nonlinear systems, specifically the Lorenz 63 system, the Bouc-Wen oscillator from the nonlinear dynamics benchmark of Noël and Schoukens (2016), and a time delay system describing tool cutting dynamics. This study emphasizes the comparisons between STLS, reweighted $\ell_1$ minimization, and Trimmed Lasso in identification with respect to problems faced by practitioners: the problem of finite and noisy data, the performance of sparse regression when the library grows in dimension (multicollinearity), and automatic methods for choosing regularization parameters.
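As background for the baseline named in this abstract, here is a minimal sketch of sequentially thresholded least squares on a candidate-function library. It illustrates STLS only, not the Trimmed Lasso (TRIM) method; the toy linear system, the polynomial library, the threshold, and the noise level are assumptions made for the example.

```python
import numpy as np

def stls(Theta, dXdt, threshold, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):
            big = ~small[:, k]
            if big.any():
                # Refit only the surviving library terms for state k.
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# Hypothetical toy system: dx/dt = -0.1 x + 2 y, dy/dt = -2 x - 0.1 y.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
x, y = X[:, 0], X[:, 1]
dXdt = np.column_stack([-0.1 * x + 2.0 * y, -2.0 * x - 0.1 * y])
dXdt += 1e-3 * rng.standard_normal(dXdt.shape)  # small measurement noise

# Library columns: 1, x, y, x^2, x*y, y^2.
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

Xi_true = np.zeros((6, 2))
Xi_true[1] = [-0.1, -2.0]  # coefficients on x
Xi_true[2] = [2.0, -0.1]   # coefficients on y

Xi = stls(Theta, dXdt, threshold=0.05)
```

The threshold plays the role of the sparsity parameter: terms whose coefficients fall below it are removed from the model, which is also why STLS can discard genuinely small coefficients when noise is large.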
Distribution-Free Inference for the Regression Function of Binary Classification
Tamás, Ambrus, Csáji, Balázs Csanád
One of the key objects of binary classification is the regression function, i.e., the conditional expectation of the class labels given the inputs. The regression function not only defines a Bayes optimal classifier but also encodes the corresponding misclassification probabilities. The paper presents a resampling framework to construct exact, distribution-free and non-asymptotically guaranteed confidence regions for the true regression function for any user-chosen confidence level. Then, specific algorithms are suggested to demonstrate the framework. It is proved that the constructed confidence regions are strongly consistent, that is, any false model is excluded in the long run with probability one. The exclusion is also quantified with probably approximately correct (PAC) type bounds. Finally, the algorithms are validated via numerical experiments, and the methods are compared to approximate asymptotic confidence ellipsoids.
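The link between the regression function, the Bayes optimal classifier, and the misclassification probability can be made explicit. The block below assumes labels encoded as $Y \in \{-1, +1\}$; the abstract does not fix an encoding, and the symbols $f^*$ and $g^*$ are notation introduced here for illustration.

```latex
\begin{align*}
f^*(x) &= \mathbb{E}[\,Y \mid X = x\,] = 2\,\mathbb{P}(Y = +1 \mid X = x) - 1, \\
g^*(x) &= \operatorname{sign}\bigl(f^*(x)\bigr), \\
\mathbb{P}\bigl(Y \neq g^*(x) \mid X = x\bigr) &= \frac{1 - |f^*(x)|}{2}.
\end{align*}
```

Under this encoding, a confidence region for $f^*$ therefore carries information about both the optimal decision and how uncertain that decision is at each input.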
Causal Discovery from Temporal Data: An Overview and New Perspectives
Gong, Chang, Yao, Di, Zhang, Chuzhe, Li, Wenbin, Bi, Jingping
Temporal data, representing chronological observations of complex systems, has always been a typical data structure that is widely generated in many domains, such as industry, medicine and finance. Analyzing this type of data is extremely valuable for various applications. Thus, different temporal data analysis tasks, e.g., classification, clustering and prediction, have been proposed in the past decades. Among them, causal discovery, learning the causal relations from temporal data, is considered an interesting yet critical task and has attracted much research attention. Existing causal discovery works can be divided into two highly correlated categories according to whether the temporal data is calibrated, i.e., multivariate time series causal discovery and event sequence causal discovery. However, most previous surveys focus only on time series causal discovery and ignore the second category. In this paper, we specify the correlation between the two categories and provide a systematic overview of existing solutions. Furthermore, we provide public datasets, evaluation metrics and new perspectives for temporal data causal discovery.