 megp


From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification

arXiv.org Artificial Intelligence

We study symbolic surrogate modeling of frozen Transformer embeddings to obtain compact, auditable classifiers with calibrated probabilities. For five benchmarks (SST2G, 20NG, MNIST, CIFAR10, MSC17), embeddings from ModernBERT, DINOv2, and SigLIP are partitioned on the training set into disjoint, information-preserving views via semantic-preserving feature partitioning (SPFP). A cooperative multi-population genetic program (MEGP) then learns additive, closed-form logit programs over these views. Across 30 runs per dataset we report F1, AUC, log-loss, Brier, expected calibration error (ECE), and symbolic complexity; a canonical model is chosen by a one-standard-error rule on validation F1 with a parsimony tie-break. Temperature scaling fitted on validation yields substantial ECE reductions on test. The resulting surrogates achieve strong discrimination (up to F1 around 0.99 on MNIST, CIFAR10, MSC17; around 0.95 on SST2G), while 20NG remains most challenging. We provide reliability diagrams, dimension usage and overlap statistics, contribution-based importances, and global effect profiles (PDP and ALE), demonstrating faithful, cross-modal explanations grounded in explicit programs.
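The calibration step mentioned in the abstract (temperature scaling fitted on validation, ECE measured on test) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the grid search over temperatures, and the 10-bin equal-width ECE are assumptions chosen for brevity.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits divided by a temperature.
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    # Mean negative log-likelihood (log-loss) at a given temperature.
    p = softmax(logits, temperature)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(val_logits, val_labels, grid=None):
    # Assumption: a simple grid search stands in for a gradient-based fit.
    grid = np.linspace(0.05, 10.0, 200) if grid is None else np.asarray(grid)
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def expected_calibration_error(probs, labels, n_bins=10):
    # Equal-width binning on top-class confidence; ECE is the
    # bin-weighted mean of |accuracy - confidence|.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

In use, one would call `fit_temperature` on validation logits and labels, then compare `expected_calibration_error(softmax(test_logits), y_test)` against `expected_calibration_error(softmax(test_logits, T), y_test)`. Because dividing logits by a positive temperature does not change the argmax, F1 and accuracy are unaffected by this step.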


Enhanced Genetic Programming Models with Multiple Equations for Accurate Semi-Autogenous Grinding Mill Throughput Prediction

arXiv.org Artificial Intelligence

Semi-autogenous grinding (SAG) mills play a pivotal role in the grinding circuit of mineral processing plants, and throughput is one of their crucial performance metrics, so predicting it accurately is of great importance. The potential of genetic programming (GP) for this purpose has yet to be thoroughly investigated. This study introduces an enhanced GP approach, multi-equation GP (MEGP), for more accurate prediction of SAG mill throughput. The proposed method extracts multiple equations, each of which accurately predicts mill throughput for a specific cluster of the training data. These equations are then used to predict mill throughput for test data using various approaches. To assess the effect of the distance measure, four measures are employed in the MEGP method: Euclidean, Manhattan, Chebyshev, and cosine distance. Comparative analysis reveals that the best MEGP approach achieves an average improvement of 10.74% in prediction accuracy compared with standard GP. This approach uses all extracted equations and incorporates both the number of data points in each cluster and the distance to each cluster when calculating the final prediction. Among the four measures, the Euclidean distance yields the most accurate results for the majority of data splits.
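The abstract says the best variant combines all cluster equations using both cluster size and distance to each cluster, but it does not give the formula. The sketch below shows one plausible reading, with weights proportional to cluster size divided by distance. The weighting rule, the use of cluster centroids, and all function names are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def distance(x, c, metric="euclidean"):
    # The four distance measures named in the abstract.
    x, c = np.asarray(x, dtype=float), np.asarray(c, dtype=float)
    diff = x - c
    if metric == "euclidean":
        return float(np.sqrt(np.sum(diff ** 2)))
    if metric == "manhattan":
        return float(np.sum(np.abs(diff)))
    if metric == "chebyshev":
        return float(np.max(np.abs(diff)))
    if metric == "cosine":
        denom = np.linalg.norm(x) * np.linalg.norm(c)
        # Cosine distance is undefined for zero vectors; fall back to 1.0.
        return float(1.0 - np.dot(x, c) / denom) if denom > 0 else 1.0
    raise ValueError(f"unknown metric: {metric}")

def combine_predictions(x, centroids, sizes, equations,
                        metric="euclidean", eps=1e-9):
    # Assumed rule: weight_k is proportional to size_k / distance_k,
    # so large, nearby clusters dominate the final prediction.
    d = np.array([distance(x, c, metric) for c in centroids])
    w = np.asarray(sizes, dtype=float) / (d + eps)
    w = w / w.sum()
    preds = np.array([eq(x) for eq in equations], dtype=float)
    return float(np.dot(w, preds))
```

Here each entry of `equations` is a callable standing in for one evolved GP expression, for example `lambda x: 2.0 * x[0] + 0.5 * x[1]`. Swapping the `metric` argument reproduces the kind of comparison across distance measures that the abstract describes.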