AITopics

2406.06728

Country:

North America > United States (0.46)
Asia > India (0.04)
Asia > Singapore (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Nephrology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(4 more...)

arXiv.org Artificial Intelligence Jun-10-2024

Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification

Liu, Quangao, Yang, Wei, Liang, Chen, Pang, Longlong, Zou, Zhuozhang

Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, which is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle categorical features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

categorical feature, dataset, feature identifier, (14 more...)

2406.06891

Country: Asia > China > Liaoning Province > Shenyang (0.06)

Genre: Research Report > Promising Solution (0.68)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Janssen, Joseph, Tootchi, Ardalan, Ameli, Ali A.

A critical appraisal of water table depth estimation: Challenges and opportunities within machine learning

arXiv.org Machine Learning Jun-9-2024

Fine-resolution spatial patterns of water table depth (WTD) play a crucial role in shaping ecological resilience, hydrological connectivity, and anthropocentric objectives. Generally, a large-scale (e.g., continental or global) spatial map of static WTD can be simulated using either physically-based (PB) or machine learning-based (ML) models. We construct three fine-resolution (500 m) ML simulations of WTD, using the XGBoost algorithm and more than 20 million real and proxy observations of WTD, across the United States and Canada. The three ML models were constrained using known physical relations between WTD's drivers and WTD and were trained by sequentially adding real and proxy observations of WTD. We interpret the black box of our physically constrained ML models and compare it against available literature in groundwater hydrology. Through an extensive (pixel-by-pixel) evaluation, we demonstrate that our models can more accurately predict unseen real and proxy observations of WTD across most of North America's ecoregions compared to three available PB simulations of WTD. However, we still argue that large-scale WTD estimation is far from being a solved problem. We reason that due to biased observational data mainly collected from low-elevation floodplains, the misspecification of equations within physically-based models, and the over-flexibility of machine learning models, verifiably accurate simulations of WTD do not yet exist. Ultimately, we thoroughly discuss future directions that may help hydrogeologists decide how to proceed with WTD estimations, with a particular focus on the application of machine learning and the use of proxy satellite data.

artificial intelligence, machine learning, simulation, (20 more...)

2405.04579

Country:

North America > United States (1.00)
North America > Canada (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Oil & Gas > Upstream (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

arXiv.org Artificial Intelligence Jun-7-2024

Advanced Payment Security System: XGBoost, CatBoost and SMOTE Integrated

Zheng, Qi, Yu, Chang, Cao, Jin, Xu, Yongshun, Xing, Qianwen, Jin, Yinxin

With the rise of various online and mobile payment systems, transaction fraud has become a significant threat to financial security. This study explores the application of advanced machine learning models, specifically XGBoost and LightGBM, for developing a more accurate and robust Payment Security Protection Model. To enhance data reliability, we meticulously processed the data sources and used SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve data representation. By selecting highly correlated features, we aimed to strengthen the training process and boost model performance. We conducted thorough performance evaluations of our proposed models, comparing them against traditional methods including Random Forest, Neural Network, and Logistic Regression. Key metrics such as Precision, Recall, and F1 Score were used to rigorously assess their effectiveness. Our detailed analyses and comparisons reveal that the combination of SMOTE with XGBoost and LightGBM offers a highly efficient and powerful mechanism for payment security protection. The results show that these models not only outperform traditional approaches but also hold significant promise for advancing the field of transaction fraud prevention.
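The SMOTE step named in this abstract can be illustrated with a minimal sketch. It shows only the core idea of the technique: a synthetic minority sample is drawn on the line segment between a minority point and one of its nearest minority neighbours. The function name `smote_sketch`, the parameters `k` and `seed`, and the toy data are assumptions made for illustration, and nothing here reproduces the authors' pipeline.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate: x + lam * (neighbour - x), with lam drawn from [0, 1).
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region the minority class already occupies. In practice the oversampled set would then be passed to a gradient-boosting classifier, which is outside the scope of this sketch.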

dataset, smote, xgboost, (14 more...)

2406.04658

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > United States > Massachusetts > Middlesex County > Lowell (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

arXiv.org Machine Learning Jun-6-2024

Enhancing Supervised Visualization through Autoencoder and Random Forest Proximities for Out-of-Sample Extension

Ni, Shuang, Aumon, Adrien, Wolf, Guy, Moon, Kevin R., Rhodes, Jake S.

The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction method, RF-PHATE, combining information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we identify that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, thus serving as a semi-supervised method, and can achieve consistent quality using only 10% of the training data.

architecture, proximity, rf-phate, (15 more...)

2406.04421

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Utah (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)

Mejía-Fragoso, Juan Camilo, Florez, Manuel A., Bernal-Olaya, Rocío

Predicting the Geothermal Gradient in Colombia: a Machine Learning Approach

arXiv.org Artificial Intelligence Jun-5-2024

Accurate determination of the geothermal gradient is critical for assessing the geothermal energy potential of a given region. Of particular interest is the case of Colombia, a country with abundant geothermal resources. A history of active oil and gas exploration and production has left drilled boreholes in different geological settings, providing direct measurements of the geothermal gradient. Unfortunately, large regions of the country where geothermal resources might exist lack such measurements. Indirect geophysical measurements are costly and difficult to perform at regional scales. Computational thermal models could be constructed, but they require very detailed knowledge of the underlying geology and uniform sampling of subsurface temperatures to be well-constrained. We present an alternative approach that leverages recent advances in supervised machine learning and available direct measurements to predict the geothermal gradient in regions where only global-scale geophysical datasets and coarse geological knowledge are available. We find that a Gradient Boosted Regression Tree algorithm yields optimal predictions and extensively validate the trained model. We show that predictions of our model are within 12% accuracy and that independent measurements performed by other authors agree well with our model. Finally, we present a geothermal gradient map for Colombia that highlights regions where further exploration and data collection should be performed.

colombia, geothermal gradient, gradient, (14 more...)

doi: 10.1016/j.geothermics.2024.103074

2404.05184

Country:

North America > Panama (0.14)
North America > United States > Texas (0.14)
Antarctica (0.04)
(15 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development (1.00)
Energy > Renewable > Geothermal > Geothermal Resource Type (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.67)

Bergonzoli, Giulia, Rossi, Lidia, Masci, Chiara

Ordinal Mixed-Effects Random Forest

arXiv.org Machine Learning Jun-5-2024

We propose an innovative statistical method, called Ordinal Mixed-Effects Random Forest (OMERF), that extends the use of random forest to the analysis of hierarchical data and ordinal responses. The model preserves the flexibility and ability of modeling complex patterns of both categorical and continuous variables, typical of tree-based ensemble methods, and, at the same time, takes into account the structure of hierarchical data, modeling the dependence structure induced by the grouping and allowing statistical inference at all data levels. A simulation study is conducted to validate the performance of the proposed method and to compare it to that of other state-of-the-art models. The application of OMERF is exemplified in a case study focusing on predicting student performance using data from the Programme for International Student Assessment (PISA) 2022. The model identifies discriminating student characteristics and estimates the school effect.

numeric mean, omerf, random forest, (15 more...)

2406.0313

Country:

Europe > Austria > Vienna (0.14)
North America > United States (0.04)
Europe > Italy > Lombardy > Milan (0.04)
North America > Canada > British Columbia > Regional District of Central Okanagan > Kelowna (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Education > Assessment & Standards > Student Performance (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.85)

Sluijterman, Laurens, Kreuwel, Frank, Cator, Eric, Heskes, Tom

Composite Quantile Regression With XGBoost Using the Novel Arctan Pinball Loss

arXiv.org Machine Learning Jun-4-2024

This paper explores the use of XGBoost for composite quantile regression. XGBoost is a highly popular model renowned for its flexibility, efficiency, and capability to deal with missing data. The optimization uses a second order approximation of the loss function, complicating the use of loss functions with a zero or vanishing second derivative. Quantile regression -- a popular approach to obtain conditional quantiles when point estimates alone are insufficient -- unfortunately uses such a loss function, the pinball loss. Existing workarounds are typically inefficient and can result in severe quantile crossings. In this paper, we present a smooth approximation of the pinball loss, the arctan pinball loss, that is tailored to the needs of XGBoost. Specifically, contrary to other smooth approximations, the arctan pinball loss has a relatively large second derivative, which makes it more suitable to use in the second order approximation. Using this loss function enables the simultaneous prediction of multiple quantiles, which is more efficient and results in far fewer quantile crossings.
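The second-derivative issue described in this abstract can be made concrete with a short sketch. The standard pinball loss for the residual u = y - prediction is piecewise linear, so its second derivative is zero wherever it is defined. The smooth variant below is an assumption: it is one arctan-based form that matches the description (it tends to the pinball loss as the smoothing parameter `s` shrinks and has a strictly positive second derivative), and the exact definition used in the paper may differ. All function names are hypothetical.

```python
import math

def pinball(u, tau):
    """Standard pinball loss for residual u and quantile level tau."""
    return tau * u if u >= 0 else (tau - 1.0) * u

def arctan_pinball(u, tau, s):
    """Assumed smooth approximation: (tau - 1/2 + arctan(u/s)/pi) * u + s/pi."""
    return (tau - 0.5 + math.atan(u / s) / math.pi) * u + s / math.pi

def arctan_pinball_hess(u, s):
    """Second derivative of the assumed form with respect to u: 2 s^3 / (pi (s^2 + u^2)^2)."""
    return 2.0 * s ** 3 / (math.pi * (s * s + u * u) ** 2)

# The smooth loss approaches the pinball loss as s becomes small.
gap = abs(arctan_pinball(2.0, 0.9, 1e-6) - pinball(2.0, 0.9))
# The curvature is positive everywhere and decays polynomially in |u|.
curvature_at_zero = arctan_pinball_hess(0.0, 1.0)
```

A second-order booster divides the gradient by the Hessian when computing leaf values, so a loss whose curvature does not vanish too quickly away from zero gives better-behaved updates. That is the property the abstract attributes to the arctan form.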

pinball loss, quantile, second derivative, (12 more...)

2406.02293

Country:

Europe > Netherlands > Gelderland > Nijmegen (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > Netherlands > Gelderland > Arnhem (0.04)
Asia > Middle East > Republic of Türkiye > Antalya Province > Antalya (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine (1.00)
Energy (1.00)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Jaakkola, Reijo, Janhunen, Tomi, Kuusisto, Antti, Rankooh, Masood Feyzbakhsh, Vilander, Miikka

Globally Interpretable Classifiers via Boolean Formulas with Dynamic Propositions

arXiv.org Artificial Intelligence Jun-3-2024

Interpretability and explainability are among the most important challenges of modern artificial intelligence, being mentioned even in various legislative sources. In this article, we develop a method for extracting immediately human interpretable classifiers from tabular data. The classifiers are given in the form of short Boolean formulas built with propositions that can either be directly extracted from categorical attributes or dynamically computed from numeric ones. Our method is implemented using Answer Set Programming. We investigate seven datasets and compare our results to ones obtainable by state-of-the-art classifiers for tabular data, namely, XGBoost and random forests. Over all datasets, the accuracies obtainable by our method are similar to the reference methods. The advantage of our classifiers in all cases is that they are very short and immediately human intelligible as opposed to the black-box nature of the reference methods.

boolean formula, dynamic proposition, globally interpretable classifier

2406.01114

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)

Mangina, Vamsi Sai Ranga Sri Harsha

Adaptive boosting with dynamic weight adjustment

arXiv.org Artificial Intelligence Jun-1-2024

Adaptive Boosting with Dynamic Weight Adjustment is an enhancement of the traditional Adaptive boosting commonly known as AdaBoost, a powerful ensemble learning technique. Adaptive Boosting with Dynamic Weight Adjustment technique improves the efficiency and accuracy by dynamically updating the weights of the instances based on prediction error where the weights are updated in proportion to the error rather than updating weights uniformly as we do in traditional Adaboost.

... complex relationships among the data, we can use Adaptive Boosting with Dynamic Weight Adjustment. Adaptive Boosting with Dynamic Weight Adjustment is an enhancement of the traditional AdaBoost technique where the weight updation process in Adaptive Boosting with Dynamic Weight Adjustment is more adaptive by taking classification errors and the overall error distribution and based on the individual instances. This enables our model to work with multiclass and more complex data efficiently, enhancing the performance and its efficiency compared to ...
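For reference, the baseline that this abstract builds on can be sketched as one round of the classic binary AdaBoost weight update. This is the traditional scheme only. The dynamic, error-proportional adjustment proposed in the paper is not specified in the text above and is not reproduced here. The function name `adaboost_round` and the toy labels are assumptions made for illustration.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of the classic AdaBoost update, with labels in {-1, +1}.

    Returns the weak learner's vote alpha and the renormalised sample weights.
    """
    # Weighted error of the weak learner under the current distribution.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so mistakes
    # are scaled by exp(+alpha) and correct samples by exp(-alpha).
    scaled = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    y_true=[1, 1, -1, -1],
    y_pred=[1, 1, -1, 1],  # the last sample is misclassified
)
```

Note that every misclassified sample is multiplied by the same factor in this baseline, regardless of how wrong the prediction was. After renormalisation the misclassified samples together carry half of the total weight, which is what forces the next weak learner to focus on them.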

adaboost, adjustment, learner, (11 more...)

2406.00524

Genre:

Research Report (1.00)
Overview (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.31)