AITopics

2409.16609

Country: North America > United States (1.00)

Genre:

Workflow (1.00)
Research Report > Promising Solution (0.34)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.90)

F., Paulo C. Marques, Artes, Rinaldo, Graziadei, Helton

Projected random forests and conformal prediction of circular data

arXiv.org Machine Learning, Dec-25-2024

We apply split conformal prediction techniques to regression problems with circular responses by introducing a suitable conformity score, leading to prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under exchangeable data. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear response regression model into one suitable for circular responses. When random forests serve as basis models in this projection procedure, we harness the out-of-bag dynamics to eliminate the necessity for a separate calibration sample in the construction of prediction sets. For synthetic and real datasets the resulting projected random forests model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, when compared to the split conformal prediction sets generated by two existing alternative models.
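The split conformal recipe in this abstract can be illustrated with a minimal sketch. It assumes the plain angular distance between prediction and response as the conformity score, which gives arcs of one fixed width; the paper's own score, which yields adaptive arc lengths, is not reproduced here. The function names and the toy calibration data are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def angular_distance(a, b):
    """Shortest arc length between two angles in radians, in [0, pi]."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def conformal_radius(cal_preds, cal_targets, alpha):
    """Split conformal quantile of the calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score. If that rank
    exceeds n, the prediction set is the whole circle (radius pi).
    """
    scores = sorted(angular_distance(p, y) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.pi
    return scores[rank - 1]


def in_prediction_arc(pred, radius, theta):
    """True if angle theta lies in the arc of half-width radius around pred."""
    return angular_distance(pred, theta) <= radius


# Toy calibration set (assumed data): predictions at angle 0, responses nearby,
# one of them on the other side of the 0 / 2*pi wrap-around.
cal_preds = [0.0, 0.0, 0.0]
cal_targets = [0.1, 0.2, TWO_PI - 0.3]
radius = conformal_radius(cal_preds, cal_targets, alpha=0.5)
print(radius)
```

Under exchangeability, the arc of half-width `radius` around a new prediction covers the true angle with probability at least 1 - alpha. Measuring distance along the shorter arc is what keeps the score meaningful across the wrap-around point.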

artificial intelligence, machine learning, prediction, (19 more...)

2410.24145

Country:

South America > Brazil (0.29)
Europe > Austria (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)

Chen, Kuan-Yu, Chiang, Ping-Han, Chou, Hsin-Rung, Chen, Chih-Sheng, Chang, Tien-Hao

DOFEN: Deep Oblivious Forest ENsemble

arXiv.org Machine Learning, Dec-24-2024

Deep Neural Networks (DNNs) have revolutionized artificial intelligence, achieving impressive results on diverse data types, including images, videos, and texts. However, DNNs still lag behind Gradient Boosting Decision Trees (GBDT) on tabular data, a format extensively utilized across various domains. In this paper, we propose DOFEN, short for Deep Oblivious Forest ENsemble, a novel DNN architecture inspired by oblivious decision trees. DOFEN constructs relaxed oblivious decision trees (rODTs) by randomly combining conditions for each column and further enhances performance with a two-level rODT forest ensembling process. By employing this approach, DOFEN achieves state-of-the-art results among DNNs and further narrows the gap between DNNs and tree-based models on the well-recognized Tabular Benchmark (Grinsztajn et al., 2022), which includes 73 total datasets spanning a wide array of domains. The code of DOFEN is available at: https://github.com/Sinopac-Digital-Technology-Division/DOFEN.
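For readers unfamiliar with the oblivious decision trees that inspire this architecture, the sketch below shows how a plain (non-relaxed) oblivious tree makes a prediction: every node at a given depth tests the same feature against the same threshold, so the outcomes of the tests form a bit string that indexes a leaf. This is not DOFEN's rODT construction; the function name, splits, and leaf values are hypothetical.

```python
def odt_predict(x, splits, leaf_values):
    """Evaluate a plain oblivious decision tree on one sample.

    splits      : list of (feature_index, threshold), one pair per depth level
    leaf_values : list of 2 ** len(splits) outputs, indexed by the test outcomes
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit  # append this level's outcome to the leaf index
    return leaf_values[index]


# Depth-2 toy tree (assumed values): level 0 tests x[0] > 0.5, level 1 tests x[1] > 2.0.
splits = [(0, 0.5), (1, 2.0)]
leaf_values = [10, 20, 30, 40]
print(odt_predict([0.2, 3.0], splits, leaf_values))  # outcomes 0, 1 -> leaf 1 -> 20
```

Because all nodes on a level share one condition, a tree of depth d needs only d comparisons and a table lookup, which is what makes the structure easy to express with tensor operations.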

artificial intelligence, deep learning, machine learning, (20 more...)

2412.16534

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Fumagalli, Fabian, Muschalik, Maximilian, Hüllermeier, Eyke, Hammer, Barbara, Herbinger, Julia

Unifying Feature-Based Explanations with Functional ANOVA and Cooperative Game Theory

arXiv.org Machine Learning, Dec-22-2024

Feature-based explanations, using perturbations or gradients, are a prevalent tool to understand decisions of black box machine learning models. Yet, differences between these methods still remain mostly unknown, which limits their applicability for practitioners. In this work, we introduce a unified framework for local and global feature-based explanations using two well-established concepts: functional ANOVA (fANOVA) from statistics, and the notion of value and interaction from cooperative game theory. We introduce three fANOVA decompositions that determine the influence of feature distributions, and use game-theoretic measures, such as the Shapley value and interactions, to specify the influence of higher-order interactions. Our framework combines these two dimensions to uncover similarities and differences between a wide range of explanation techniques for features and groups of features. We then empirically showcase the usefulness of our framework on synthetic and real-world datasets.
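The Shapley value mentioned in this abstract can be computed exactly for a small cooperative game by averaging each player's marginal contribution over all coalitions. The sketch below is a generic illustration, not the paper's framework; the payoff function `toy_game` and its numbers are hypothetical, with features playing the role of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_players, value):
    """Exact Shapley values for a game `value` defined on frozensets of player indices."""
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size preceding player i.
            weight = factorial(size) * factorial(n_players - size - 1) / factorial(n_players)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_game(coalition):
    """Hypothetical payoff: main effects 1 and 2 for features 0 and 1,
    plus an interaction of 4 when both are present; feature 2 has no effect."""
    total = 0.0
    if 0 in coalition:
        total += 1.0
    if 1 in coalition:
        total += 2.0
    if 0 in coalition and 1 in coalition:
        total += 4.0
    return total


print(shapley_values(3, toy_game))  # approximately [3, 4, 0]
```

The interaction of 4 is split evenly between features 0 and 1, the inactive feature receives zero, and the values sum to the payoff of the full coalition. Exact enumeration costs on the order of 2^n evaluations, so it only suits small n.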

artificial intelligence, machine learning, natural language, (18 more...)

2412.17152

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Austria > Vienna (0.14)
Europe > Italy > Marche > Ancona Province > Ancona (0.04)
(6 more...)

Genre: Research Report (0.81)

Industry: Leisure & Entertainment > Games (0.85)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.87)
(3 more...)

arXiv.org Artificial Intelligence, Dec-21-2024

Back To The Future: A Hybrid Transformer-XGBoost Model for Action-oriented Future-proofing Nowcasting

Sun, Ziheng

The interplay between past, present, and future is a central theme in the iconic movie Back to the Future, where small alterations in past events have profound, cascading effects on the future [1]. This concept mirrors the intricate and often non-linear relationships in real-world systems, where predictions about the future are not merely passive observations but active drivers of current decisions and behaviors. In the film, the characters reshape their present and future by altering past events, reflecting the power of temporal causality--the idea that events in time are interconnected, and that actions taken now have consequences for the future. In much the same way, effective nowcasting--predicting short-term outcomes like weather, natural hazards, health events, or traffic patterns--should not only anticipate what will happen but also incorporate how those predictions can influence present decisions and conditions. Traditional nowcasting methods, however, often focus exclusively on making predictions about future states without considering the active feedback loop that can be created by those predictions [2].

artificial intelligence, machine learning, prediction, (18 more...)

2412.19832

Genre: Research Report (1.00)

Industry:

Health & Medicine (0.95)
Media > Film (0.68)
Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.90)

arXiv.org Artificial Intelligence, Dec-21-2024

Enhancing web traffic attacks identification through ensemble methods and feature selection

Urda, Daniel, Martínez, Branly, Basurto, Nuño, Kull, Meelis, Arroyo, Ángel, Herrero, Álvaro

Websites, as essential digital assets, are highly vulnerable to cyberattacks because of their high traffic volume and the significant impact of breaches. This study aims to enhance the identification of web traffic attacks by leveraging machine learning techniques. A methodology was proposed to extract relevant features from HTTP traces using the CSIC2010 v2 dataset, which simulates e-commerce web traffic. Ensemble methods, such as Random Forest and Extreme Gradient Boosting, were employed and compared against baseline classifiers, including k-nearest Neighbor, LASSO, and Support Vector Machines. The results demonstrate that the ensemble methods outperform baseline classifiers by approximately 20% in predictive accuracy, achieving an Area Under the ROC Curve (AUC) of 0.989. Feature selection methods such as Information Gain, LASSO, and Random Forest further enhance the robustness of these models. This study highlights the efficacy of ensemble models in improving attack detection while minimizing performance variability, offering a practical framework for securing web traffic in diverse application contexts.
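The Area Under the ROC Curve reported in this abstract has a simple pairwise reading: it is the probability that a randomly chosen attack sample receives a higher score than a randomly chosen normal sample, with ties counted as one half. The sketch below computes it that way on assumed toy data; it does not use the CSIC2010 v2 dataset or the study's models.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels : iterable of 0 (negative) or 1 (positive)
    scores : classifier scores, higher meaning more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Toy scores (assumed): three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic-time form is meant for clarity; sorting-based implementations give the same value in O(n log n).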

artificial intelligence, classifier, machine learning, (17 more...)

2412.16791

Country:

Europe > Spain (0.46)
North America > United States > New York (0.28)

Genre: Research Report > New Finding (0.49)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military > Cyberwarfare (0.48)
Information Technology > Services > e-Commerce Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)

Chevalier, Dominik, Côté, Marie-Pier

From Point to probabilistic gradient boosting for claim frequency and severity prediction

arXiv.org Machine Learning, Dec-19-2024

Gradient boosting for decision tree algorithms are increasingly used in actuarial applications as they show superior predictive performance over traditional generalized linear models. Many improvements and sophistications to the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting for decision tree algorithms: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive numerical study, we compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high cardinality) categorical variables. We explain how varying exposure-to-risk can be handled with boosting in frequency models. We compare the algorithms on the basis of computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. The fully interpretable EGBM achieves competitive predictive performance compared to the black box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.
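A common way to handle varying exposure-to-risk in a Poisson frequency model, sketched here under the assumption of a log link, is to model the expected claim count as exposure times a rate, so that log(exposure) acts as an offset. The snippet shows the resulting constant-rate estimate and the Poisson deviance used to score a fit. It is a generic illustration, not the paper's notation, and the function names and data are hypothetical.

```python
import math


def fit_constant_rate(claims, exposure):
    """Maximum-likelihood claim rate per unit of exposure for a constant model."""
    return sum(claims) / sum(exposure)


def poisson_deviance(claims, exposure, rates):
    """Poisson deviance where the expected count is exposure * rate per policy."""
    dev = 0.0
    for n, e, r in zip(claims, exposure, rates):
        mu = e * r
        # n * log(n / mu) is taken as 0 when n == 0 (its limiting value).
        term = n * math.log(n / mu) if n > 0 else 0.0
        dev += 2.0 * (term - (n - mu))
    return dev


# Toy portfolio (assumed data): three policies observed for different fractions of a year.
claims = [0, 1, 2]
exposure = [0.5, 1.0, 0.5]
rate = fit_constant_rate(claims, exposure)
print(rate)  # 1.5 claims per policy-year
print(poisson_deviance(claims, exposure, [rate] * 3))
```

Dividing total claims by total exposure, rather than averaging per-policy counts, is what the offset achieves: a policy observed for half a year counts for half as much expected volume. A boosting model would replace the constant rate with a tree ensemble while keeping the same exposure scaling.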

algorithm, artificial intelligence, machine learning, (16 more...)

2412.14916

Country:

North America (0.46)
Europe (0.28)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance > Insurance (1.00)
Transportation (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Bøhn, Eivind, Eidnes, Sølve, Jonassen, Kjell Rune

Machine learning in wastewater treatment: insights from modelling a pilot denitrification reactor

arXiv.org Artificial Intelligence, Dec-18-2024

Wastewater treatment plants are increasingly recognized as promising candidates for machine learning applications, due to their societal importance and high availability of data. However, their varied designs, operational conditions, and influent characteristics hinder straightforward automation. In this study, we use data from a pilot reactor at the Veas treatment facility in Norway to explore how machine learning can be used to optimize biological nitrate (NO₃⁻) reduction to molecular nitrogen (N₂) in the biogeochemical process known as denitrification. Rather than focusing solely on predictive accuracy, our approach prioritizes understanding the foundational requirements for effective data-driven modelling of wastewater treatment. Specifically, we aim to identify which process parameters are most critical, the necessary data quantity and quality, how to structure data effectively, and what properties are required by the models. We find that nonlinear models perform best on the training and validation data sets, indicating nonlinear relationships to be learned, but linear models transfer better to the unseen test data, which comes later in time. The variable measuring the water temperature has a particularly detrimental effect on the models, owing to a significant change in distributions between training and test data. We therefore conclude that multiple years of data are necessary to learn robust machine learning models. By addressing foundational elements, particularly in the context of the climatic variability faced by northern regions, this work lays the groundwork for a more structured and tailored approach to machine learning for wastewater treatment. We share publicly both the data and code used to produce the results in the paper.

artificial intelligence, deep learning, machine learning, (12 more...)

2412.1403

Country:

Europe > Norway > Eastern Norway > Oslo (0.04)
Europe > Sweden > Västerbotten County > Umeå (0.04)
Africa > Eswatini > Manzini > Manzini (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Water & Waste Management > Water Management > Water Supplies & Services (1.00)
Water & Waste Management > Water Management > Lifecycle > Treatment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)

arXiv.org Artificial Intelligence, Dec-18-2024

Flow Exporter Impact on Intelligent Intrusion Detection Systems

Pinto, Daniela, Vitorino, João, Maia, Eva, Amorim, Ivone, Praça, Isabel

High-quality datasets are critical for training machine learning models, as inconsistencies in feature generation can hinder the accuracy and reliability of threat detection. For this reason, ensuring the quality of the data in network intrusion detection datasets is important. A key component of this is using reliable tools to generate the flows and features present in the datasets. This paper investigates the impact of flow exporters on the performance and reliability of machine learning models for intrusion detection. Using HERA, a tool designed to export flows and extract features, the raw network packets of two widely used datasets, UNSW-NB15 and CIC-IDS2017, were processed from PCAP files to generate new versions of these datasets. These were compared to the original ones in terms of their influence on the performance of several models, including Random Forest, XGBoost, LightGBM, and Explainable Boosting Machine. The results obtained were significant. Models trained on the HERA version of the datasets consistently outperformed those trained on the original dataset, showing improvements in accuracy and indicating a better generalisation. This highlighted the importance of flow generation in the model's ability to differentiate between benign and malicious traffic.

artificial intelligence, dataset, machine learning, (15 more...)

2412.14021

Country:

Europe > Portugal > Porto > Porto (0.04)
Oceania > Australia > New South Wales (0.04)
Europe > Switzerland (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.46)

Schwerter, Jakob, Romero, Andrés, Dumpert, Florian, Pauly, Markus

Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study

arXiv.org Machine Learning, Dec-18-2024

Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.

artificial intelligence, imputation method, machine learning, (14 more...)

2412.1357

Country:

Europe > Austria > Vienna (0.14)
Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Education > Educational Setting (0.93)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)