AITopics

2309.09993

Country:

Asia > Middle East > Yemen > Amran Governorate > Amran (0.06)
Asia > Middle East > Iraq > Al Qadisiyah Governorate (0.05)
Asia > China > Hubei Province > Wuhan (0.05)
(7 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.34)

Gahar, Rania Mkhinini, Hidri, Adel, Hidri, Minyar Sassi

Let's Predict Who Will Move to a New Job

arXiv.org Artificial Intelligence, Sep-15-2023

Any company's human resources department faces the challenge of predicting whether an applicant will search for a new job or stay with the company. In this paper, we discuss how machine learning (ML) is used to predict who will move to a new job. First, the data is pre-processed into a suitable format for ML models. To deal with categorical features, data encoding is applied, and several ML algorithms (MLAs) are applied, including Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), and eXtreme Gradient Boosting (XGBoost). To improve the performance of the ML models, the synthetic minority oversampling technique (SMOTE) is used to retrain them. Models are assessed using decision support metrics such as precision, recall, F1-score, and accuracy.
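The oversampling and evaluation steps named in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function names `smote_sketch` and `classification_metrics`, the neighbour count `k`, and the toy inputs are all assumptions made for illustration. The sketch shows the core idea of SMOTE (a synthetic minority point is an interpolation between a minority sample and one of its nearest minority neighbours) and how the four reported metrics follow from confusion-matrix counts.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolation (SMOTE idea).

    minority: list of equal-length numeric tuples (needs at least 2 points).
    k: number of nearest minority neighbours to choose from (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

In practice, oversampling is normally applied to the training split only, so that the metrics are computed on untouched held-out data.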

categorical variable, category, minority class, (15 more...)

doi: 10.1109/IC_ASET58101.2023.10150675

2309.08333

Country:

Asia > Middle East > Saudi Arabia > Eastern Province > Dammam (0.05)
Africa > Middle East > Tunisia > Tunis Governorate > Tunis (0.05)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report (0.91)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Fouodo, Cesaire J. K., Kronziel, Lea L., König, Inke R., Szymczak, Silke

Effect of hyperparameters on variable selection in random forests

arXiv.org Machine Learning, Sep-13-2023

Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effect of hyperparameters of the RF algorithm on prediction performance and variable importance estimation has previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects on the Vita and the Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables (mtry.prop) and the sample fraction (sample.fraction) for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of mtry is optimal, but smaller values of sample.fraction result in larger sensitivity. In contrast, the difference in sensitivity of the optimal compared to the default value of sample.fraction is negligible for strongly correlated predictor variables, whereas smaller values than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection.
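The two evaluation criteria in this abstract, sensitivity and FDR, can be made concrete with a short sketch. It assumes a simulation setting in which the truly important variables are known; the function names and the variable labels in the usage note are hypothetical and not taken from the paper. For a single run the second quantity is the false discovery proportion, and averaging it over simulation replicates gives an empirical estimate of the FDR.

```python
def selection_quality(selected, truly_important):
    """Return (sensitivity, false discovery proportion) for one selection run."""
    selected = set(selected)
    truly_important = set(truly_important)
    true_pos = len(selected & truly_important)
    false_pos = len(selected - truly_important)
    # Sensitivity: share of the truly important variables that were selected.
    sensitivity = true_pos / len(truly_important) if truly_important else 0.0
    # False discovery proportion: share of selected variables that are not important.
    fdp = false_pos / len(selected) if selected else 0.0
    return sensitivity, fdp


def average_over_replicates(runs, truly_important):
    """Average sensitivity and FDP over replicates (the latter estimates the FDR)."""
    results = [selection_quality(sel, truly_important) for sel in runs]
    n = len(results)
    mean_sensitivity = sum(s for s, _ in results) / n
    empirical_fdr = sum(f for _, f in results) / n
    return mean_sensitivity, empirical_fdr
```

For example, `selection_quality({"g1", "g2", "g7"}, {"g1", "g2", "g3", "g4"})` counts two true positives and one false positive, giving a sensitivity of 0.5 and a false discovery proportion of one third.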

artificial intelligence, machine learning, predictor variable, (17 more...)

2309.06943

Country:

Europe > Germany > Schleswig-Holstein (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.46)

Martín-Baos, José Ángel, López-Gómez, Julio Alberto, Rodriguez-Benitez, Luis, Hillel, Tim, García-Ródenas, Ricardo

A prediction and behavioural analysis of machine learning methods for modelling travel mode choice

arXiv.org Artificial Intelligence, Sep-12-2023

The emergence of a variety of Machine Learning (ML) approaches for travel mode choice prediction poses an interesting question to transport modellers: which models should be used for which applications? The answer to this question goes beyond simple predictive performance, and is instead a balance of many factors, including behavioural interpretability and explainability, computational complexity, and data efficiency. There is a growing body of research which attempts to compare the predictive performance of different ML classifiers with classical random utility models. However, existing studies typically analyse only the disaggregate predictive performance, ignoring other aspects affecting model choice. Furthermore, many studies are affected by technical limitations, such as the use of inappropriate validation schemes, incorrect sampling for hierarchical data, lack of external validation, and the exclusive use of discrete metrics. We address these limitations by conducting a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice (out-of-sample predictive performance, accuracy of predicted market shares, extraction of behavioural indicators, and computational efficiency). We combine several real world datasets with synthetic datasets, where the data generation function is known. The results indicate that the models with the highest disaggregate predictive performance (namely extreme gradient boosting and random forests) provide poorer estimates of behavioural indicators and aggregate mode shares, and are more expensive to estimate, than other models, including deep neural networks and Multinomial Logit (MNL). It is further observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
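As background to the abstract's contrast between disaggregate predictions and aggregate mode shares, the sketch below shows the standard Multinomial Logit choice probability, P_i = exp(V_i) / sum_j exp(V_j), and one common way of turning per-trip probabilities into predicted shares, namely averaging them over trips. The function names, the utility values, and the averaging convention are assumptions for illustration; the paper's own estimation and evaluation code is not reproduced here.

```python
import math


def mnl_probabilities(utilities):
    """Choice probabilities of a Multinomial Logit model for one trip.

    utilities: list of systematic utilities V_i, one per travel mode.
    """
    # Subtracting the maximum keeps exp() numerically stable without
    # changing the resulting probabilities.
    v_max = max(utilities)
    exps = [math.exp(v - v_max) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_mode_shares(utilities_per_trip):
    """Aggregate shares as the mean predicted probability of each mode."""
    probs = [mnl_probabilities(u) for u in utilities_per_trip]
    n_modes = len(probs[0])
    return [sum(p[i] for p in probs) / len(probs) for i in range(n_modes)]
```

Averaging probabilities, rather than counting the most likely mode per trip, is what allows shares to be compared against observed market shares even when a classifier's hard predictions are accurate at the individual level.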

dataset, probability, shap value, (16 more...)

doi: 10.1016/j.trc.2023.104318

2301.04404

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Spain > Castilla-La Mancha (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry: Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
(2 more...)

arXiv.org Machine Learning, Sep-9-2023

Online GentleAdaBoost -- Technical Report

Siu, Chapman

Boosting algorithms belong to a class of ensemble classification approaches which use weak assumptions on the learner to improve performance in an efficient manner. GentleBoost is an algorithm which was first introduced as an alternative AdaBoost approach which uses Newton steps rather than exact optimization on each step (see Friedman, Hastie, and Tibshirani 2000, p. 353). Unlike other AdaBoost variants, GentleBoost has not received as much attention, as it yields empirically inferior performance compared with other AdaBoost algorithms when used on a wide range of benchmark datasets. In machine learning, the ability to extend algorithms from a batch setting to an online setting is an important topic. Online approaches can operate on streams and use datasets which are too large to fit in memory. In this technical report we provide an approach to extend GentleBoost to the online setting using line search. In addition, we perform experiments to demonstrate that the algorithm is theoretically sound and has practical use cases.

algorithm, artificial intelligence, machine learning, (16 more...)

2308.14004

Country: North America > United States > New York > New York County > New York City (0.05)

Genre: Research Report (0.53)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.51)

Prasad-Rao, Jubilee, Heidary, Roohollah, Williams, Jesse

Detecting Manufacturing Defects in PCBs via Data-Centric Machine Learning on Solder Paste Inspection Features

arXiv.org Artificial Intelligence, Sep-6-2023

Automated detection of defects in Printed Circuit Board (PCB) manufacturing using Solder Paste Inspection (SPI) and Automated Optical Inspection (AOI) machines can help improve operational efficiency and significantly reduce the need for manual intervention. In this paper, using SPI-extracted features of 6 million pins, we demonstrate a data-centric approach to train Machine Learning (ML) models to detect PCB defects at three stages of PCB manufacturing. The 6 million PCB pins correspond to 2 million components that belong to 15,387 PCBs. Using a base extreme gradient boosting (XGBoost) ML model, we iterate on the data pre-processing step to improve detection performance. Combining pin-level SPI features using component and PCB IDs, we developed training instances also at the component and PCB level. This allows the ML model to capture any inter-pin, inter-component, or spatial effects that may not be apparent at the pin level. Models are trained at the pin, component, and PCB levels, and the detection results from the different models are combined to identify defective components.

data-centric machine learning, detecting manufacturing defect, solder paste inspection feature, (1 more...)

2309.03113

Genre: Research Report (0.40)

Industry: Law > Torts Law (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)

arXiv.org Machine Learning, Sep-5-2023

Monotone Tree-Based GAMI Models by Adapting XGBoost

Hu, Linwei, Aramideh, Soroush, Chen, Jie, Nair, Vijayan N.

Recent papers have used machine learning architecture to fit low-order functional ANOVA models with main effects and second-order interactions. These GAMI (GAM + Interaction) models are directly interpretable as the functional main effects and interactions can be easily plotted and visualized. Unfortunately, it is not easy to incorporate the monotonicity requirement into the existing GAMI models based on boosted trees, such as EBM (Lou et al. 2013) and GAMI-Lin-T (Hu et al. 2022). This paper considers models of the form $f(x)=\sum_{j,k}f_{j,k}(x_j, x_k)$ and develops monotone tree-based GAMI models, called monotone GAMI-Tree, by adapting the XGBoost algorithm. It is straightforward to fit a monotone model to $f(x)$ using the options in XGBoost. However, the fitted model is still a black box. We take a different approach: i) use a filtering technique to determine the important interactions, ii) fit a monotone XGBoost algorithm with the selected interactions, and finally iii) parse and purify the results to get a monotone GAMI model. Simulated datasets are used to demonstrate the behaviors of mono-GAMI-Tree and EBM, both of which use piecewise constant fits. Note that the monotonicity requirement is for the full model. Under certain situations, the main effects will also be monotone. But, as seen in the examples, the interactions will not be monotone.

artificial intelligence, interaction, machine learning, (16 more...)

2309.02426

Country: North America > United States (0.04)

Genre: Research Report (0.40)

Industry: Transportation (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

arXiv.org Artificial Intelligence, Sep-4-2023

Customs Import Declaration Datasets

Jeong, Chaeyoon, Kim, Sundong, Park, Jaewoo, Choi, Yeonsoo

Given the huge volume of cross-border flows, effective and efficient control of trade becomes more crucial in protecting people and society from illicit trade. However, limited accessibility of transaction-level trade datasets hinders the progress of open research, and many customs administrations have not benefited from the recent progress in data-based risk management. In this paper, we introduce an import declaration dataset to facilitate the collaboration between domain experts in customs administrations and researchers from diverse domains, such as data science and machine learning. The dataset contains 54,000 artificially generated trades with 22 key attributes, and it is synthesized with conditional tabular GAN while maintaining correlated features. Synthetic data has several advantages. First, releasing the dataset is free from restrictions that do not allow disclosing the original import data. The fabrication step minimizes the possible identity risk which may exist in trade statistics. Second, the published data follow a similar distribution to the source data so that it can be used in various downstream tasks. Hence, our dataset can be used as a benchmark for testing the performance of any classification algorithm. Along with the data and its generation process, we release baseline code for fraud detection tasks, as we empirically show that more advanced algorithms can better detect fraud.

dataset, declaration, synthetic data, (13 more...)

2208.02484

Country:

North America > United States > California > Los Angeles County > Long Beach (0.05)
Asia > South Korea > Daejeon > Daejeon (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > South Korea > Gwangju > Gwangju (0.04)

Genre:

Research Report (0.65)
Instructional Material (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government (0.95)
Law Enforcement & Public Safety > Fraud (0.71)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Krupkin, Ian, Hardin, Johanna

Prediction Error Estimation in Random Forests

arXiv.org Machine Learning, Sep-1-2023

In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. (2023), which were given for logistic regression. We further show that this result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.

artificial intelligence, machine learning, random forest, (15 more...)

2309.00736

Country: North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report > New Finding (0.50)
Research Report > Experimental Study (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.79)

arXiv.org Artificial Intelligence, Aug-30-2023

Improving Robustness and Accuracy of Ponzi Scheme Detection on Ethereum Using Time-Dependent Features

Huynh, Phuong Duy, Dau, Son Hoang, Li, Xiaodong, Luong, Phuc, Viterbo, Emanuele

The rapid development of blockchain has led to more and more funding pouring into the cryptocurrency market, which has also attracted cybercriminals' interest in recent years. The Ponzi scheme, an old-fashioned fraud, is now popular on the blockchain, causing considerable financial losses to many crypto-investors. A few Ponzi detection methods have been proposed in the literature, most of which detect a Ponzi scheme based on its smart contract source code or opcode. The contract-code-based approach, while achieving very high accuracy, is not robust: first, the source code of a majority of contracts on Ethereum is not available, and second, a Ponzi developer can fool a contract-code-based detection model by obfuscating the opcode or inventing a new profit distribution logic that cannot be detected (since these models were trained on existing Ponzi logics only). A transaction-based approach could improve the robustness of detection because transactions, unlike smart contracts, are harder to manipulate. However, the current transaction-based detection models achieve fairly low accuracy. We address this gap in the literature by developing new detection models that rely only on the transactions, hence guaranteeing the robustness, and moreover, achieve considerably higher Accuracy, Precision, Recall, and F1-score than existing transaction-based models. This is made possible thanks to the introduction of novel time-dependent features that capture the characteristics of Ponzi behaviour derived from our comprehensive data analyses on Ponzi and non-Ponzi data from the XBlock-ETH repository.

application, contract, transaction, (15 more...)

2308.16391

Country:

North America > United States > Hawaii (0.04)
Africa > Middle East > Djibouti > Arta > `Arta (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Banking & Finance > Trading (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)