AITopics | k-fold cross-validation

k-fold cross-validation

Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment

Li, Jia, Wang, Yang, Qian, Wenhao, Hu, Jialong, Hu, Zhenzhen, Hong, Richang, Wang, Meng

arXiv.org Artificial Intelligence, Aug-6-2025

Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
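The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_pool` and `average_mse` are hypothetical, and the assumption that "multi-dimensional average MSE" means the per-dimension MSE averaged over dimensions is ours.

```python
from statistics import mean


def mean_pool(per_response_scores):
    """Average per-response predictions into one score per dimension.

    per_response_scores: one list per response, each holding one
    predicted score per evaluation dimension.
    """
    n_dims = len(per_response_scores[0])
    return [mean(resp[d] for resp in per_response_scores) for d in range(n_dims)]


def average_mse(predictions, targets):
    """Per-dimension MSE over candidates, averaged over dimensions.

    Assumption: this is one plausible reading of "multi-dimensional
    average MSE"; the challenge's exact definition is not given here.
    """
    n_dims = len(targets[0])
    per_dim = [
        mean((p[d] - t[d]) ** 2 for p, t in zip(predictions, targets))
        for d in range(n_dims)
    ]
    return mean(per_dim)


if __name__ == "__main__":
    # Two responses, two dimensions (toy numbers, not from the paper).
    pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
    print(pooled, average_mse([pooled], [[2.0, 2.0]]))
```

In the framework as described, the pooled input would have six responses and five dimensions per candidate; the toy example uses smaller shapes only to keep the arithmetic easy to follow.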

data mining, dimension, machine learning, (18 more...)

2507.22676

Country: Asia > China (0.17)

Genre: Research Report (0.82)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.82)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.71)

A New Flexible Train-Test Split Algorithm, an approach for choosing among the Hold-out, K-fold cross-validation, and Hold-out iteration

Bami, Zahra, Behnampour, Ali, Doosti, Hassan

arXiv.org Artificial Intelligence, Jan-11-2025

Artificial intelligence has transformed industries such as engineering, medicine, and finance. Predictive models use supervised learning, a vital subset of machine learning. Cross-validation, which is crucial for model evaluation, includes re-substitution, hold-out, and K-fold. This study focuses on improving the accuracy of ML algorithms across three different datasets. To evaluate the Hold-out, Hold-out with iteration, and K-fold Cross-Validation techniques, we created a flexible Python program. By modifying parameters such as test size, Random State, and 'k' values, we were able to improve accuracy assessment. The outcomes demonstrate the persistent superiority of the Hold-out validation method, particularly with a test size of 10%. Across iterations and Random State settings, hold-out with iteration shows little variance in accuracy. The results also vary by algorithm: Decision Tree performs best on the Framingham dataset, while Naive Bayes and K Nearest Neighbors perform best on the COVID-19 dataset. Different datasets require different optimal K values in K-Fold Cross Validation. This study challenges the universality of K values in K-Fold Cross Validation and suggests a 10% test size and 90% training size for better outcomes. It also emphasizes the contextual impact of dataset features, sample size, feature count, and selected methodologies. Researchers can adapt this code to their own datasets to obtain the highest accuracy under a specific evaluation scheme.
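The three splitting schemes compared in the abstract can be sketched with index bookkeeping alone. This is a standard-library illustration, not the authors' program; the function names and the default parameter values (`test_size=0.1`, `k=5`) are assumptions chosen for the example.

```python
import random


def holdout_split(n_samples, test_size=0.1, random_state=0):
    """Return (train_idx, test_idx) for a single shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = max(1, round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


def repeated_holdout(n_samples, test_size=0.1, n_iter=5):
    """Hold-out with iteration: one split per random state 0..n_iter-1."""
    return [holdout_split(n_samples, test_size, seed) for seed in range(n_iter)]


def kfold_splits(n_samples, k=5, random_state=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    train, test = holdout_split(100, test_size=0.1, random_state=42)
    print(len(train), len(test))
    for train, test in kfold_splits(10, k=5):
        print(sorted(test))
```

A model would be fitted on the rows in `train_idx` and scored on the rows in `test_idx`; averaging the scores over the splits gives the repeated hold-out or K-fold estimate.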

accuracy, artificial intelligence, machine learning, (19 more...)

2501.06492

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.91)
Health & Medicine > Therapeutic Area > Immunology (0.91)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

Hocking, Toby Dylan, Thibault, Gabrielle, Bodine, Cameron Scott, Arellano, Paul Nelson, Shenkin, Alexander F, Lindly, Olivia Jasmine

arXiv.org Machine Learning, Oct-11-2024

In many real-world applications of machine learning, we want to know whether it is possible to train on the data gathered so far and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc.). Another question is whether data subsets are similar enough that it is beneficial to combine them during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method that can be used to answer both questions. SOAK systematically compares models that are trained on different subsets of data and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on 6 new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
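A rough sketch of how the Same/Other/All training sets could be built for one test subset and one fold is given below. It is an illustration under stated assumptions, not the authors' code: the function name `soak_splits` is hypothetical, and we assume the held-out fold is excluded from every training set, including rows from other subsets.

```python
def soak_splits(subsets, folds, test_subset, test_fold):
    """Return test rows and Same/Other/All training rows as index lists.

    subsets[i] is the subset label of row i (e.g. a region or time period);
    folds[i] is the fold id assigned to row i.
    """
    rows = range(len(subsets))
    test = [i for i in rows
            if subsets[i] == test_subset and folds[i] == test_fold]
    # Assumption: rows in the held-out fold never appear in any training set.
    same = [i for i in rows
            if subsets[i] == test_subset and folds[i] != test_fold]
    other = [i for i in rows
             if subsets[i] != test_subset and folds[i] != test_fold]
    return {"test": test, "same": same, "other": other,
            "all": sorted(same + other)}


if __name__ == "__main__":
    subsets = ["A", "A", "A", "B", "B", "B"]
    folds = [0, 1, 2, 0, 1, 2]
    print(soak_splits(subsets, folds, test_subset="A", test_fold=0))
```

Comparing the test error of models trained on `same`, `other`, and `all` then indicates whether the other subsets carry patterns that transfer to the test subset.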

algorithm, prediction, subset, (13 more...)

2410.08643

Country:

North America > Canada > Quebec (0.04)
North America > United States > Mississippi > Jackson County > Pascagoula (0.04)
North America > United States > Arizona > Coconino County > Flagstaff (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.47)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Health & Medicine > Therapeutic Area (0.96)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Machine learning-based algorithms for at-home respiratory disease monitoring and respiratory assessment

Orangi-Fard, Negar, Bogdan, Alexandru, Sagreiya, Hersh

arXiv.org Artificial Intelligence, Sep-4-2024

Respiratory diseases impose a significant burden on global health, with current diagnostic and management practices primarily reliant on specialist clinical testing. This work aims to develop machine learning-based algorithms to facilitate at-home respiratory disease monitoring and assessment for patients undergoing continuous positive airway pressure (CPAP) therapy. Data were collected from 30 healthy adults, encompassing respiratory pressure, flow, and dynamic thoraco-abdominal circumferential measurements under three breathing conditions: normal, panting, and deep breathing. Various machine learning models, including the random forest classifier, logistic regression, and support vector machine (SVM), were trained to predict breathing types. The random forest classifier demonstrated the highest accuracy, particularly when incorporating breathing rate as a feature. These findings support the potential of AI-driven respiratory monitoring systems to transition respiratory assessments from clinical settings to home environments, enhancing accessibility and patient autonomy. Future work involves validating these models with larger, more diverse populations and exploring additional machine learning techniques.

algorithm, assessment, at-home respiratory disease monitoring, (10 more...)

2409.0318

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(3 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

No Unbiased Estimator of the Variance of K-Fold Cross-Validation

Neural Information Processing Systems, Feb-16-2024, 18:14:11 GMT

Most machine learning researchers perform quantitative experiments to estimate generalization error and compare algorithm performances. In order to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the estimation of uncertainty around the K-fold cross-validation estimator. The main theorem shows that there exists no universal unbiased estimator of the variance of K-fold cross-validation. An analysis based on the eigendecomposition of the covariance matrix of errors helps to better understand the nature of the problem and shows that naive estimators may grossly underestimate variance, as confirmed by numerical experiments.
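To make the notion of a naive estimator concrete, the sketch below computes the K-fold estimate and one common naive variance estimate from per-fold errors. Which naive estimators the paper analyzes is not stated in this summary, so treating the "sample variance of fold errors divided by K" as representative is an assumption, and the function names are hypothetical.

```python
from statistics import mean, variance


def cv_estimate(fold_errors):
    """K-fold cross-validation estimate: the mean of the K fold errors."""
    return mean(fold_errors)


def naive_variance(fold_errors):
    """Sample variance of the fold errors divided by K.

    This treats the K fold errors as independent. They are not: the
    training sets of different folds overlap, so the errors are
    correlated, and ignoring that correlation is one way an estimator
    of this kind can understate the true variance.
    """
    return variance(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    errors = [0.10, 0.20, 0.30]  # toy per-fold error rates
    print(cv_estimate(errors), naive_variance(errors))
```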

k-fold cross-validation, unbiased estimator, variance, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (0.97)

Classification on Hyperspectral Data

#artificialintelligence, Dec-18-2021, 09:46:08 GMT

The goal of this tutorial is to apply PCA to hyperspectral data. After reducing the dimensionality of the data with PCA, we apply a Support Vector Machine (SVM) to classify the different materials in the image. We use the MUUFL Gulfport hyperspectral dataset in this tutorial. The MUUFL Gulfport data include a pixel-based ground truth map that was produced by manually labeling the pixels in the scene. The following classes were labeled in the scene: trees, mostly grass, ground surface, mixed ground surface, dirt and sand, road, water, buildings, the shadow of buildings, sidewalk, yellow curb, cloth panels (targets), and unlabeled points.

classification, hyperspectral data, pca, (10 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.59)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.59)

Top 7 cross validation techniques with Python Code - Analytics Vidhya

#artificialintelligence, Nov-23-2021, 08:36:50 GMT

Not suitable for time series data: for time series data the order of the samples matters, but in Stratified Cross-Validation samples are selected in random order. LeavePOut cross-validation is an exhaustive cross-validation technique in which p samples are used as the validation set and the remaining n-p samples are used as the training set. Suppose we have 100 samples in the dataset. If we use p = 10, then in each iteration 10 samples are used as the validation set and the remaining 90 samples as the training set. This process is repeated until every possible split of the dataset into a validation set of p samples and a training set of n-p samples has been used. Every sample is therefore used for both training and validation.
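The leave-p-out procedure can be sketched with the standard library as follows. The function names are hypothetical, and the example is an illustration rather than the article's own code.

```python
from itertools import combinations
from math import comb


def leave_p_out(n_samples, p):
    """Yield (train_idx, val_idx) for every way of holding out p samples."""
    all_idx = set(range(n_samples))
    for val in combinations(range(n_samples), p):
        train = sorted(all_idx - set(val))
        yield train, list(val)


def n_splits(n_samples, p):
    """Number of leave-p-out splits: n choose p."""
    return comb(n_samples, p)


if __name__ == "__main__":
    for train, val in leave_p_out(4, 2):
        print(train, val)
    # The count grows combinatorially, which is why leave-p-out is
    # rarely run exhaustively on datasets of realistic size.
    print(n_splits(100, 10))
```

Because the number of splits is n choose p, the 100-sample, p = 10 case from the text is already far too large to enumerate in practice, whereas the 4-sample toy case has only six splits.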

dataset, training and validation, validation, (14 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)

A Guide to Selecting Machine Learning Models in Python

#artificialintelligence, Jun-18-2021, 06:18:54 GMT

Model testing is a key part of model building. When done correctly, testing ensures your model is stable and isn't overfit. The three most well-known methods of model testing are randomized train-test split, K-fold cross-validation, and leave-one-out cross-validation. Feature selection is another important part of model building, as it directly impacts model performance and interpretability. The simplest method of feature selection is manual, which is ideally guided by domain expertise.
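As a small worked case of leave-one-out cross-validation, the sketch below scores a deliberately trivial model that predicts the mean of the training targets. The model choice and the function name are assumptions made only to keep the example self-contained; the article itself does not prescribe them.

```python
def loo_mse_mean_predictor(y):
    """Leave-one-out MSE of a model that predicts the training mean.

    For each sample, the model is "fitted" on the other n-1 targets
    (their mean) and evaluated on the single held-out target.
    """
    n = len(y)
    total = sum(y)
    squared_errors = []
    for held_out in y:
        prediction = (total - held_out) / (n - 1)  # mean of the rest
        squared_errors.append((held_out - prediction) ** 2)
    return sum(squared_errors) / n


if __name__ == "__main__":
    print(loo_mse_mean_predictor([1.0, 2.0, 3.0]))
```

Leave-one-out is the limiting case of K-fold cross-validation with K equal to the number of samples, so it needs one model fit per sample.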

important part, model building, validation, (10 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.40)

Cancer Diagnosis Using Machine Learning

#artificialintelligence, Aug-24-2020, 10:55:37 GMT

Thus, we need machine learning algorithms that, using this knowledge base as a baseline, allow us to classify genetic variations automatically. The task is to classify the given genetic variations/mutations based on evidence from text-based clinical literature. We will be using datasets available on Kaggle, provided by Memorial Sloan Kettering Cancer Center (MSKCC). You can view details and download the datasets from here. Training_variants is a comma-separated file containing the description of the genetic information used for training.

algorithm, artificial intelligence, machine learning, (10 more...)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.71)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Real-World Machine Learning: Model Evaluation & Optimization

#artificialintelligence, May-21-2018, 21:28:38 GMT

The primary goal of supervised machine learning is accurate prediction. We want our ML model to be as accurate as possible when predicting on new data (for which the target variable is unknown). Said in a different way, we want our models, which have been built from some training data, to generalize well to new data. That way, when we deploy the model in production, we can be assured that the predictions generated are of high quality. Therefore, when we evaluate the performance of a model, we want to determine how well that model will perform on new data.

artificial intelligence, machine learning, new data, (14 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.43)
