training subset
MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation
Statistical learning methods for automated variable selection, such as LASSO, elastic nets, or gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which creates several completed datasets. However, there is an ongoing debate on how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches modify the regularization methods LASSO and elastic nets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets. Simulation studies suggest that our approach yields prediction performance comparable to that of these recently proposed methods.
- North America (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Germany > Rheinland-Pfalz > Mainz (0.04)
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
Chen, Guoguo, Chai, Shuzhou, Wang, Guanbo, Du, Jiayu, Zhang, Wei-Qiang, Weng, Chao, Su, Dan, Povey, Daniel, Trmal, Jan, Zhang, Junbo, Jin, Mingjie, Khudanpur, Sanjeev, Watanabe, Shinji, Zhao, Shuaijiang, Zou, Wei, Li, Xiangang, Yao, Xuchen, Wang, Yongqing, Wang, Yujun, You, Zhao, Yan, Zhiyong
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
- Health & Medicine (0.46)
- Media (0.36)
Exploring Gender Disparities in Automatic Speech Recognition Technology
ElGhazaly, Hend, Mirheidari, Bahman, Moosavi, Nafise Sadat, Christensen, Heidi
This study investigates factors influencing Automatic Speech Recognition (ASR) systems' fairness and performance across genders, beyond the conventional examination of demographics. Using the LibriSpeech dataset and the Whisper small model, we analyze how performance varies across different gender representations in training data. Our findings suggest a complex interplay between the gender ratio in training data and ASR performance. Optimal fairness occurs at specific gender distributions rather than a simple 50-50 split. Furthermore, our findings suggest that factors like pitch variability can significantly affect ASR accuracy. This research contributes to a deeper understanding of biases in ASR systems, highlighting the importance of carefully curated training data in mitigating gender bias.
Black-box optimization and quantum annealing for filtering out mislabeled training instances
Otsuka, Makoto, Kodama, Kento, Morita, Keisuke, Ohzeki, Masayuki
This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method's ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave's clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij's simulated quantum annealing sampler or Neal's simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Burnaby (0.04)
- Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
- (2 more...)
An Extensive Study on D2C: Overfitting Remediation in Deep Learning Using a Decentralized Approach
Siddiqui, Md. Saiful Bari, Islam, Md Mohaiminul, Alam, Md. Golam Rabiul
Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, we propose Divide2Conquer (D2C), a novel technique to mitigate overfitting. D2C partitions the training data into multiple subsets and trains identical models independently on each subset. To balance model generalization and subset-specific learning, the model parameters are periodically aggregated and averaged during training. This process enables the learning of robust patterns while minimizing the influence of outliers and noise. Empirical evaluations on benchmark datasets across diverse deep-learning tasks demonstrate that D2C significantly enhances generalization performance, particularly with larger datasets. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting D2C's effectiveness both as a standalone technique and in combination with other overfitting reduction methods. We further provide a rigorous mathematical justification for D2C's underlying principles and examine its applicability across multiple domains. Finally, we explore the trade-offs associated with D2C and propose strategies to address them, offering a holistic view of its strengths and limitations. This study establishes D2C as a versatile and effective approach to combating overfitting in deep learning. Our codes are publicly available at: https://github.com/Saiful185/Divide2Conquer.
- North America > Canada > Ontario > Toronto (0.28)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- North America > United States (0.04)
- (2 more...)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models
Tayaranian, Mohammadreza, Mozafari, Seyyed Hasan, Meyer, Brett H., Clark, James J., Gross, Warren J.
Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average 3 smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 language models show that, on average, fine-tuning on the winning ticket subsets results in a 0.1% increase in the evaluation performance of the model. Transformer-based language models have shown state-of-the-art performance in various natural language understanding tasks (Liu et al., 2019; Raffel et al., 2020). These models are commonly used in a transfer learning setup in which they are first pre-trained on general textual data and then transferred by fine-tuning their parameters on the training set of each downstream task. The goal of fine-tuning is to maximise the model's performance on the evaluation set However, different data points in the fine-tuning dataset have different contributions in achieving this goal (Katharopoulos & Fleuret, 2018).
- North America > Canada > Quebec > Montreal (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
On the Inflation of KNN-Shapley Value
Yang, Ziao, Yue, Han, Chen, Jian, Liu, Hongfu
Shapley value-based data valuation methods, originating from cooperative game theory, quantify the usefulness of each individual sample by considering its contribution to all possible training subsets. Despite their extensive applications, these methods encounter the challenge of value inflation - while samples with negative Shapley values are detrimental, some with positive values can also be harmful. This challenge prompts two fundamental questions: the suitability of zero as a threshold for distinguishing detrimental from beneficial samples and the determination of an appropriate threshold. To address these questions, we focus on KNN-Shapley and propose Calibrated KNN-Shapley (CKNN-Shapley), which calibrates zero as the threshold to distinguish detrimental samples from beneficial ones by mitigating the negative effects of small-sized training subsets. Through extensive experiments, we demonstrate the effectiveness of CKNN-Shapley in alleviating data valuation inflation, detecting detrimental samples, and assessing data quality. We also extend our approach beyond conventional classification settings, applying it to diverse and practical scenarios such as learning with mislabeled data, online learning with stream data, and active learning for label annotation.
- Europe > France (0.04)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Character-level Convolutional Networks for Text Classification
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several largescale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Russia (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (3 more...)
Transformer-based Causal Language Models Perform Clustering
Even though large language models (LLMs) have demonstrated remarkable capability in solving various natural language tasks, the capability of an LLM to follow human instructions is still a concern. Recent works have shown great improvements in the instruction-following capability via additional training for instruction-following tasks. However, the mechanisms responsible for effective instruction-following capabilities remain inadequately understood. Here, we introduce a simplified instruction-following task and use synthetic datasets to analyze a Transformer-based causal language model. Our findings suggest that the model learns task-specific information by clustering data within its hidden space, with this clustering process evolving dynamically during learning. We also demonstrate how this phenomenon assists the model in handling unseen instances, and validate our results in a more realistic setting. Furthermore, we present inspired applications regarding pre-training and alignment.
- North America > United States > New York (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
ModelPred: A Framework for Predicting Trained Model from Training Data
Zeng, Yingyan, Wang, Jiachen T., Chen, Si, Just, Hoang Anh, Jin, Ran, Jia, Ruoxi
In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] as we aim for predicting the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
- North America > United States > Virginia (0.04)
- Europe > France (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)