Interpretable Generalized Additive Models for Datasets with Missing Values
Many important datasets contain samples that are missing one or more feature values. Maintaining the interpretability of machine learning models in the presence of such missing data is challenging. Singly or multiply imputing missing values complicates the model's mapping from features to labels. On the other hand, reasoning on indicator variables that represent missingness introduces a potentially large number of additional terms, sacrificing sparsity. We solve these problems with M-GAM, a sparse, generalized, additive modeling approach that incorporates missingness indicators and their interaction terms while maintaining sparsity through \ell_0 regularization.
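The augmented feature representation that such an approach reasons over can be sketched in a few lines: original features, binary missingness indicators, and their interaction terms. This is a minimal illustration of the representation only, not the authors' code; the \ell_0-regularized fitting itself requires a specialized solver and is omitted.

```python
import numpy as np

def augment_with_missingness(X):
    """Build an augmented design matrix containing:
    - the original features with missing entries zero-filled,
    - binary missingness indicators, and
    - feature x indicator interaction terms."""
    M = np.isnan(X).astype(float)         # missingness indicators
    X_filled = np.nan_to_num(X, nan=0.0)  # zero-fill so all terms are defined
    interactions = X_filled[:, :, None] * M[:, None, :]  # x_j * m_k terms
    return np.hstack([X_filled, M, interactions.reshape(len(X), -1)])

X = np.array([[1.0, np.nan], [2.0, 3.0]])
Z = augment_with_missingness(X)
print(Z.shape)  # (2, 8): 2 features + 2 indicators + 4 interactions
```

A sparse model fit on this matrix can then select only the few indicator or interaction terms that actually matter.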
Coresets for Clustering with Missing Values
We provide the first coreset for clustering points in \mathbb{R}^d that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like k-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an \epsilon-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that (1+\epsilon)-approximates the clustering objective for every possible center set. Our coresets for k-Means and k-Median clustering have size (jk)^{O(\min(j,k))} (\epsilon^{-1} d \log n)^2, where n is the number of data points, d is the dimension, and j is the maximum number of missing coordinates per data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for k-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets.
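The importance-sampling idea behind such constructions can be sketched as follows: score each point by its cost to a rough center set (evaluated only on observed coordinates), sample proportionally to that score mixed with a uniform term, and reweight so the sample is unbiased. This is a generic sketch under those assumptions, not the paper's exact sensitivity bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_sq_dist(x, c):
    """Squared distance restricted to x's observed (non-NaN) coordinates."""
    obs = ~np.isnan(x)
    return np.sum((x[obs] - c[obs]) ** 2)

def importance_sample_coreset(X, centers, m):
    # Sensitivity proxy: each point's cost to its nearest rough center,
    # mixed with a uniform term so no sampling probability is zero.
    costs = np.array([min(masked_sq_dist(x, c) for c in centers) for x in X])
    p = 0.5 * costs / costs.sum() + 0.5 / len(X)
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])  # reweight so the sample is unbiased
    return X[idx], weights

X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing coordinates
centers = X[:5].copy(); centers[np.isnan(centers)] = 0.0
S, w = importance_sample_coreset(X, centers, m=30)
print(S.shape, w.shape)
```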
Impact of Missing Values in Machine Learning: A Comprehensive Analysis
Ahmad, Abu Fuad, Sayeed, Md Shohel, Alshammari, Khaznah, Ahmed, Istiaque
Machine learning (ML) has become a ubiquitous tool across various domains of data mining and big data analysis. The efficacy of ML models depends heavily on high-quality datasets, which are often complicated by the presence of missing values. Consequently, the performance and generalization of ML models are at risk in the face of such datasets. This paper aims to examine the nuanced impact of missing values on ML workflows, including their types, causes, and consequences. Our analysis focuses on the challenges posed by missing values, including biased inferences, reduced predictive power, and increased computational burdens. The paper further explores strategies for handling missing values, including imputation techniques and removal strategies, and investigates how missing values affect model evaluation metrics and introduce complexities in cross-validation and model selection. The study employs case studies and real-world examples to illustrate the practical implications of addressing missing values. Finally, the discussion extends to future research directions, emphasizing the need for handling missing values ethically and transparently. The primary goal of this paper is to provide insights into the pervasive impact of missing values on ML models and guide practitioners toward effective strategies for achieving robust and reliable model outcomes.
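The two families of strategies the survey contrasts, removal and imputation, can be illustrated with a toy example (numpy used here for illustration; the trade-off is that deletion discards data while imputation distorts the distribution):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# Removal (listwise deletion): keep only complete rows.
complete = X[~np.isnan(X).any(axis=1)]

# Imputation: replace each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
imputed = np.where(np.isnan(X), col_means, X)

print(complete.shape)  # (2, 2) - half the rows are discarded
print(imputed)
```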
Can time series forecasting be automated? A benchmark and analysis
Sreedhara, Anvitha Thirthapura, Vanschoren, Joaquin
In the field of machine learning and artificial intelligence, time series forecasting plays a pivotal role across various domains such as finance, healthcare, and weather. However, selecting the most suitable forecasting method for a given dataset is a complex task due to the diversity of data patterns and characteristics. This research aims to address this challenge by proposing a comprehensive benchmark for evaluating and ranking time series forecasting methods across a wide range of datasets. This study investigates the comparative performance of many methods from two prominent time series forecasting frameworks, AutoGluon-TimeSeries and sktime, to shed light on their applicability in different real-world scenarios. This research contributes to the field of time series forecasting by providing a robust benchmarking methodology and facilitating informed decision-making when choosing forecasting methods for optimal predictions.
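The core of such a benchmark, scoring several baselines on a held-out split and ranking them per dataset, can be sketched in plain numpy (a toy illustration, not the benchmark's actual protocol or the frameworks' APIs):

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Toy series with a trend: which baseline wins depends on the data pattern.
y = np.arange(20, dtype=float)
train, test = y[:15], y[15:]

naive_pred = np.full_like(test, train[-1])     # repeat last observed value
mean_pred  = np.full_like(test, train.mean())  # repeat training mean
drift_pred = train[-1] + (np.arange(1, len(test) + 1)
                          * (train[-1] - train[0]) / (len(train) - 1))

scores = {"naive": mae(test, naive_pred),
          "mean":  mae(test, mean_pred),
          "drift": mae(test, drift_pred)}
print(min(scores, key=scores.get))  # drift: on a trending series, drift wins
```

Running the same comparison over many datasets and aggregating the ranks is exactly the kind of evidence a benchmark like this provides.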
Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets
Caruso, Camillo Maria, Soda, Paolo, Guarrasi, Valerio
Handling missing values in tabular datasets presents a significant challenge in training and testing artificial intelligence models, an issue usually addressed using imputation techniques. Here we introduce "Not Another Imputation Method" (NAIM), a novel transformer-based model specifically designed to address this issue without the need for traditional imputation techniques. NAIM employs feature-specific embeddings and a masked self-attention mechanism that effectively learns from available data, thus avoiding the necessity to impute missing values. Additionally, a novel regularization technique is introduced to enhance the model's generalization capability from incomplete data. We extensively evaluated NAIM on 5 publicly available tabular datasets, demonstrating its superior performance over 6 state-of-the-art machine learning models and 4 deep learning models, each paired with 3 different imputation techniques when necessary. The results highlight the efficacy of NAIM in improving predictive performance and resilience in the presence of missing data. To facilitate further research and practical application in handling missing data without traditional imputation methods, we made the code for NAIM available at https://github.com/cosbidev/NAIM.
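The masked self-attention idea, letting the model attend only to features that are actually present, can be sketched in numpy. This is a minimal sketch of the mechanism, not NAIM's implementation; the feature embeddings and mask here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(E, observed):
    """E: (n_features, d) per-feature embeddings; observed: boolean mask.
    Missing features get -inf attention scores, so every output is a
    weighted combination of observed features only."""
    d = E.shape[1]
    scores = E @ E.T / np.sqrt(d)
    scores[:, ~observed] = -np.inf  # never attend to missing features
    A = softmax(scores, axis=1)
    return A @ E, A

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))
observed = np.array([True, False, True, True])  # feature 1 is missing
out, A = masked_self_attention(E, observed)
print(np.allclose(A[:, 1], 0.0))  # True: missing feature gets zero weight
```

Because no value is ever substituted for the missing feature, the model avoids imputation entirely, which is the paper's central point.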
A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base
Toossi, Hasti, Huai, Guo Qing, Liu, Jinyu, Khiu, Eric, Doğruöz, A. Seza, Lee, En-Shiun Annie
In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL's ambiguity in calculating language distances and in handling missing values. Moreover, we find that URIEL does not provide any information about typological features for 31\% of the languages it represents, undermining the reliability of the database, particularly for low-resource languages. Our literature review suggests URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database, as the effectiveness of these works depends on the reliability of the information the tool provides.
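One concrete way the missing-value ambiguity manifests is in distance computation: should a distance be computed over all features, or only over features defined for both languages? A sketch of the latter convention (illustrative only; URIEL/lang2vec's actual behavior is precisely what the paper finds under-specified):

```python
import numpy as np

def shared_feature_distance(u, v):
    """Cosine distance computed only on features defined for both
    languages (NaN marks a feature with no recorded value)."""
    shared = ~np.isnan(u) & ~np.isnan(v)
    if not shared.any():
        return np.nan  # no shared features: no basis for comparison
    a, b = u[shared], v[shared]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return np.nan
    return 1.0 - (a @ b) / denom

u = np.array([1.0, 0.0, np.nan, 1.0])
v = np.array([1.0, np.nan, 1.0, 1.0])
print(shared_feature_distance(u, v))  # 0.0: identical on features 0 and 3
```

Note how a language with 0% feature coverage yields NaN rather than a spurious distance, exactly the failure mode the 31% figure makes pressing.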
A Missing Value Filling Model Based on Feature Fusion Enhanced Autoencoder
Liu, Xinyao, Du, Shengdong, Li, Tianrui, Teng, Fei, Yang, Yan
With the advent of the big data era, the data quality problem is becoming more critical. Among many factors, data with missing values is one primary issue, and thus developing effective imputation models is a key topic in the research community. Recently, a major research direction has been to employ neural network models such as self-organizing maps or autoencoders for filling missing values. However, these classical methods can hardly discover interrelated features and common features simultaneously among data attributes. In particular, a very typical problem for classical autoencoders is that they often learn invalid constant mappings, which dramatically hurts the filling performance. To solve the above-mentioned problems, we propose a missing-value-filling model based on a feature-fusion-enhanced autoencoder. We first incorporate into an autoencoder a hidden layer that consists of de-tracking neurons and radial basis function neurons, which can enhance the ability to learn interrelated features and common features. In addition, we develop a missing value filling strategy based on dynamic clustering that is incorporated into an iterative optimization process. This design enhances the multi-dimensional feature fusion ability and thus improves the dynamic collaborative missing-value-filling performance. The effectiveness of the proposed model is validated by extensive experiments compared to a variety of baseline methods on thirteen data sets.
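The dynamic-clustering fill loop can be illustrated with a k-means-style sketch: assign each row to its nearest centroid using only observed coordinates, fill the missing entries from that centroid, and recompute centroids. This is a simplified stand-in for the paper's method (no autoencoder, no de-tracking or RBF neurons), intended only to show the iterative fill-and-recluster pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_fill(X, k=2, iters=10):
    """Iteratively fill missing entries from the nearest cluster centroid."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)        # mean-initialize
    centroids = filled[rng.choice(len(X), k, replace=False)]  # random init
    for _ in range(iters):
        # Distances on observed coordinates only.
        diff = filled[:, None, :] - centroids[None, :, :]
        diff[np.broadcast_to(mask[:, None, :], diff.shape)] = 0.0
        labels = (diff ** 2).sum(-1).argmin(1)
        filled = np.where(mask, centroids[labels], X)         # refill
        centroids = np.array([filled[labels == j].mean(0)
                              if (labels == j).any() else centroids[j]
                              for j in range(k)])
    return filled

X = np.array([[0.0, 0.1], [0.1, np.nan], [5.0, 5.1], [np.nan, 5.0]])
print(cluster_fill(X))
```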
Counterfactual Explanation with Missing Values
Kanamori, Kentaro, Takagi, Takuya, Kobayashi, Ken, Ike, Yuichi
Counterfactual Explanation (CE) is a post-hoc explanation method that provides a perturbation for altering the prediction result of a classifier. Users can interpret the perturbation as an "action" to obtain their desired decision results. Existing CE methods require complete information on the features of an input instance. However, we often encounter missing values in a given instance, and the previous methods do not work in such a practical situation. In this paper, we first empirically and theoretically show the risk that missing value imputation methods affect the validity of an action, as well as the features that the action suggests changing. Then, we propose a new framework of CE, named Counterfactual Explanation by Pairs of Imputation and Action (CEPIA), that enables users to obtain valid actions even with missing values and clarifies how actions are affected by imputation of the missing values. Specifically, our CEPIA provides a representative set of pairs of an imputation candidate for a given incomplete instance and its optimal action. We formulate the problem of finding such a set as a submodular maximization problem, which can be solved by a simple greedy algorithm with an approximation guarantee. Experimental results demonstrated the efficacy of our CEPIA in comparison with the baselines in the presence of missing values.
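The greedy algorithm with an approximation guarantee that the paper invokes is the classic one for monotone submodular maximization: repeatedly add the candidate with the largest marginal gain. A generic sketch on a toy coverage objective (CEPIA's actual objective is over imputation-action pairs, which this deliberately simplifies):

```python
def greedy_max(candidates, f, budget):
    """Greedy for monotone submodular maximization: repeatedly add the
    candidate with the largest marginal gain. For monotone submodular f
    this achieves a (1 - 1/e) approximation."""
    chosen = []
    for _ in range(budget):
        gains = [(f(chosen + [c]) - f(chosen), c)
                 for c in candidates if c not in chosen]
        if not gains:
            break
        best_gain, best = max(gains, key=lambda t: t[0])
        if best_gain <= 0:
            break
        chosen.append(best)
    return chosen

# Toy coverage objective: how many distinct elements the chosen sets cover.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
f = lambda chosen: len(set().union(*(sets[c] for c in chosen))) if chosen else 0
print(greedy_max(list(sets), f, budget=2))  # ['c', 'a']
```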
Data Preprocessing with scikit-learn -- Missing Values
By popular demand from my previous article, in this tutorial I illustrate how to preprocess data using scikit-learn, a Python library for machine learning. Data preprocessing transforms data into a format that is more suitable for estimators. In my previous articles I illustrated how to deal with missing values, normalization, standardization, formatting and binning with Python pandas. In this tutorial I show you how to deal with missing values with scikit-learn. I will cover the other preprocessing techniques in scikit-learn in future posts.
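The workhorse for this in scikit-learn is `SimpleImputer` from `sklearn.impute`, which learns a per-column statistic on `fit` and substitutes it for missing entries on `transform`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0],
              [4.0, np.nan],
              [np.nan, 6.0]])

# Replace each missing entry with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# [[7.  2. ]
#  [4.  4. ]
#  [5.5 6. ]]
```

Besides `"mean"`, the `strategy` parameter also accepts `"median"`, `"most_frequent"`, and `"constant"` (with `fill_value`).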
Recurrent Neural Networks for Multivariate Time Series with Missing Values
Gated Recurrent Units (GRUs) are gating mechanisms introduced in 2014 by Cho et al. Unlike LSTMs, which have 3 gates, GRUs use 2 gates to process time series data. The main structure can be seen in Figure 3; for further understanding, Understanding GRU Networks is highly recommended. If you want to understand LSTMs and GRUs in one place, the article Illustrated Guide to LSTM's and GRU's: A step by step explanation is recommended.
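The two gates can be made concrete with a single GRU step in numpy (a bare-bones sketch with randomly initialized weights and no biases, following one common formulation of the update rule):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step with its two gates:
    the update gate z decides how much of the old state to keep,
    the reset gate r decides how much of it feeds the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # blend old state and candidate

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for t in range(5):                             # run over a short sequence
    h = gru_cell(rng.normal(size=d_in), h, params)
print(h.shape)  # (4,)
```

With only z and r (versus the LSTM's input, forget, and output gates), the GRU has fewer parameters per hidden unit, which is its main practical appeal for time series data.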