AITopics | Wolkenhauer, Olaf

Collaborating Authors

Wolkenhauer, Olaf

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Preserving logical and functional dependencies in synthetic tabular data

Umesh, Chaithra, Schultz, Kristian, Mahendra, Manjunath, Bej, Saparshi, Wolkenhauer, Olaf

arXiv.org Artificial IntelligenceSep-26-2024

Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among the attributes in this article. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. In addition, we also showed that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state-of-the-art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models. Keywords: Synthetic tabular data, Logical dependencies, Functional dependencies, Generative models 1. Introduction Dependencies among attributes are a common aspect of tabular data. A well-known fact in Database Management Systems is that if one wants to remove redundancies by dividing larger tables into smaller ones (Normalization) [1], one needs tools to identify functional dependencies present among the attributes of the larger table [2]. Preserving functional dependencies in synthetic tabular data is an area that has not been explored. Dependencies exist in both tabular and image data.

data mining, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2409.17684

Country:

Asia (0.93)
Europe > Germany > Bavaria > Upper Bavaria (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (0.71)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback

Convex space learning for tabular synthetic data generation

Mahendra, Manjunath, Umesh, Chaithra, Bej, Saptarshi, Schultz, Kristian, Wolkenhauer, Olaf

arXiv.org Artificial IntelligenceJul-13-2024

Generating synthetic samples from the convex space of the minority class is a popular oversampling approach for imbalanced classification problems. Recently, deep-learning approaches have been successfully applied to modeling the convex space of minority samples. Beyond oversampling, learning the convex space of neighborhoods in training data has not been used to generate entire tabular datasets. In this paper, we introduce a deep learning architecture (NextConvGeN) with a generator and discriminator component that can generate synthetic samples by learning to model the convex space of tabular data. The generator takes data neighborhoods as input and creates synthetic samples within the convex space of that neighborhood. Thereafter, the discriminator tries to classify these synthetic samples against a randomly sampled batch of data from the rest of the data space. We compared our proposed model with five state-of-the-art tabular generative models across ten publicly available datasets from the biomedical domain. Our analysis reveals that synthetic samples generated by NextConvGeN can better preserve classification and clustering performance across real and synthetic data than other synthetic data generation models. Synthetic data generation by deep learning of the convex space produces high scores for popular utility measures. We further compared how diverse synthetic data generation strategies perform in the privacy-utility spectrum and produced critical arguments on the necessity of high utility models. Our research on deep learning of the convex space of tabular data opens up opportunities in clinical research, machine learning model development, decision support systems, and clinical data sharing.

deep learning, machine learning, tabular synthetic data generation, (2 more...)

arXiv.org Artificial Intelligence

2407.09789

Genre: Research Report > Experimental Study (0.53)

Industry: Health & Medicine (0.53)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multivariate Functional Linear Discriminant Analysis for the Classification of Short Time Series with Missing Data

Bordoloi, Rahul, Réda, Clémence, Trautmann, Orell, Bej, Saptarshi, Wolkenhauer, Olaf

arXiv.org Artificial IntelligenceFeb-20-2024

Functional linear discriminant analysis (FLDA) is a powerful tool that extends LDA-mediated multiclass classification and dimension reduction to univariate time-series functions. However, in the age of large multivariate and incomplete data, statistical dependencies between features must be estimated in a computationally tractable way, while also dealing with missing data. There is a need for a computationally tractable approach that considers the statistical dependencies between features and can handle missing values. We here develop a multivariate version of FLDA (MUDRA) to tackle this issue and describe an efficient expectation/conditional-maximization (ECM) algorithm to infer its parameters. We assess its predictive power on the "Articulary Word Recognition" data set and show its improvement over the state-of-the-art, especially in the case of missing data. MUDRA allows interpretable classification of data sets with large proportions of missing data, which will be particularly useful for medical or psychological data sets.

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2402.13103

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Germany > Bavaria > Upper Bavaria (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Discriminant Analysis (0.61)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

A multi-schematic classifier-independent oversampling approach for imbalanced datasets

Bej, Saptarshi, Schultz, Kristian, Srivastava, Prashant, Wolfien, Markus, Wolkenhauer, Olaf

arXiv.org Artificial IntelligenceJul-15-2021

Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been built over the past two decades, to solve the problem of imbalanced datasets. However, it has been evident from previous studies that different oversampling algorithms have different degrees of efficiency with different classifiers. With numerous algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach: ProWRAS(Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS)algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples, as well as a proximity-weighted clustering system of the minority classdata, the ProWRAS algorithm improves performance, compared to algorithms that generate synthetic samples through modelling high dimensional convex spaces of the minority class. ProWRAS has four oversampling schemes, each of which has its unique way to model the variance of the generated data. Most importantly, the performance of ProWRAS with proper choice of oversampling schemes, is independent of the classifier used. We have benchmarked our newly developed ProWRAS algorithm against five sate-of-the-art oversampling models and four different classifiers on 20 publicly available datasets. ProWRAS outperforms other oversampling algorithms in a statistically significant way, in terms of both F1-score and Kappa-score. Moreover, we have introduced a novel measure for classifier independence I-score, and showed quantitatively that ProWRAS performs better, independent of the classifier used. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking efforts.

algorithm, artificial intelligence, information management, (20 more...)

arXiv.org Artificial Intelligence

2107.07349

Country: Europe > Germany > Bavaria > Upper Bavaria (0.14)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.47)

Industry: Education (0.67)

Technology:

Information Technology > Information Management (0.93)
Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

LoRAS: An oversampling approach for imbalanced datasets

Bej, Saptarshi, Davtyan, Narek, Wolfien, Markus, Nassar, Mariam, Wolkenhauer, Olaf

arXiv.org Machine LearningAug-23-2019

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely-used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and effecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm with 28 publicly available datasets and show that that drawing samples from an approximated data manifold of the minority class is the key to successful oversampling. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions and observed that for imbalanced datasets LoRAS, on average generates better Machine Learning (ML) models in terms of F1-score and Balanced Accuracy. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate to mean of the underlying local data distribution of the minority class data space.

dataset, health & medicine, oncology, (20 more...)

arXiv.org Machine Learning

1908.08346

Country:

North America > United States (0.14)
Europe > Germany (0.14)

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback