 South America


Feature Selection for Imbalanced Data with Deep Sparse Autoencoders Ensemble

arXiv.org Machine Learning

Class imbalance is a common issue in many domain applications of learning algorithms. Oftentimes, in the same domains it is much more relevant to correctly classify and profile minority class observations. This need can be addressed by Feature Selection (FS), which offers several further advantages, such as decreasing computational costs and aiding inference and interpretability. However, traditional FS techniques may become sub-optimal in the presence of strongly imbalanced data. To achieve FS advantages in this setting, we propose a filtering FS algorithm that ranks feature importance on the basis of the Reconstruction Error of a Deep Sparse AutoEncoders Ensemble (DSAEE). We use each DSAE, trained only on the majority class, to reconstruct both classes. From the analysis of the aggregated Reconstruction Error, we determine the features where the minority class presents a different distribution of values with respect to the overrepresented one, thus identifying the most relevant features to discriminate between the two. We empirically demonstrate the efficacy of our algorithm in several experiments on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify the minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.
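The ranking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the DSAE ensemble is replaced by a trivial stand-in reconstructor (the majority-class feature means), a single model is used instead of an aggregated ensemble, and all function names and the toy data are hypothetical. It only shows how a per-feature reconstruction error gap between the two classes can yield a feature ranking.

```python
from statistics import mean


def fit_mean_reconstructor(majority_rows):
    """Stand-in for an autoencoder trained on the majority class only.

    ASSUMPTION: every row is "reconstructed" as the majority-class
    feature means; a real DSAE would learn a nonlinear reconstruction.
    """
    n_features = len(majority_rows[0])
    means = [mean(row[j] for row in majority_rows) for j in range(n_features)]
    return lambda row: means


def per_feature_error(reconstruct, rows):
    """Mean squared reconstruction error of each feature over the rows."""
    n_features = len(rows[0])
    return [
        mean((row[j] - reconstruct(row)[j]) ** 2 for row in rows)
        for j in range(n_features)
    ]


def rank_features(majority_rows, minority_rows):
    """Rank feature indices by how much worse the minority class is reconstructed."""
    reconstruct = fit_mean_reconstructor(majority_rows)
    err_major = per_feature_error(reconstruct, majority_rows)
    err_minor = per_feature_error(reconstruct, minority_rows)
    gaps = [mi - ma for mi, ma in zip(err_minor, err_major)]
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)


# Toy data (hypothetical): only feature 1 is distributed differently
# in the minority class.
majority = [[0.0, 1.0, 5.0], [0.2, 1.1, 5.1], [-0.1, 0.9, 4.9], [0.1, 1.0, 5.0]]
minority = [[0.1, 4.0, 5.0], [0.0, 4.2, 5.1]]

ranking = rank_features(majority, minority)
print(ranking)  # feature indices, most discriminative first
```

Features on which the minority class is reconstructed much worse than the majority class rise to the top of the ranking; how the errors are aggregated across ensemble members is left out of this sketch.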


Am I fit for this physical activity? Neural embedding of physical conditioning from inertial sensors

arXiv.org Artificial Intelligence

Inertial Measurement Unit (IMU) sensors are becoming increasingly ubiquitous in everyday devices such as smartphones and fitness watches. As a result, the array of health-related applications that tap into this data has been growing, as well as the importance of designing accurate prediction models for tasks such as human activity recognition (HAR). However, one important task that has received little attention is the prediction of an individual's heart rate when undergoing a physical activity using IMU data. This could be used, for example, to determine which activities are safe for a person without having them actually perform those activities. We propose a neural architecture for this task composed of convolutional and LSTM layers, similar to the state-of-the-art techniques for the closely related task of HAR. However, our model includes a convolutional network that extracts, based on sensor data from a previously executed activity, a physical conditioning embedding (PCE) of the individual to be used as the LSTM's initial hidden state. We evaluate the proposed model, dubbed PCE-LSTM, when predicting the heart rate of 23 subjects performing a variety of physical activities from IMU-sensor data available in public datasets (PAMAP2, PPG-DaLiA). For comparison, we use as baselines the only model specifically proposed for this task, and an adapted state-of-the-art model for HAR. PCE-LSTM yields over 10% lower mean absolute error. We demonstrate empirically that this error reduction is in part due to the use of the PCE. Lastly, we use two datasets (PPG-DaLiA, WESAD) to show that PCE-LSTM can also be successfully applied when photoplethysmography (PPG) sensors are available to rectify heart rate measurement errors caused by movement, outperforming the state-of-the-art deep learning baselines by more than 30%.


Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

arXiv.org Artificial Intelligence

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of the sentences are of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.


Learning Accurate Business Process Simulation Models from Event Logs via Automated Process Discovery and Deep Learning

arXiv.org Artificial Intelligence

Business process simulation is a well-known approach to estimate the impact of changes to a process with respect to time and cost measures -- a practice known as what-if process analysis. The usefulness of such estimations hinges on the accuracy of the underlying simulation model. Data-Driven Simulation (DDS) methods combine automated process discovery and enhancement techniques to learn process simulation models from event logs. Empirical studies have shown that, while DDS models adequately capture the observed sequences of activities and their frequencies, they fail to capture the temporal dynamics of real-life processes. In contrast, parallel work has shown that generative Deep Learning (DL) models are able to accurately capture such temporal dynamics. The drawback of these latter models is that users cannot alter them for what-if analysis due to their black-box nature. This paper presents a hybrid approach to learn process simulation models from event logs wherein a (stochastic) process model is extracted from a log using automated process discovery and enhancement techniques, and this model is then combined with a DL model to generate timestamped event sequences (traces). An experimental evaluation shows that the resulting hybrid simulation models match the temporal accuracy of pure DL models, while retaining the what-if analysis capability of DDS approaches.


MasakhaNER: Named Entity Recognition for African Languages

arXiv.org Artificial Intelligence

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.


The AI Wars: lessons from the conflict that paralyzed the field

#artificialintelligence

Rosenblatt led the design of a computer to implement this idea and tried to train it to recognize the differences between males and females in photos. It was described at the time as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."


Why Artificial Intelligence Will Make You Question Everything?

#artificialintelligence

Today, however, another revolution is unfolding that has potentially more far-reaching ramifications. According to experts, artificial intelligence is going to significantly change the way humans manufacture, produce and deliver. In other words, it will change the way we work, live and connect with one another. Moreover, the scale of this change will be unlike anything we have experienced before. AI entails all attempts to make machines and devices think just like humans do.


10 Years of the PCG workshop: Past and Future Trends

arXiv.org Artificial Intelligence

As of 2020, the international workshop on Procedural Content Generation enters its second decade. The annual workshop, hosted by the international conference on the Foundations of Digital Games, has collected a corpus of 95 papers published in its first 10 years. This paper provides an overview of the workshop's activities and surveys the prevalent research topics emerging over the years. In the decade since the first PCG workshop, research in artificial intelligence (AI) for generating game content has bloomed. PCG research of all types has been accepted in high-tier conferences and journals, and three special issues on topics directly relevant to PCG [10, 53, 99] were published in the IEEE Transactions on Games (and the preceding IEEE Transactions on Computational Intelligence and AI in Games). A textbook on Procedural Content ...


Unsupervised and self-adaptative techniques for cross-domain person re-identification

arXiv.org Artificial Intelligence

Person Re-Identification (ReID) across non-overlapping cameras is a challenging task and, for this reason, most works in the prior art rely on supervised feature learning from a labeled dataset to match the same person in different views. However, this demands the time-consuming task of labeling the acquired data, which prevents fast deployment, especially in forensic scenarios. Unsupervised Domain Adaptation (UDA) emerges as a promising alternative, as it performs feature-learning adaptation from a model trained on a source domain to a target domain without identity-label annotation. However, most UDA-based algorithms rely upon a complex loss function with several hyper-parameters, which hinders generalization to different scenarios. Moreover, as UDA depends on the translation between domains, it is important to select the most reliable data from the unseen domain, thus avoiding error propagation caused by noisy examples in the target data, an often overlooked problem. To address this, we propose a novel UDA-based ReID method that optimizes a simple loss function with only one hyper-parameter and that takes advantage of triplets of samples created by a new offline strategy based on the diversity of cameras within a cluster. This new strategy adapts the model and also regularizes it, avoiding overfitting on the target domain. We also introduce a new self-ensembling strategy, in which weights from different iterations are aggregated to create a final model combining knowledge from distinct moments of the adaptation. For evaluation, we consider three well-known deep learning architectures and combine them for final decision-making. The proposed method uses neither person re-ranking nor any labels on the target domain, and outperforms the state of the art, with a much simpler setup, on the Market to Duke, the challenging Market1501 to MSMT17, and Duke to MSMT17 adaptation scenarios.
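The self-ensembling step, in which weights from different iterations are aggregated into one model, can be sketched roughly as a weighted average of checkpoints. The abstract does not say how the iterations are weighted, so the uniform default and the optional `weights` argument here are assumptions, and the checkpoint layout (a dict mapping hypothetical parameter names to flat lists of floats) is chosen only for illustration.

```python
def average_checkpoints(checkpoints, weights=None):
    """Combine model checkpoints into one by a weighted parameter average.

    checkpoints: list of dicts, each mapping a parameter name to a flat
                 list of floats (hypothetical layout for this sketch).
    weights:     one non-negative weight per checkpoint; uniform if None
                 (ASSUMPTION: the paper's weighting scheme is not given
                 in the abstract).
    """
    if weights is None:
        weights = [1.0] * len(checkpoints)
    total = sum(weights)
    averaged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(size)
        ]
    return averaged


# Two toy checkpoints saved at different adaptation iterations.
ckpt_early = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_late = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

ensembled = average_checkpoints([ckpt_early, ckpt_late])
print(ensembled)
```

Averaging in parameter space yields a single model at inference time, in contrast to keeping several models and combining their predictions.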


Homophily Outlier Detection in Non-IID Categorical Data

arXiv.org Artificial Intelligence

Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to a failure to detect important outliers that are too subtle to be identified without considering the non-IID nature. The issue is further intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection of two different existing detectors.
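To give a feel for what propagating outlierness over a value graph might look like, here is a generic damped score propagation on a weighted graph. The abstract does not specify the propagation rule, so this is an illustrative stand-in rather than the authors' method; the function name, the damping factor, the iteration count and the toy graph are all assumptions.

```python
def propagate_outlierness(initial, edges, damping=0.85, iterations=50):
    """Spread initial outlierness scores along weighted graph edges.

    initial: dict mapping each feature value to its starting score.
    edges:   dict mapping a value to {neighbor: coupling weight}.
    ASSUMPTION: a damped, PageRank-like update is used purely for
    illustration; the source does not describe the actual rule.
    """
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node in initial:
            incoming = 0.0
            for src, neighbors in edges.items():
                if node in neighbors:
                    # Each source splits its score by normalized edge weight.
                    incoming += scores[src] * neighbors[node] / sum(neighbors.values())
            updated[node] = (1 - damping) * initial[node] + damping * incoming
        scores = updated
    return scores


# Toy value graph: "rare" starts as outlying and is coupled to "linked";
# "isolated" is only connected to itself.
initial = {"rare": 1.0, "linked": 0.0, "isolated": 0.0}
edges = {
    "rare": {"linked": 1.0},
    "linked": {"rare": 1.0},
    "isolated": {"isolated": 1.0},
}

scores = propagate_outlierness(initial, edges)
print(scores)
```

In this toy graph, the value coupled to an outlying value ends up with a higher score than the isolated one, which is the kind of interdependence a purely IID scoring would miss.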