Automatic Language Identification in Texts: A Survey

Journal of Artificial Intelligence Research

Language identification ("LI") is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.
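As a rough illustration of the kind of feature-based method such a survey covers, the sketch below identifies a language by comparing character trigram counts with cosine similarity. It is a toy example and not a method taken from the article: the function names, the two-language training samples, and the choice of trigrams with cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram count profile per language from {language: [texts]}."""
    return {
        lang: Counter(g for text in texts for g in char_ngrams(text, n))
        for lang, texts in samples.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    query = Counter(char_ngrams(text, n))
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data; a real system would use far larger corpora.
TOY_SAMPLES = {
    "en": [
        "the quick brown fox jumps over the lazy dog",
        "language identification is a key part of text processing",
    ],
    "de": [
        "der schnelle braune fuchs springt über den faulen hund",
        "die sprache eines textes wird automatisch erkannt",
    ],
}
PROFILES = train_profiles(TOY_SAMPLES)
```

Calling `identify("the dog is over there", PROFILES)` compares the query against each profile and returns the best-matching language code. With so little training data the sketch is only reliable on text that shares vocabulary with the samples.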


Estimation of preterm birth markers with U-Net segmentation network

arXiv.org Machine Learning

Preterm birth is the most common cause of neonatal death. Current diagnostic methods that assess the risk of preterm birth involve the collection of maternal characteristics and transvaginal ultrasound imaging conducted in the first and second trimesters of pregnancy. Analysis of the ultrasound data is based on visual inspection of images by a gynaecologist, sometimes supported by hand-designed image features such as cervical length. Due to the complexity of this process and its subjective component, approximately 30% of spontaneous preterm deliveries are not correctly predicted. Moreover, 10% of the predicted preterm deliveries are false positives. In this paper, we address the problem of predicting spontaneous preterm delivery using machine learning. To achieve this goal, we propose to first use a deep neural network architecture for segmenting prenatal ultrasound images and then automatically extract two biophysical ultrasound markers, cervical length (CL) and anterior cervical angle (ACA), from the resulting images. Our method allows ultrasound markers to be estimated without human oversight. Furthermore, we show that the CL and ACA markers, when combined, allow us to decrease the false-negative ratio from 30% to 18%. Finally, in contrast to current diagnostic methods that rely only on the gynaecologist's expertise, our method yields objectively obtained results.


DGSAN: Discrete Generative Self-Adversarial Network

arXiv.org Machine Learning

Although GAN-based methods have achieved considerable success in the last few years, they have not been as successful in generating discrete data. The most important challenge of these methods is the difficulty of passing the gradient from the discriminator to the generator when the generator outputs are discrete. Despite several attempts to alleviate this problem, none of the existing GAN-based methods has improved the performance of text generation (using measures that evaluate both the quality and the diversity of generated samples) compared to a generative RNN that is simply trained by the maximum likelihood approach. In this paper, we propose a new framework for generating discrete data by an adversarial approach in which we do not need to pass the gradient to the generator. In the proposed method, the update of either the generator or the discriminator can be accomplished straightforwardly. Moreover, we leverage the discreteness of data to explicitly model the data distribution and ensure the normalization of the generated distribution and consequently the convergence properties of the proposed method. Experimental results generally show the superiority of the proposed DGSAN method compared to the other GAN-based approaches for generating discrete sequential data.


Heterogeneous Relational Kernel Learning

arXiv.org Machine Learning

Recent work has developed Bayesian methods for the automatic statistical analysis and description of single time series as well as of homogeneous sets of time series data. We extend prior work to create an interpretable kernel embedding for heterogeneous time series. Our method adds practically no computational cost compared to prior results by leveraging previously discarded intermediate results. We show the practical utility of our method by leveraging the learned embeddings for clustering, pattern discovery, and anomaly detection. These applications are beyond the ability of prior relational kernel learning approaches.


Scalable Modeling of Spatiotemporal Data using the Variational Autoencoder: an Application in Glaucoma

arXiv.org Machine Learning

Submitted to the Annals of Applied Statistics. By Samuel I. Berchuck, Felipe A. Medeiros and Sayan Mukherjee, Duke University.

As big spatial data becomes increasingly prevalent, classical spatiotemporal (ST) methods often do not scale well. While methods have been developed to account for high-dimensional spatial objects, the setting where there are exceedingly large samples of spatial observations has had less attention. The variational autoencoder (VAE), an unsupervised generative model based on deep learning and approximate Bayesian inference, fills this void using a latent variable specification that is inferred jointly across the large number of samples. In this manuscript, we compare the performance of the VAE with a more classical ST method when analyzing longitudinal visual fields from a large cohort of patients in a prospective glaucoma study. Through simulation and a case study, we demonstrate that the VAE is a scalable method for analyzing ST data, when the goal is to obtain accurate predictions. R code to implement the VAE can be found on GitHub: https://github.com/berchuck/vaeST.

1. Introduction. As high-speed computing and medical imaging become increasingly inexpensive, massive amounts of data are generated that have to be analyzed and are often spatial in nature (Bearden and Thompson, 2017; Smith and Nichols, 2018). In the case of medical imaging, the number of patients that can be imaged has skyrocketed in recent years, allowing for studies that include images from many thousands of patients (Van Essen et al., 2013; Miller et al., 2016). The current spatial statistics literature focuses heavily on scalability in terms of the number of spatial locations (Banerjee, 2017), but largely ignores the setting where a joint model is needed for spatiotemporal (ST) data that are generated from a large cohort.
Historically, learning an appropriate generating process in this setting was untenable, typically leading to simplifying assumptions, such as point-wise (PW) modeling of locations across time (Fitzke et al., 1996). In particular, generative models using deep learning have shown great promise in modeling complex distributions, p(x), for x = x_{1:M} in some potentially high-dimensional space X. Sampling from X is often intractable, so instead generative modeling learns a distribution q(x) that can be sampled from and is close to p(x) (Doersch, 2016). As such, generative modeling can be viewed as an approximate method for performing inference in high-dimensional contexts, when there is an overwhelming availability of observations x. Generative modeling, and in particular the variational autoencoder (VAE), is well suited for modeling large cohorts of ST data, because it can characterize variability in a spatial data source through joint modeling (Kingma and Welling, 2013).
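For orientation, the objective usually optimized when fitting a VAE is the evidence lower bound of Kingma and Welling (2013), written below in its standard form. The latent variable z, the parameter subscripts θ and φ, and the prior p(z) are notational assumptions that do not appear in the excerpt above. Note also that q here denotes the approximate posterior over z, not the sampling distribution q(x) mentioned in the text.

```latex
\log p_\theta(x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of x from its latent code, while the second keeps the approximate posterior close to the prior.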


Identification of Pediatric Sepsis Subphenotypes for Enhanced Machine Learning Predictive Performance: A Latent Profile Analysis

arXiv.org Machine Learning

Background: While machine learning (ML) models are rapidly emerging as promising screening tools in critical care medicine, the identification of homogeneous subphenotypes within populations with heterogeneous conditions such as pediatric sepsis may facilitate attainment of high predictive performance of these prognostic algorithms. This study aimed to identify subphenotypes of pediatric sepsis and demonstrate the potential value of partitioned data/subtyping-based training. Methods: This was a retrospective study of clinical data extracted from medical records of 6,446 pediatric patients who were admitted to a major hospital system in the DC area. Vitals and labs associated with patients meeting the diagnostic criteria for sepsis were used to perform latent profile analysis. Modern ML algorithms were used to explore the predictive performance benefits of reduced training data heterogeneity via label profiling. Results: In total, 134 (2.1%) patients met the diagnostic criteria for sepsis in this cohort and latent profile analysis identified four profiles/subphenotypes of pediatric sepsis. Profiles 1 and 3 had the lowest mortality and included pediatric patients from different age groups. Profile 2 was characterized by respiratory dysfunction; profile 4 by neurological dysfunction and the highest mortality rate (22.2%). Machine learning experiments comparing the predictive performance of models derived without training data profiling against profile-targeted models suggest that statistically significant improvements in predictive performance can be obtained. For example, the area under the ROC curve (AUC) obtained to predict profile 4 with 24-hour data (AUC = .998, p < .0001) compared favorably with the AUC obtained from the model considering all profiles as a single homogeneous group (AUC = .918) with 24-hour data.


Consistent Classification with Generalized Metrics

arXiv.org Machine Learning

We propose a framework for constructing and analyzing multiclass and multioutput classification metrics, i.e., involving multiple, possibly correlated multiclass labels. Our analysis reveals novel insights into the geometry of feasible confusion tensors, including necessary and sufficient conditions for the equivalence between optimizing an arbitrary non-decomposable metric and learning a weighted classifier. Further, we analyze averaging methodologies commonly used to compute multioutput metrics and characterize the corresponding Bayes optimal classifiers. We show that the plug-in estimator based on this characterization is consistent and is easily implemented as a post-processing rule. Empirical results on synthetic and benchmark datasets support the theoretical findings.
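To make the idea of a plug-in weighted classifier applied as a post-processing rule more concrete, the sketch below implements the textbook cost-sensitive decision rule for a single multiclass output. Given estimated class probabilities and a gain matrix, it predicts the class with the highest expected gain. The abstract does not specify the paper's actual weighting or its multioutput extension, so the function name, the `gain` matrix, and the single-output setting are assumptions made for illustration.

```python
def weighted_plugin_predict(class_probs, gain):
    """Predict the class that maximizes expected gain.

    class_probs[j] is an estimate of P(y = j | x), e.g. from any
    probabilistic model; gain[j][k] is the (assumed) reward for predicting
    class k when the true class is j.
    """
    num_predictions = len(gain[0])
    # Expected gain of each candidate prediction k under the estimated posterior.
    scores = [
        sum(class_probs[j] * gain[j][k] for j in range(len(class_probs)))
        for k in range(num_predictions)
    ]
    return max(range(num_predictions), key=scores.__getitem__)
```

With an identity gain matrix the rule reduces to the usual argmax over class probabilities. Up-weighting a class, for example `gain = [[1, 0], [0, 3]]`, shifts the decision towards that class without retraining the underlying probability model.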


Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning

arXiv.org Machine Learning

Many decision problems in science, engineering and economics are affected by uncertain parameters whose distribution is only indirectly observable through samples. The goal of data-driven decision-making is to learn a decision from finitely many training samples that will perform well on unseen test samples. This learning task is difficult even if all training and test samples are drawn from the same distribution, especially if the dimension of the uncertainty is large relative to the training sample size. Wasserstein distributionally robust optimization seeks data-driven decisions that perform well under the most adverse distribution within a certain Wasserstein distance from a nominal distribution constructed from the training samples. In this tutorial we will argue that this approach has many conceptual and computational benefits. Most prominently, the optimal decisions can often be computed by solving tractable convex optimization problems, and they enjoy rigorous out-of-sample and asymptotic consistency guarantees. We will also show that Wasserstein distributionally robust optimization has interesting ramifications for statistical learning and motivates new approaches for fundamental learning tasks such as classification, regression, maximum likelihood estimation or minimum mean square error estimation, among others.
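The verbal description above corresponds to a min-max problem, which can be written as follows. The symbols are assumptions introduced here for illustration: θ is the decision, ξ the uncertain parameter, ℓ the loss, W a Wasserstein distance, ε the radius of the ambiguity set, and the nominal distribution is taken to be the empirical distribution of the N training samples, which is a common choice but not one stated in the abstract.

```latex
\min_{\theta \in \Theta} \;
\sup_{Q \,:\, W(Q, \widehat{P}_N) \le \varepsilon} \;
\mathbb{E}_{\xi \sim Q}\!\left[ \ell(\theta, \xi) \right],
\qquad
\widehat{P}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{\widehat{\xi}_i}
```

Setting ε = 0 collapses the ambiguity set to the nominal distribution alone and recovers ordinary empirical risk minimization.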


Reinforcement Learning in Healthcare: A Survey

arXiv.org Artificial Intelligence

As a subfield of machine learning, reinforcement learning (RL) aims at empowering one's capabilities in behavioural decision making by using interaction experience with the world and evaluative feedback. Unlike traditional supervised learning methods that usually rely on one-shot, exhaustive and supervised reward signals, RL tackles sequential decision making problems with sampled, evaluative and delayed feedback simultaneously. Such distinctive features make RL a suitable candidate for developing powerful solutions in a variety of healthcare domains, where diagnostic decisions or treatment regimes are usually characterized by a prolonged and sequential procedure. This survey will discuss the broad applications of RL techniques in healthcare domains, in order to provide the research community with a systematic understanding of the theoretical foundations, enabling methods and techniques, existing challenges, and new insights of this emerging paradigm. After briefly examining theoretical foundations and key techniques in RL research from efficient and representational directions, we provide an overview of RL applications in a variety of healthcare domains, ranging from dynamic treatment regimes in chronic diseases and critical care, to automated medical diagnosis from both unstructured and structured clinical data, to many other control or scheduling domains that have infiltrated many aspects of a healthcare system. Finally, we summarize the challenges and open issues in current research, and point out some potential solutions and directions for future research.


Opponent Aware Reinforcement Learning

arXiv.org Machine Learning

In several reinforcement learning (RL) scenarios such as security settings, there may be adversaries trying to interfere with the reward generating process for their own benefit. We introduce Threatened Markov Decision Processes (TMDPs) as a framework to support an agent against potential opponents in an RL context. We also propose a level-k thinking scheme resulting in a novel learning approach to deal with TMDPs. After introducing our framework and deriving theoretical results, relevant empirical evidence is given via extensive experiments, showing the benefits of accounting for adversaries in RL while the agent learns.