Goto

Collaborating Authors

 Regression


Heterogeneous Distributed Lag Models to Estimate Personalized Effects of Maternal Exposures to Air Pollution

arXiv.org Machine Learning

Children's health studies support an association between maternal environmental exposures and children's birth and health outcomes. A common goal in such studies is to identify critical windows of susceptibility -- periods during gestation with increased association between maternal exposures and a future outcome. The associations and timings of critical windows are likely heterogeneous across different levels of individual, family, and neighborhood characteristics. However, the few studies that have considered effect modification were limited to a few pre-specified subgroups. We propose a statistical learning method to estimate critical windows at the individual level and identify important characteristics that induce heterogeneity. The proposed approach uses distributed lag models (DLMs) modified by Bayesian additive regression trees to account for effect heterogeneity based on a potentially high-dimensional set of modifying factors. We show in a simulation study that our model can identify both critical windows and modifiers responsible for DLM heterogeneity. We estimate the relationship between weekly exposures to fine particulate matter during gestation and birth weight in an administrative Colorado birth cohort. We identify maternal body mass index (BMI), age, Hispanic designation, and education as modifiers of the distributed lag effects and find non-Hispanics with increased BMI to be a susceptible population.


Data-driven Residual Generation for Early Fault Detection with Limited Data

arXiv.org Artificial Intelligence

Traditionally, fault detection and isolation community has used system dynamic equations to generate diagnosers and to analyze detectability and isolability of the dynamic systems. Model-based fault detection and isolation methods use system model to generate a set of residuals as the bases for fault detection and isolation. However, in many complex systems it is not feasible to develop highly accurate models for the systems and to keep the models updated during the system lifetime. Recently, data-driven solutions have received an immense attention in the industries systems for several practical reasons. First, these methods do not require the initial investment and expertise for developing accurate models. Moreover, it is possible to automatically update and retrain the diagnosers as the system or the environment change over time. Finally, unlike the model-based methods it is straight forward to combine time series measurements such as pressure and voltage with other sources of information such as system operating hours to achieve a higher accuracy. In this paper, we extend the traditional model-based fault detection and isolation concepts such as residuals, and detectable and isolable faults to the data-driven domain. We then propose an algorithm to automatically generate residuals from the normal operating data. We present the performance of our proposed approach through a comparative case study.


Conditional Cross-Design Synthesis Estimators for Generalizability in Medicaid

arXiv.org Machine Learning

While much of the causal inference literature has focused on addressing internal validity biases, both internal and external validity are necessary for unbiased estimates in a target population of interest. However, few generalizability approaches exist for estimating causal quantities in a target population when the target population is not well-represented by a randomized study but is reflected when additionally incorporating observational data. To generalize to a target population represented by a union of these data, we propose a class of novel conditional cross-design synthesis estimators that combine randomized and observational data, while addressing their respective biases. The estimators include outcome regression, propensity weighting, and double robust approaches. All use the covariate overlap between the randomized and observational data to remove potential unmeasured confounding bias. We apply these methods to estimate the causal effect of managed care plans on health care spending among Medicaid beneficiaries in New York City.


onlineforecast: An R package for adaptive and recursive forecasting

arXiv.org Machine Learning

Systems that rely on forecasts to make decisions, e.g. control or energy trading systems, require frequent updates of the forecasts. Usually, the forecasts are updated whenever new observations become available, hence in an online setting. We present the R package onlineforecast that provides a generalized setup of data and models for online forecasting. It has functionality for time-adaptive fitting of linear regression-based models. Furthermore, dynamical and non-linear effects can be easily included in the models. The setup is tailored to enable effective use of forecasts as model inputs, e.g. numerical weather forecast. Users can create new models for their particular system applications and run models in an operational online setting. The package also allows users to easily replace parts of the setup, e.g. use kernel or neural network methods for estimation. The package comes with comprehensive vignettes and examples of online forecasting applications in energy systems, but can easily be applied in all fields where online forecasting is used.


Distributionally Robust Multiclass Classification and Applications in Deep CNN Image Classifiers

arXiv.org Machine Learning

We develop a Distributionally Robust Optimization (DRO) formulation for Multiclass Logistic Regression (MLR), which could tolerate data contaminated by outliers. The DRO framework uses a probabilistic ambiguity set defined as a ball of distributions that are close to the empirical distribution of the training set in the sense of the Wasserstein metric. We relax the DRO formulation into a regularized learning problem whose regularizer is a norm of the coefficient matrix. We establish out-of-sample performance guarantees for the solutions to our model, offering insights on the role of the regularizer in controlling the prediction error. We apply the proposed method in rendering deep CNN-based image classifiers robust to random and adversarial attacks. Specifically, using the MNIST and CIFAR-10 datasets, we demonstrate reductions in test error rate by up to 78.8% and loss by up to 90.8%. We also show that with a limited number of perturbed images in the training set, our method can improve the error rate by up to 49.49% and the loss by up to 68.93% compared to Empirical Risk Minimization (ERM), converging faster to an ideal loss/error rate as the number of perturbed images increases.


Data Summarization via Bilevel Optimization

arXiv.org Machine Learning

The increasing availability of massive data sets poses a series of challenges for machine learning. Prominent among these is the need to learn models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is to operate on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and $k$-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning.


A Tour of Machine Learning Algorithms

#artificialintelligence

In this post, we will take a tour of the most popular machine learning algorithms. It is useful to tour the main algorithms in the field to get a feeling of what methods are available. There are so many algorithms that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit. I want to give you two ways to think about and categorize the algorithms you may come across in the field. Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.


Modelling the transition to a low-carbon energy supply

arXiv.org Artificial Intelligence

A transition to a low-carbon electricity supply is crucial to limit the impacts of climate change. Reducing carbon emissions could help prevent the world from reaching a tipping point, where runaway emissions are likely. Runaway emissions could lead to extremes in weather conditions around the world -- especially in problematic regions unable to cope with these conditions. However, the movement to a low-carbon energy supply can not happen instantaneously due to the existing fossil-fuel infrastructure and the requirement to maintain a reliable energy supply. Therefore, a low-carbon transition is required, however, the decisions various stakeholders should make over the coming decades to reduce these carbon emissions are not obvious. This is due to many long-term uncertainties, such as electricity, fuel and generation costs, human behaviour and the size of electricity demand. A well choreographed low-carbon transition is, therefore, required between all of the heterogenous actors in the system, as opposed to changing the behaviour of a single, centralised actor. The objective of this thesis is to create a novel, open-source agent-based model to better understand the manner in which the whole electricity market reacts to different factors using state-of-the-art machine learning and artificial intelligence methods. In contrast to other works, this thesis looks at both the long-term and short-term impact that different behaviours have on the electricity market by using these state-of-the-art methods.


Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims

arXiv.org Artificial Intelligence

We present an overview of the second edition of the CheckThat! Lab at CLEF 2019. The lab featured two tasks in two different languages: English and Arabic. Task 1 (English) challenged the participating systems to predict which claims in a political debate or speech should be prioritized for fact-checking. Task 2 (Arabic) asked to (A) rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) identify useful passages from these pages, and (D) use the useful pages to predict the claim's factuality. CheckThat! provided a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) and normalized discounted cumulative gain (nDCG) for ranking, and F1 for classification. A total of 47 teams registered to participate in this lab, and fourteen of them actually submitted runs (compared to nine last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. As for Task 2, learning-to-rank was used by the highest scoring runs for subtask A, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important tasks of check-worthiness estimation and automatic claim verification.


5 Best Online Biostatistics Programs and Courses

#artificialintelligence

Are you looking for Best Online Biostatistics Programs and Courses?… If yes, then your search will end here. In this article, I am going to share the 5 Best Online Biostatistics Programs and Courses with you. So, give your few minutes to this article and find out the best online Biostatistics program for you. The goal of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.