Its impact is drastic and real: Youtube's AIdriven recommendation system would present sports videos for days if one happens to watch a live baseball game on the platform ; email writing becomes much faster with machine learning (ML) based auto-completion ; many businesses have adopted natural language processing based chatbots as part of their customer services . AI has also greatly advanced human capabilities in complex decision-making processes ranging from determining how to allocate security resources to protect airports  to games such as poker  and Go . All such tangible and stunning progress suggests that an "AI summer" is happening. As some put it, "AI is the new electricity" . Meanwhile, in the past decade, an emerging theme in the AI research community is the so-called "AI for social good" (AI4SG): researchers aim at developing AI methods and tools to address problems at the societal level and improve the wellbeing of the society.
Such information includes: the database in modern hospital systems, usually known as Electronic Health Records (EHR), which store the patients' diagnosis, medication, laboratory test results, medical image data, etc.; information on various health behaviors tracked and stored by wearable devices, ubiquitous sensors and mobile applications, such as the smoking status, alcoholism history, exercise level, sleeping conditions, etc.; information collected by census or various surveys regarding sociodemographic factors of the target cohort; and information on people's mental health inferred from their social media activities or social networks such as Twitter, Facebook, etc. These health-related data come from heterogeneous sources, describe assorted aspects of the individual's health conditions. Such data is rich in structure and information which has great research potentials for revealing unknown medical knowledge about genomic epidemiology, disease developments and correlations, drug discoveries, medical diagnosis, mental illness prevention, health behavior adaption, etc. In real-world problems, the number of features relating to a certain health condition could grow exponentially with the development of new information techniques for collecting and measuring data. To reveal the causal influence between various factors and a certain disease or to discover the correlations among diseases from data at such a tremendous scale, requires the assistance of advanced information technology such as data mining, machine learning, text mining, etc. Machine learning technology not only provides a way for learning qualitative relationships among features and patients, but also the quantitative parameters regarding the strength of such correlations.
Supervisory signals have the potential to make low-dimensional data representations, like those learned by mixture and topic models, more interpretable and useful. We propose a framework for training latent variable models that explicitly balances two goals: recovery of faithful generative explanations of high-dimensional data, and accurate prediction of associated semantic labels. Existing approaches fail to achieve these goals due to an incomplete treatment of a fundamental asymmetry: the intended application is always predicting labels from data, not data from labels. Our prediction-constrained objective for training generative models coherently integrates loss-based supervisory signals while enabling effective semi-supervised learning from partially labeled data. We derive learning algorithms for semi-supervised mixture and topic models using stochastic gradient descent with automatic differentiation. We demonstrate improved prediction quality compared to several previous supervised topic models, achieving predictions competitive with high-dimensional logistic regression on text sentiment analysis and electronic health records tasks while simultaneously learning interpretable topics.
Big data applications, such as medical imaging and genetics, typically generate datasets that consist of few observations n on many more variables p, a scenario that we denote as p>>n. Traditional data processing methods are often insufficient for extracting information out of big data. This calls for the development of new algorithms that can deal with the size, complexity, and the special structure of such datasets. In this paper, we consider the problem of classifying p>>n data and propose a classification method based on linear discriminant analysis (LDA). Traditional LDA depends on the covariance estimate of the data, but when p>>n the sample covariance estimate is singular. The proposed method estimates the covariance by using a sparse version of noisy principal component analysis (nPCA). The use of sparsity in this setting aims at automatically selecting variables that are relevant for classification. In experiments, the new method is compared to state-of-the art methods for big data problems using both simulated datasets and imaging genetics datasets.