Owing to rapid technological innovation and the desire to find and employ biomarkers for neurodegenerative disease, high-dimensional data classification problems are routinely encountered in neuroimaging studies. To avoid over-fitting and to explore relationships between disease and potential biomarkers, feature learning and selection play an important role in classifier construction and are an important area of machine learning. In this article, we review several important feature learning and selection techniques, including lasso-based methods, PCA, the two-sample t-test, and stacked auto-encoders. We compare these approaches using a numerical study involving the prediction of Alzheimer's disease from Magnetic Resonance Imaging.
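As an illustration of the simplest technique listed above, the following is a minimal sketch of two-sample t-test feature selection. It assumes the Welch (unequal-variance) form of the statistic and uses synthetic data; the function names `welch_t` and `rank_features`, the sample sizes, and the effect size are hypothetical and are not taken from the article.

```python
import math
import random

def welch_t(a, b):
    """Welch two-sample t statistic for two lists of floats."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def rank_features(X, y, k):
    """Return the indices of the k features with the largest |t| between classes 0 and 1."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(g0, g1)))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:k]

# Synthetic stand-in for imaging features (assumption): 40 subjects, 50 features,
# and only features 3 and 7 differ between the two groups.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = []
for label in y:
    row = [rng.gauss(0.0, 1.0) for _ in range(50)]
    if label == 1:
        row[3] += 3.0
        row[7] += 3.0
    X.append(row)

selected = rank_features(X, y, k=2)
print(sorted(selected))
```

In a real study the ranking would be computed on training data only, inside each cross-validation fold, so that the selection step does not leak information from the test subjects.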
We propose a Bayesian mixed-effects model to learn typical scenarios of change from longitudinal manifold-valued data, namely repeated measurements of the same objects or individuals at several points in time. The model allows the estimation of a group-average trajectory in the space of measurements. Random variations of this trajectory result from spatiotemporal transformations, which allow changes in the direction of the trajectory and in the pace at which trajectories are followed. The use of the tools of Riemannian geometry makes it possible to derive a generic algorithm for any kind of data with smooth constraints, which therefore lie on a Riemannian manifold. A stochastic approximation of the Expectation-Maximization algorithm is used to estimate the model parameters in this highly non-linear setting. The method is used to estimate a data-driven model of the progressive impairments of cognitive functions during the onset of Alzheimer's disease. Experimental results show that the model correctly puts into correspondence the age at which each individual was diagnosed with the disease, thus validating that it effectively estimated a normative scenario of disease progression. Random effects provide unique insights into the variations in the ordering and timing of the succession of cognitive impairments across different individuals.
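One possible way to write down the spatiotemporal transformations described above is sketched below. The abstract gives no notation, so every symbol here ($\gamma_0$, $\eta^{w_i}$, $\psi_i$, $\alpha_i$, $\tau_i$, $t_0$, $\varepsilon_{ij}$) is an assumption introduced for illustration: $\gamma_0$ stands for the group-average trajectory, $w_i$ for a subject-specific change of direction, and $\psi_i$ for a subject-specific affine reparametrization of time.

```latex
% Sketch under assumed notation; not the authors' stated formulation.
% y_{ij}: j-th observation of individual i at time t_{ij}
\begin{align}
  y_{ij} &= \eta^{w_i}(\gamma_0)\bigl(\psi_i(t_{ij})\bigr) + \varepsilon_{ij}, \\
  \psi_i(t) &= \alpha_i \,(t - t_0 - \tau_i) + t_0, \qquad \alpha_i > 0 .
\end{align}
```

Under this reading, $\alpha_i$ controls the pace at which individual $i$ follows the trajectory ($\alpha_i > 1$ is faster than average), $\tau_i$ shifts the onset in time, and $w_i$ moves the individual trajectory away from the average one on the manifold.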
Inter-subject parcellation of functional Magnetic Resonance Imaging (fMRI) data based on a standard General Linear Model (GLM) and spectral clustering was recently proposed as a means to alleviate the issues associated with spatial normalization in fMRI. However, for all its appeal, a GLM-based parcellation approach introduces its own biases, in the form of a priori knowledge about the shape of the Hemodynamic Response Function (HRF) and task-related signal changes, or about the subject's behaviour during the task. In this paper, we introduce a data-driven version of the spectral clustering parcellation, based on Independent Component Analysis (ICA) and Partial Least Squares (PLS) instead of the GLM. First, a number of independent components are automatically selected. Seed voxels are then obtained from the associated ICA maps, and we compute the PLS latent variables between the fMRI signal of the seed voxels (which covers regional variations of the HRF) and the principal components of the signal across all voxels. Finally, we parcellate all subjects' data with a spectral clustering of the PLS latent variables. We present results of the application of the proposed method to both single-subject and multi-subject fMRI datasets. Preliminary experimental results, evaluated with the intra-parcel variance of GLM t-values and PLS-derived t-values, indicate that this data-driven approach offers an improvement in parcellation accuracy over GLM-based techniques.
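The evaluation criterion mentioned above, intra-parcel variance of t-values, can be sketched as follows. This is a minimal illustration, assuming that the criterion is the voxel-weighted average of the within-parcel population variance; the function name `intra_parcel_variance` and the toy values are hypothetical, and the paper may define the measure differently.

```python
from collections import defaultdict
from statistics import pvariance

def intra_parcel_variance(t_values, labels):
    """Voxel-weighted mean of the within-parcel variance of t-values.

    t_values: one t-value per voxel; labels: parcel label per voxel.
    Lower values mean functionally more homogeneous parcels.
    """
    parcels = defaultdict(list)
    for t, label in zip(t_values, labels):
        parcels[label].append(t)
    n_voxels = len(t_values)
    return sum(len(v) * pvariance(v) for v in parcels.values()) / n_voxels

# Toy example (assumption): six voxels, three active and three inactive.
t_values = [2.0, 2.2, 1.8, -0.1, 0.1, 0.0]
good = intra_parcel_variance(t_values, [0, 0, 0, 1, 1, 1])  # parcels follow activation
bad = intra_parcel_variance(t_values, [0, 1, 0, 1, 0, 1])   # parcels mix both kinds
print(good < bad)
```

A parcellation that groups voxels with similar responses yields a lower score than one that mixes active and inactive voxels, which is the sense in which the criterion is used to compare methods.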
Yin, Zhijun (Vanderbilt University) | Chen, You (Vanderbilt University) | Fabbri, Daniel (Vanderbilt University) | Sun, Jimeng (Georgia Institute of Technology) | Malin, Bradley (Vanderbilt University)
User-generated content in social media is increasingly acknowledged as a rich resource for research into health problems. One particular area of interest is the semantics individuals evoke, because these can influence when health-related information is disclosed. While there have been multiple investigations into why self-disclosure occurs, much less is known about when individuals choose to disclose information about other people (e.g., a relative), which is a significant privacy concern. In this paper, we introduce a novel framework to investigate how semantics influence disclosure routines for 34 health issues. This framework begins with a supervised classification model to distinguish tweets that communicate personal health issues from confounding concepts (e.g., metaphorical statements that include a health-related keyword). Next, we annotate tweets for each health issue with linguistic and psychological categories (e.g., social processes, affective processes, and personal concerns). Then, we apply a non-negative matrix factorization over a health issue-by-language category space. Finally, the factorized basis space is leveraged to group health issues into natural aggregations based on how they are discussed. We evaluate this framework with four months of tweets (over 200 million) and show that certain semantics correspond to whom a health mention pertains. Our findings show that health issues related to family members, high medical cost, and social support (e.g., Alzheimer's disease, cancer, and Down syndrome) lead to tweets that are more likely to disclose another individual's health status, while tweets about more benign health issues (e.g., allergy, arthritis, and bronchitis), associated with biological processes (e.g., health and ingestion) and negative emotions, are more likely to contain self-disclosures.
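The factorization and grouping steps above can be sketched as follows. This is a minimal illustration, assuming the classic multiplicative-update rule for non-negative matrix factorization under squared error; the abstract does not say which NMF algorithm or grouping rule was used. The matrix values, the issue and category names, and the helper names (`nmf`, `scale`, `squared_error`) are all hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Sum of squared differences between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def scale(M, num, den, eps):
    """Elementwise multiplicative update: M * num / (den + eps)."""
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]

def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(rank)]
    for _ in range(n_iter):
        Wt = transpose(W)
        H = scale(H, matmul(Wt, V), matmul(matmul(Wt, W), H), eps)
        Ht = transpose(H)
        W = scale(W, matmul(V, Ht), matmul(W, matmul(H, Ht)), eps)
    return W, H

# Hypothetical health issue-by-language category counts (rows: issues).
issues = ["alzheimers", "cancer", "allergy", "bronchitis"]
categories = ["family", "money", "body", "negative_emotion"]
V = [[9, 7, 1, 1],
     [8, 6, 2, 1],
     [1, 1, 8, 6],
     [1, 2, 9, 7]]

W, H = nmf(V, rank=2)
# Assign each issue to its dominant factor, after rescaling each column of W
# by its maximum so that the arbitrary scale of the factors does not matter.
col_max = [max(row[k] for row in W) for k in range(2)]
groups = [max(range(2), key=lambda k: row[k] / col_max[k]) for row in W]
print(dict(zip(issues, groups)))
```

Which factor index ends up representing which group is arbitrary and depends on the random initialization; only the partition of the issues is meaningful.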
Analysis of spontaneous speech is an important tool for clinical linguists to diagnose various dementia types that affect the language processing areas. Prosody is affected by some dementia types, most notably Parkinson's disease (PD; degradation of voice quality, unstable pitch), Alzheimer's disease (AD; monotonic pitch), and the non-fluent type of Primary Progressive Aphasia (PPA-NF; hesitant, non-fluent speech). Prosodic features can be computed efficiently by software. In this study, we evaluate the performance of an SVM classifier that is trained on prosodic features only. The limitation to prosody alone yields baseline results that can be used at a later stage to evaluate the added effect of (morpho)syntactic variables. The goal is to distinguish different dementia types based on the recorded speech. Results show that the classifier can distinguish some dementia types (PPA-NF, AD), but not others (PD, PPA-SD).