We address the problem of image collection summarization by learning mixtures of submodular functions. We argue that submodularity is very natural to this problem, and we show that a number of previously used scoring functions are submodular, a property not explicitly mentioned in these publications. We provide classes of submodular functions capturing the necessary properties of summaries, namely coverage, likelihood, and diversity. To learn mixtures of these submodular functions as scoring functions, we formulate summarization as a supervised learning problem using large-margin structured prediction. Furthermore, we introduce a novel evaluation metric, which we call V-ROUGE, for automatic summary scoring. While a similar metric called ROUGE has been successfully applied to document summarization, no such metric was known for quantifying the quality of image collection summaries. We provide a new dataset consisting of 14 real-world image collections along with many human-generated ground truth summaries collected using Mechanical Turk. We also extensively compare our method with previously explored methods for this problem and show that our learning approach outperforms all competitors on this new dataset. This paper provides, to our knowledge, the first systematic approach for quantifying the problem of image collection summarization, along with a new dataset of image collections and human summaries.
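As a rough illustration of scoring summaries with a mixture of submodular functions, the sketch below combines a facility-location coverage term with a cluster-count diversity term and selects images greedily under a size budget. The two component functions, the fixed weights, and the names `coverage`, `diversity`, `mixture`, and `greedy_summary` are assumptions made for this example; the abstract describes learning the weights by large-margin structured prediction, which is not shown here.

```python
from typing import Callable, List, Sequence, Set

Score = Callable[[Set[int]], float]


def coverage(sim: Sequence[Sequence[float]], summary: Set[int]) -> float:
    """Facility-location coverage: every image is credited with its best match in the summary."""
    if not summary:
        return 0.0
    return sum(max(row[j] for j in summary) for row in sim)


def diversity(clusters: Sequence[int], summary: Set[int]) -> float:
    """Number of distinct clusters represented in the summary."""
    return float(len({clusters[j] for j in summary}))


def mixture(weights: Sequence[float], components: Sequence[Score]) -> Score:
    """Non-negative weighted sum of submodular components (itself submodular)."""
    def score(summary: Set[int]) -> float:
        return sum(w * f(summary) for w, f in zip(weights, components))
    return score


def greedy_summary(score: Score, n_items: int, budget: int) -> List[int]:
    """Repeatedly add the image with the largest marginal gain; ties go to the lowest index."""
    chosen: List[int] = []
    current: Set[int] = set()
    for _ in range(min(budget, n_items)):
        base = score(current)
        best_item, best_gain = -1, float("-inf")
        for j in range(n_items):
            if j in current:
                continue
            gain = score(current | {j}) - base
            if gain > best_gain:
                best_item, best_gain = j, gain
        chosen.append(best_item)
        current.add(best_item)
    return chosen


# Hypothetical toy data: images 0/1 and 2/3 are near-duplicate pairs.
SIM = [
    [1.0, 0.75, 0.25, 0.25],
    [0.75, 1.0, 0.25, 0.25],
    [0.25, 0.25, 1.0, 0.75],
    [0.25, 0.25, 0.75, 1.0],
]
CLUSTERS = [0, 0, 1, 1]
WEIGHTS = [1.0, 0.5]  # fixed by hand here; the paper learns such weights

scorer = mixture(WEIGHTS, [lambda s: coverage(SIM, s), lambda s: diversity(CLUSTERS, s)])
print(greedy_summary(scorer, n_items=4, budget=2))
```

With a budget of two, the greedy pass takes one image from each near-duplicate pair, since a second image from an already covered pair adds little coverage and no diversity.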
Owing to advances in sensors, large and complex medical image data sets can now reveal pathological changes at the cellular or even molecular level, as well as anatomical changes in tissues and organs. As a consequence, medical images have the potential to enhance the diagnosis of disease, prediction of clinical outcomes, characterization of disease progression, management of health care, and development of treatments, but they also pose great methodological and computational challenges for the representation and selection of features in image cluster analysis. To address these challenges, we first extend one-dimensional functional principal component analysis to two-dimensional functional principal component analysis (2DFPCA) to fully capture the spatial variation of image signals. Image signals contain a large number of redundant and irrelevant features that provide little or no useful information for cluster analysis. Widely used methods for removing such features are sparse clustering algorithms that use a lasso-type penalty to select features. However, the accuracy of clustering with a lasso-type penalty depends on the choice of penalty parameters and of a threshold for selecting features, both of which are difficult to determine in practice. Recently, randomized algorithms have received a great deal of attention in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image cluster analysis. The proposed method is applied to ovarian and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms current sparse clustering algorithms in image cluster analysis.
Jin, Ling (Lawrence Berkeley National Laboratory) | Lee, Doris (Lawrence Berkeley National Laboratory) | Sim, Alex (Lawrence Berkeley National Laboratory) | Borgeson, Sam (Lawrence Berkeley National Laboratory) | Wu, Kesheng (Lawrence Berkeley National Laboratory) | Spurlock, C. Anna (Lawrence Berkeley National Laboratory) | Todd, Annika (Lawrence Berkeley National Laboratory)
Current practice in whole time series clustering of residential meter data focuses on aggregated or subsampled load data at the customer level, which ignores day-to-day differences within customers. This information is critical for determining each customer’s suitability for the various demand-side management strategies that support intelligent power grids and smart energy management. Clustering daily load shapes provides fine-grained information on customer attributes and sources of variation for subsequent models and customer segmentation. In this paper, we apply 11 clustering methods to daily residential meter data. We evaluate their parameter settings and suitability based on 6 generic performance metrics and post-checking of the resulting clusters. Finally, we recommend suitable techniques and parameters based on the goal of discovering diverse daily load patterns among residential customers. To the authors’ knowledge, this paper is the first robust comparative review of clustering techniques applied to daily residential load shape time series in the power systems literature.
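To make the idea of clustering daily load shapes concrete, the sketch below splits one customer's meter series into days, normalizes each day by its total consumption so that only the shape remains, and groups the shapes with a plain k-means loop. The abstract does not name its 11 methods, so k-means, the normalization, the first-k initialization, and the names `daily_shapes` and `kmeans` are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def daily_shapes(readings: Sequence[float], per_day: int = 24) -> List[List[float]]:
    """Split a meter series into whole days and scale each day to sum to 1."""
    shapes = []
    for start in range(0, len(readings) - per_day + 1, per_day):
        day = list(readings[start:start + per_day])
        total = sum(day)
        if total > 0:  # days with no recorded consumption are skipped
            shapes.append([x / total for x in day])
    return shapes


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two load shapes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(shapes: List[List[float]], k: int, iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means; centers start at the first k shapes (a naive, assumed choice)."""
    centers = [list(s) for s in shapes[:k]]
    labels = [0] * len(shapes)
    for _ in range(iters):
        # Assignment step: nearest center for each daily shape.
        labels = [min(range(k), key=lambda c: sq_dist(s, centers[c])) for s in shapes]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(shapes, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical 4-readings-per-day series: late-peak and early-peak days alternate.
series = [1, 1, 4, 2, 4, 2, 1, 1, 2, 2, 8, 4, 8, 4, 2, 2]
labels, _ = kmeans(daily_shapes(series, per_day=4), k=2)
print(labels)
```

Because each day is normalized before clustering, two days with the same profile but different total consumption fall in the same cluster, which is the day-to-day distinction that customer-level aggregation would hide.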
The residential transformer population is a critical asset class that many electric utility companies have been attempting to manage proactively and effectively in order to reduce the unexpected failures and loss of transformer life that are often caused by overloading. Within the typical power asset portfolio, residential transformers are often large in population, are designed with the lowest reliability, lack loading data, and are susceptible to customer loading behaviors such as the adoption of distributed energy resources and electric vehicles. On the bright side, the availability of more residential operation data, along with advances in data analytics techniques, has provided a new path toward a statistical understanding of local residential transformer overloading risks. This research develops a new data-driven method that combines clustering analysis with the simulation of transformer temperature rise and insulation life loss to quantitatively and statistically assess the overloading risk of the residential transformer population in an area and to suggest proper risk management measures according to the assessment results. Case studies from an actual Canadian utility company are presented and discussed in detail to demonstrate the applicability and usefulness of the proposed method.
Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set, but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple two-stage protocol, GreeDI, that is easily implemented using MapReduce-style computations. We theoretically analyze our approach and show that, under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.
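One plausible reading of a two-stage distributed protocol for cardinality-constrained submodular maximization is sketched below: each partition runs the standard greedy algorithm on its own data, and a second greedy pass then selects from the union of the local picks. The abstract gives no algorithmic detail, so the set-cover objective, the hand-made partitions, and the names `coverage`, `greedy`, and `two_stage_greedy` are assumptions for illustration rather than the authors' exact procedure.

```python
from typing import Dict, FrozenSet, Iterable, List


def coverage(items: Dict[str, FrozenSet[int]], chosen: Iterable[str]) -> int:
    """Set-cover objective: number of distinct elements covered (monotone submodular)."""
    covered = set()
    for name in chosen:
        covered |= items[name]
    return len(covered)


def greedy(items: Dict[str, FrozenSet[int]], candidates: List[str], k: int) -> List[str]:
    """Standard greedy: pick up to k candidates by largest marginal gain (ties: name order)."""
    chosen: List[str] = []
    pool = sorted(candidates)
    for _ in range(min(k, len(pool))):
        base = coverage(items, chosen)
        best = max(pool, key=lambda c: coverage(items, chosen + [c]) - base)
        chosen.append(best)
        pool.remove(best)
    return chosen


def two_stage_greedy(items: Dict[str, FrozenSet[int]], partitions: List[List[str]], k: int) -> List[str]:
    """Stage 1: greedy inside every partition. Stage 2: greedy over the merged local picks."""
    # Stage 1 ("map"): each worker sees only its own partition.
    local_picks = [greedy(items, part, k) for part in partitions]
    # Stage 2 ("reduce"): a single machine sees at most k items per partition.
    merged = [name for picks in local_picks for name in picks]
    return greedy(items, merged, k)


# Hypothetical toy instance: each item covers a few integer elements.
ITEMS = {
    "a": frozenset({1, 2, 3}),
    "b": frozenset({3, 4}),
    "c": frozenset({5, 6, 7, 8}),
    "d": frozenset({1, 5}),
    "e": frozenset({9}),
}
PARTITIONS = [["a", "b", "e"], ["c", "d"]]

picked = two_stage_greedy(ITEMS, PARTITIONS, k=2)
print(picked, coverage(ITEMS, picked))
```

In this toy instance the second stage only has to examine four candidates instead of the whole data set, and it recovers the same subset that greedy over all five items would select; that agreement is a property of this example, not a general guarantee.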