Clustering
Manifold Learning of Four-dimensional Scanning Transmission Electron Microscopy
Li, Xin, Dyck, Ondrej E., Oxley, Mark P., Lupini, Andrew R., McInnes, Leland, Healy, John, Jesse, Stephen, Kalinin, Sergei V.
Four-dimensional scanning transmission electron microscopy (4D-STEM) of local atomic diffraction patterns is emerging as a powerful technique for probing intricate details of atomic structure and atomic electric fields. However, efficient processing and interpretation of large volumes of data remain challenging, especially for two-dimensional or light materials because the diffraction signal recorded on the pixelated arrays is weak. Here we employ data-driven manifold learning approaches for straightforward visualization and exploratory analysis of the 4D-STEM datasets, distilling real-space neighboring effects on atomically resolved deflection patterns from single-layer graphene, with single dopant atoms, as recorded on a pixelated detector. These extracted patterns relate to both individual atom sites and sublattice structures, effectively discriminating single dopant anomalies via multi-mode views. We believe manifold learning analysis will accelerate physics discoveries coupled between data-rich imaging mechanisms and materials such as ferroelectric, topological spin and van der Waals heterostructures.
Solar Enablement Initiative in Australia: Report on Efficiently Identifying Critical Cases for Evaluating the Voltage Impact of Large PV Investment
Shafiei, Mehdi, Liu, Aaron, Ledwich, Gerard, Walker, Geoffery, Morosini, Gian-Marco, Terry, Jack
The increasing quantity of PV generation connected to distribution networks is creating challenges in maintaining and controlling voltages in those distribution networks. Determining the maximum hosting capacity for new PV installations based on historical data is an essential task for distribution networks. Analyzing all historical data in large distribution networks is impractical. Therefore, this paper focuses on how to identify, in a time-efficient way, the critical cases for evaluating the voltage impacts of new large PV applications in medium voltage (MV) distribution networks. A systematic approach is proposed to cluster MV nodes based on electrical adjacency and time blocks. MV nodes are clustered along with the voltage magnitudes and time blocks. Critical cases of each cluster can be used for further power flow study. This method is scalable and can identify cases for evaluating PV investment on medium voltage networks in a time-efficient way.
Advanced machine learning informatics modeling using clinical and radiological imaging metrics for characterizing breast tumor characteristics with the OncotypeDX gene array
Jacobs, Michael A., Umbricht, Christopher, Parekh, Vishwa, Khouli, Riham El, Cope, Leslie, Macura, Katarzyna J., Harvey, Susan, Wolff, Antonio C.
Purpose: Optimal use of established and imaging methods, such as multiparametric magnetic resonance imaging (mpMRI), can simultaneously identify key functional parameters and provide unique imaging phenotypes of breast cancer. Therefore, we have developed and implemented a new machine-learning informatic system that integrates clinical variables, derived from imaging and clinical health records, to compare with the 21-gene array assay, OncotypeDX. Materials and methods: We tested our informatics modeling in a subset of patients (n=81) who had ER+ disease and underwent OncotypeDX gene expression and breast mpMRI testing. The machine-learning informatic method, termed Integrated Radiomic Informatic System (IRIS), was applied to the mpMRI, clinical and pathologic descriptors, as well as a gene array analysis. The IRIS method uses an advanced graph theoretic model and quantitative metrics. Summary statistics (mean and standard deviations) for the quantitative imaging parameters were obtained. Sensitivity, specificity, and Area Under the Curve (AUC) were calculated for the classification of the patients. Results: The OncotypeDX classification by the IRIS model had a sensitivity of 95% and a specificity of 89%, with an AUC of 0.92. The breast lesion size was larger for the high-risk group and smaller for both the low-risk and intermediate-risk groups. There were significant differences in PK-DCE and ADC map values in each group. The ADC map values for the high- and intermediate-risk groups were significantly lower than for the low-risk group. Conclusion: These initial studies provide a deeper understanding of imaging features and the molecular gene array OncotypeDX score. This insight provides the foundation to relate these imaging features to the assessment of treatment response for improved personalized medicine.
Scalable Bottom-up Subspace Clustering using FP-Trees for High Dimensional Data
Doan, Minh Tuan, Qi, Jianzhong, Rajasegarar, Sutharshan, Leckie, Christopher
Subspace clustering aims to find groups of similar objects (clusters) that exist in lower dimensional subspaces from a high dimensional dataset. It has a wide range of applications, such as analysing high dimensional sensor data or DNA sequences. However, existing algorithms have limitations in finding clusters in non-disjoint subspaces and scaling to large data, which impinge on their applicability in areas such as bioinformatics and the Internet of Things. We aim to address such limitations by proposing a subspace clustering algorithm using a bottom-up strategy. Our algorithm first searches for base clusters in low dimensional subspaces. It then forms clusters in higher-dimensional subspaces using these base clusters, which we formulate as a frequent pattern mining problem. This formulation enables efficient search for clusters in higher-dimensional subspaces, which is done using FP-trees. The proposed algorithm is evaluated against traditional bottom-up clustering algorithms and state-of-the-art subspace clustering algorithms. The experimental results show that the proposed algorithm produces clusters with high accuracy, and scales well to large volumes of data. We also demonstrate the algorithm's performance using real-life data, including ten genomic datasets and a car parking occupancy dataset.
Day-ahead time series forecasting: application to capacity planning
Leverger, Colin, Lemaire, Vincent, Malinowski, Simon, Guyet, Thomas, Rozé, Laurence
In the context of capacity planning, forecasting the evolution of server usage enables companies to better manage their computational resources. We address this problem by collecting key indicator time series and propose to forecast their evolution a day ahead. Our method assumes that the data is structured by a daily seasonality, but also that there is a typical evolution of indicators within a day. It then uses the combination of a clustering algorithm and Markov models to produce day-ahead forecasts. Our experiments on real datasets show that the data satisfies our assumption and that, in the case study, our method outperforms classical approaches (AR, Holt-Winters).
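The abstract does not specify the clustering algorithm or the Markov model, so the following is only a minimal sketch of how the two could be combined, under stated assumptions: each day is one fixed-length profile, the clustering is a small k-means with a deterministic initialisation, and the Markov model is a first-order chain over daily cluster labels. The function names (`kmeans`, `forecast_next_day`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two daily profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(day, centroids):
    """Index of the centroid closest to a daily profile."""
    return min(range(len(centroids)), key=lambda c: sq_dist(day, centroids[c]))

def kmeans(days, k, iters=20):
    """Tiny k-means; centroids start at the first k days (an assumption)."""
    centroids = [list(map(float, d)) for d in days[:k]]
    for _ in range(iters):
        labels = [nearest(d, centroids) for d in days]
        for c in range(k):
            members = [d for d, lab in zip(days, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(d, centroids) for d in days]
    return labels, centroids

def forecast_next_day(days, k):
    """Forecast tomorrow as the centroid of the most likely next cluster."""
    labels, centroids = kmeans(days, k)
    transitions = defaultdict(Counter)  # first-order transition counts
    for prev, nxt in zip(labels, labels[1:]):
        transitions[prev][nxt] += 1
    last = labels[-1]
    if transitions[last]:
        nxt = transitions[last].most_common(1)[0][0]
    else:
        nxt = last  # no observed transition: repeat the last day's type
    return centroids[nxt]

# Toy usage: quiet days and busy days alternate; the last observed day is quiet.
low, high = [1, 2, 1], [8, 9, 8]
days = [low, high, low, high, low, high, low]
forecast = forecast_next_day(days, k=2)
```

In this toy series the chain has only ever seen a quiet day followed by a busy day, so the forecast is the centroid of the busy cluster.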
Towards a Near Universal Time Series Data Mining Tool: Introducing the Matrix Profile
Yeh, Chin-Chia Michael (Doctor of Philosophy thesis, Graduate Program in Computer Science, University of California, Riverside, September 2018; Dr. Eamonn Keogh, Chairperson)
The last decade has seen a flurry of research on all-pairs-similarity-search (or, self-join) for text, DNA, and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. Surprisingly, however, little progress has been made on addressing this problem for time series subsequences. In this thesis, we have introduced a near universal time series data mining tool called matrix profile which solves the all-pairs-similarity-search problem and caches the output in an easy-to-access fashion. The proposed algorithm is not only parameter-free, exact and scalable, but also applicable for both single and multidimensional time series. By building time series data mining methods on top of matrix profile, many time series data mining tasks (e.g., motif discovery, discord discovery, shapelet discovery, semantic segmentation, and clustering) can be efficiently solved. Because the same matrix profile can be shared by a diverse set of time series data mining methods, matrix profile is a versatile, computed-once-use-many-times data structure. We demonstrate the utility of matrix profile for many time series data mining problems, including motif discovery, discord discovery, weakly labeled time series classification, and representation learning on domains as diverse as seismology, entomology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring, and medicine. We hope the matrix profile is not the end but the beginning of many more time series data mining projects.
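To make the idea of a self-join over time series subsequences concrete, here is a brute-force sketch. It is not the scalable algorithm the thesis proposes; it only illustrates the kind of output being cached: for each subsequence, the distance to its nearest non-trivial neighbour and that neighbour's position. The z-normalised Euclidean distance, the exclusion zone of half the subsequence length, and the function names are assumptions made for this illustration.

```python
import math

def znorm(seq):
    """Z-normalise a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]

def naive_matrix_profile(series, m):
    """Brute-force self-join over all length-m subsequences.

    Returns (profile, index): profile[i] is the distance from subsequence i
    to its nearest neighbour outside the exclusion zone, and index[i] is
    where that neighbour starts.
    """
    n_sub = len(series) - m + 1
    excl = m // 2  # exclusion zone half-width (an assumption)
    subs = [znorm(series[i:i + m]) for i in range(n_sub)]
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= excl:  # skip trivial matches near i
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < profile[i]:
                profile[i] = d
                index[i] = j
    return profile, index

# Toy usage: the shape 0, 1, 3, 1, 0 occurs at positions 0 and 9.
series = [0, 1, 3, 1, 0, 5, 5, 2, 7, 0, 1, 3, 1, 0, 9, 4, 6]
profile, index = naive_matrix_profile(series, m=5)
motif_start = min(range(len(profile)), key=profile.__getitem__)
```

The lowest value in `profile` marks a repeated pattern (a motif), while the highest value would mark the subsequence least like any other (a discord), which is one way a single cached structure can serve several tasks.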
Integrating Project Spatial Coordinates into Pavement Management Prioritization
Elbagalati, Omar, Hajij, Mustafa
To date, pavement management software products and studies on optimizing the prioritization of pavement maintenance and rehabilitation (M&R) have been mainly focused on three parameters: the pre-treatment pavement condition, the rehabilitation cost, and the available budget. Yet, the role of the candidate projects' spatial characteristics in the decision-making process has not been deeply considered. Such a limitation, predominately, allows the recommended M&R projects' schedule to involve simultaneously running but spatially scattered construction sites, which are very challenging to monitor and manage. This study introduces a novel approach to integrate pavement segments' spatial coordinates into the M&R prioritization analysis. The introduced approach aims at combining the pavement segments with converged spatial coordinates to be repaired in the same timeframe without compromising the allocated budget levels or the overall target Pavement Condition Index (PCI). Such a combination would result in minimizing the routing of crews, materials and other equipment among the construction sites and would provide better collaboration and communication between the pavement maintenance teams. Proposed herein is a novel spatial clustering algorithm that automatically finds the projects within certain budget and spatial constraints. The developed algorithm was successfully validated using 1,800 pavement maintenance projects from two real-life examples of the City of Milton, GA and the City of Tyler, TX.
Multilayer Graph Signal Clustering
Gheche, Mireille El, Frossard, Pascal
Multilayer graphs are commonly used to model relationships of different types between data points. In this paper, we propose a method for multilayer graph data clustering, which combines the different graph layers in the Riemann manifold of Semi-Positive Definite (SPD) graph Laplacian matrices. The resulting combination can be seen as a low-dimensional representation of the original data points. In addition, we consider that data can also carry signal values and not only graph information. We thus propose a new clustering solution for such hybrid data by training a neural network such that the transformed data points are orthonormal, and their distance on the aggregated graph is minimized. Experiments on synthetic and real data show that our method leads to a significant improvement with respect to state-of-the-art clustering algorithms for graph data.
A Fast Algorithm for Clustering High Dimensional Feature Vectors
Rahman, Shahina, Johnson, Valen E.
We propose an algorithm for clustering high dimensional data. If $P$ features for $N$ objects are represented in an $N\times P$ matrix ${\bf X}$, where $N\ll P$, the method is based on exploiting the cluster-dependent structure of the $N\times N$ matrix ${\bf XX}^T$. Computational burden thus depends primarily on $N$, the number of objects to be clustered, rather than $P$, the number of features that are measured. This makes the method particularly useful in high dimensional settings, where it is substantially faster than a number of other popular clustering algorithms. Aside from an upper bound on the number of potential clusters, the method is independent of tuning parameters. When compared to $16$ other clustering algorithms on $32$ genomic datasets with gold standards, we show that it provides the most accurate cluster configuration more than twice as often as its closest competitors. We illustrate the method on data taken from highly cited genomic studies.
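The paper's algorithm itself is not described in the abstract, so the sketch below illustrates only the underlying observation: once the $N\times N$ matrix ${\bf XX}^T$ is formed, pairwise geometry between objects can be recovered from it without touching the $P$ features again, via $d_{ij}^2 = G_{ii} + G_{jj} - 2G_{ij}$ with ${\bf G} = {\bf XX}^T$. The function names and the toy data are hypothetical.

```python
def gram_matrix(X):
    """N x N matrix of inner products between the rows of X (i.e. X X^T)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def sq_distances_from_gram(G):
    """Squared Euclidean distances recovered from the Gram matrix alone."""
    n = len(G)
    return [[G[i][i] + G[j][j] - 2 * G[i][j] for j in range(n)]
            for i in range(n)]

# Toy usage: N = 3 objects, P = 6 features; rows 0 and 1 are near-duplicates.
X = [
    [1, 0, 2, 0, 1, 3],
    [1, 0, 2, 0, 1, 2],
    [9, 8, 7, 9, 8, 9],
]
G = gram_matrix(X)               # the only step whose cost grows with P
D = sq_distances_from_gram(G)    # cost depends on N only
```

Any distance-based grouping applied to `D` then scales with the number of objects rather than the number of features, which is the regime ($N\ll P$) the abstract targets.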
Foundations of Comparison-Based Hierarchical Clustering
Ghoshdastidar, Debarghya, Perrot, Michaël, von Luxburg, Ulrike
We address the classical problem of hierarchical clustering, but in a framework where one does not have access to a representation of the objects or their pairwise similarities. Instead we assume that only a set of comparisons between objects is available, in terms of statements of the form "objects $i$ and $j$ are more similar than objects $k$ and $l$". Such a scenario is commonly encountered in crowdsourcing applications. The focus of this work is to develop comparison-based hierarchical clustering algorithms that do not rely on the principles of ordinal embedding. We propose comparison-based variants of average linkage clustering. We provide statistical guarantees for the proposed methods under a planted partition model for hierarchical clustering. We also empirically demonstrate the performance of the proposed methods on several datasets.
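As a rough illustration of clustering from comparisons alone, the sketch below turns quadruplet statements into a pairwise score and runs average linkage on that score. The scoring rule (a pair gains a point for each comparison it wins and loses a point for each it loses) is an assumption chosen for simplicity; it is not claimed to be one of the variants proposed in the paper, and the function names are hypothetical.

```python
from itertools import combinations

def pair_scores(n, comparisons):
    """Score each pair by comparisons won minus comparisons lost.

    Each comparison ((i, j), (k, l)) states that objects i and j are more
    similar than objects k and l.
    """
    score = [[0] * n for _ in range(n)]
    for (i, j), (k, l) in comparisons:
        score[i][j] += 1
        score[j][i] += 1
        score[k][l] -= 1
        score[l][k] -= 1
    return score

def average_linkage(n, score):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    highest average pairwise score. Returns the merged clusters in order."""
    def avg(a, b):
        return sum(score[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append(merged)
    return merges

# Toy usage: the positions are hidden from the algorithm and are used only
# to simulate answers to "is pair p more similar than pair q?".
positions = [0, 1, 10, 11]
pairs = list(combinations(range(4), 2))
gap = lambda p: abs(positions[p[0]] - positions[p[1]])
comparisons = [(p, q) for p in pairs for q in pairs if gap(p) < gap(q)]
merges = average_linkage(4, pair_scores(4, comparisons))
```

The clustering step never sees coordinates or numeric similarities, only the list of comparison statements, which matches the setting the abstract describes.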