Not enough data to create a plot.
Try a different view from the menu above.
Park, Haesun
Randomized Algorithms for Symmetric Nonnegative Matrix Factorization
Hayashi, Koby, Aksoy, Sinan G., Ballard, Grey, Park, Haesun
We propose the first randomized algorithms for Symmetric Nonnegative Matrix Factorization (SymNMF). Nonnegative Matrix Factorization (NMF) is an important method in data analysis with applications to data visualization, text mining, feature learning, information fusion and more [28, 25, 46, 22, 11]. SymNMF is a variant of NMF where the input matrix is symmetric and the output low-rank approximation is also constrained to be symmetric [25, 49]. Applications of SymNMF include (hyper)graph clustering, image segmentation, and information fusion [44, 10, 19, 5, 6]. Several randomized algorithms for nonsymmetric NMF have been previously proposed and shown to be effective for dense and small sparse problems [43, 41, 13], but as far as we are aware there is no prior work on randomized algorithms for SymNMF. Our contributions in this work include a randomized algorithm for SymNMF we call "Low-rank Approximated Input SymNMF" (LAI-SymNMF), a randomized algorithm based on leverage score sampling for least squares problems we call LvS-SymNMF, novel theoretical analysis of leverage score sampling for the Nonnegative Least Squares problem and theoretical analysis of a hybrid sampling scheme for leverage score sampling. The rest of the paper is organized as follows. Section 2, which discusses background material including non-randomized SymNMF algorithms, reviews existing randomized NMF methods and other related work such as randomized methods for other low-rank matrix decompositions and tensor decompositions.
WellFactor: Patient Profiling using Integrative Embedding of Healthcare Data
Choi, Dongjin, Xiang, Andy, Ozturk, Ozgur, Shrestha, Deep, Drake, Barry, Haidarian, Hamid, Javed, Faizan, Park, Haesun
In the rapidly evolving healthcare industry, platforms now have access to not only traditional medical records, but also diverse data sets encompassing various patient interactions, such as those from healthcare web portals. To address this rich diversity of data, we introduce WellFactor: a method that derives patient profiles by integrating information from these sources. Central to our approach is the utilization of constrained low-rank approximation. WellFactor is optimized to handle the sparsity that is often inherent in healthcare data. Moreover, by incorporating task-specific label information, our method refines the embedding results, offering a more informed perspective on patients. One important feature of WellFactor is its ability to compute embeddings for new, previously unobserved patient data instantaneously, eliminating the need to revisit the entire data set or recomputing the embedding. Comprehensive evaluations on real-world healthcare data demonstrate WellFactor's effectiveness. It produces better results compared to other existing methods in classification performance, yields meaningful clustering of patients, and delivers consistent results in patient similarity searches and predictions.
Patient Clustering via Integrated Profiling of Clinical and Digital Data
Choi, Dongjin, Xiang, Andy, Ozturk, Ozgur, Shrestha, Deep, Drake, Barry, Haidarian, Hamid, Javed, Faizan, Park, Haesun
We introduce a novel profile-based patient clustering model designed for clinical data in healthcare. By utilizing a method grounded on constrained low-rank approximation, our model takes advantage of patients' clinical data and digital interaction data, including browsing and search, to construct patient profiles. As a result of the method, nonnegative embedding vectors are generated, serving as a low-dimensional representation of the patients. Our model was assessed using real-world patient data from a healthcare web portal, with a comprehensive evaluation approach which considered clustering and recommendation capabilities. In comparison to other baselines, our approach demonstrated superior performance in terms of clustering coherence and recommendation accuracy.
Hypergraph Random Walks, Laplacians, and Clustering
Hayashi, Koby, Aksoy, Sinan G., Park, Cheong Hee, Park, Haesun
We propose a flexible framework for clustering hypergraph-structured data based on recently proposed random walks utilizing edge-dependent vertex weights. When incorporating edge-dependent vertex weights (EDVW), a weight is associated with each vertex-hyperedge pair, yielding a weighted incidence matrix of the hypergraph. Such weightings have been utilized in term-document representations of text data sets. We explain how random walks with EDVW serve to construct different hypergraph Laplacian matrices, and then develop a suite of clustering methods that use these incidence matrices and Laplacians for hypergraph clustering. Using several data sets from real-life applications, we compare the performance of these clustering algorithms experimentally against a variety of existing hypergraph clustering methods. We show that the proposed methods produce higher-quality clusters and conclude by highlighting avenues for future work.
TASTE: Temporal and Static Tensor Factorization for Phenotyping Electronic Health Records
Afshar, Ardavan, Perros, Ioakeim, Park, Haesun, deFilippi, Christopher, Yan, Xiaowei, Stewart, Walter, Ho, Joyce, Sun, Jimeng
Phenotyping electronic health records (EHR) focuses on defining meaningful patient groups (e.g., heart failure group and diabetes group) and identifying the temporal evolution of patients in those groups. Tensor factorization has been an effective tool for phenotyping. Most of the existing works assume either a static patient representation with aggregate data or only model temporal data. However, real EHR data contain both temporal (e.g., longitudinal clinical visits) and static information (e.g., patient demographics), which are difficult to model simultaneously. In this paper, we propose Temporal And Static TEnsor factorization (TASTE) that jointly models both static and temporal information to extract phenotypes. TASTE combines the PARAFAC2 model with non-negative matrix factorization to model a temporal and a static tensor. To fit the proposed model, we transform the original problem into simpler ones which are optimally solved in an alternating fashion. For each of the sub-problems, our proposed mathematical reformulations lead to efficient sub-problem solvers. Comprehensive experiments on large EHR data from a heart failure (HF) study confirmed that TASTE is up to 14x faster than several baselines and the resulting phenotypes were confirmed to be clinically meaningful by a cardiologist. Using 80 phenotypes extracted by TASTE, a simple logistic regression can achieve the same level of area under the curve (AUC) for HF prediction compared to a deep learning model using recurrent neural networks (RNN) with 345 features.
CoDiNMF: Co-Clustering of Directed Graphs via NMF
Lim, Woosang (Georgia Institute of Technology) | Du, Rundong (Georgia Institute of Technology) | Park, Haesun (Georgia Institute of Technology)
Co-clustering computes clusters of data items and the related features concurrently, and it has been used in many applications such as community detection, product recommendation, computer vision, and pricing optimization. In this paper, we propose a new co-clustering method, called CoDiNMF, which improves the clustering quality and finds directional patterns among co-clusters by using multiple directed and undirected graphs. We design the objective function of co-clustering by using min-cut criterion combined with an additional term which controls the sum of net directional flow between different co-clusters. In addition, we show that a variant of Nonnegative Matrix Factorization (NMF) can solve the proposed objective function effectively. We run experiments on the US patents and BlogCatalog data sets whose ground truth have been known, and show that CoDiNMF improves clustering results compared to other co-clustering methods in terms of average F1 score, Rand index, and adjusted Rand index (ARI). Finally, we compare CoDiNMF and other co-clustering methods on the Wikipedia data set of philosophers, and we can find meaningful directional flow of influence among co-clusters of philosophers.
Hybrid Clustering based on Content and Connection Structure using Joint Nonnegative Matrix Factorization
Du, Rundong, Drake, Barry, Park, Haesun
We present a hybrid method for latent information discovery on the data sets containing both text content and connection structure based on constrained low rank approximation. The new method jointly optimizes the Nonnegative Matrix Factorization (NMF) objective function for text clustering and the Symmetric NMF (SymNMF) objective function for graph clustering. We propose an effective algorithm for the joint NMF objective function, based on a block coordinate descent (BCD) framework. The proposed hybrid method discovers content associations via latent connections found using SymNMF. The method can also be applied with a natural conversion of the problem when a hypergraph formulation is used or the content is associated with hypergraph edges. Experimental results show that by simultaneously utilizing both content and connection structure, our hybrid method produces higher quality clustering results compared to the other NMF clustering methods that uses content alone (standard NMF) or connection structure alone (SymNMF). We also present some interesting applications to several types of real world data such as citation recommendations of papers. The hybrid method proposed in this paper can also be applied to general data expressed with both feature space vectors and pairwise similarities and can be extended to the case with multiple feature spaces or multiple similarity measures.
PIVE: Per-Iteration Visualization Environment for Real-Time Interactions with Dimension Reduction and Clustering
Kim, Hannah (Georgia Institute of Technology) | Choo, Jaegul (Korea University) | Lee, Changhyun (Google Inc.) | Lee, Hanseung (Google Inc.) | Reddy, Chandan K. (Virginia Institute of Technology) | Park, Haesun (Georgia Institute of Technology)
One of the key advantages of visual analytics is its capability to leverage both humans's visual perception and the power of computing. A big obstacle in integrating machine learning with visual analytics is its high computing cost. To tackle this problem, this paper presents PIVE (Per-Iteration Visualization Environment) that supports real-time interactive visualization with machine learning. By immediately visualizing the intermediate results from algorithm iterations, PIVE enables users to quickly grasp insights and interact with the intermediate output, which then affects subsequent algorithm iterations. In addition, we propose a widely-applicable interaction methodology that allows efficient incorporation of user feedback into virtually any iterative computational method without introducing additional computational cost. We demonstrate the application of PIVE for various dimension reduction algorithms such as multidimensional scaling and t-SNE and clustering and topic modeling algorithms such as k-means and latent Dirichlet allocation.
Outlier Detection for Text Data : An Extended Version
Kannan, Ramakrishnan, Woo, Hyenkyun, Aggarwal, Charu C., Park, Haesun
The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. In such cases, it often becomes difficult to separate the outliers from the natural variations in the patterns in the underlying data. In this paper, we present a matrix factorization method, which is naturally able to distinguish the anomalies with the use of low rank approximations of the underlying data. Our iterative algorithm TONMF is based on block coordinate descent (BCD) framework. We define blocks over the term-document matrix such that the function becomes solvable. Given most recently updated values of other matrix blocks, we always update one block at a time to its optimal. Our approach has significant advantages over traditional methods for text outlier detection. Finally, we present experimental results illustrating the effectiveness of our method over competing methods.
MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
Kannan, Ramakrishnan, Ballard, Grey, Park, Haesun
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for $W$ and $H$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans for few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.