Clustering
Machine Learning Performance Analysis to Predict Stroke Based on Imbalanced Medical Dataset
Cerebral stroke, the second most substantial cause of death universally, has been a primary public health concern over the last few years. With the help of machine learning techniques, early detection of various stroke alerts is accessible, which can efficiently prevent or diminish the stroke. Medical datasets, however, are frequently unbalanced in their class label, with a tendency to poorly predict minority classes. In this paper, the potential risk factors for stroke are investigated. Moreover, four distinctive approaches are applied to improve the classification of the minority class in the imbalanced stroke dataset, which are the ensemble weight voting classifier, the Synthetic Minority Over-sampling Technique (SMOTE), Principal Component Analysis with K-Means Clustering (PCA-Kmeans), Focal Loss with the Deep Neural Network (DNN) and compare their performance. Through the analysis results, SMOTE and PCA-Kmeans with DNN-Focal Loss work best for the limited size of a large severe imbalanced dataset (e.g., Stroke dataset), which is 2-4 times outperform Kaggle's work.
A Dataset and Baseline Approach for Identifying Usage States from Non-Intrusive Power Sensing With MiDAS IoT-based Sensors
Muppasani, Bharath, Anand, Cheyyur Jaya, Appajigowda, Chinmayi, Srivastava, Biplav, Johri, Lokesh
Authors in (Rajapaksha and The growth in the deployment of Internet of Things (IoT) Bergmeir 2022) focused on providing rule based explanations sensors across different industries has opened several opportunities for a particular forecast, considering the global forecasting for the economy. One of them is the collection of IoT model as a black-box model trained across multivariate data that companies can use to build smarter solutions.
Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization
Chen, Zheng, Zhu, Lingwei, Yang, Ziwei, Matsubara, Takashi
Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial, and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and have complicated dependence, thereby posing a serious challenge to existing subtyping models for outputting sensible clustering. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem -- agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels, and, by further medical analysis, this refinement is proven to have a high correlation with cancer survival rates.
Out-of-Dynamics Imitation Learning from Multimodal Demonstrations
Qiu, Yiwen, Wu, Jialong, Cao, Zhangjie, Long, Mingsheng
Existing imitation learning works mainly assume that the demonstrator who collects demonstrations shares the same dynamics as the imitator. However, the assumption limits the usage of imitation learning, especially when collecting demonstrations for the imitator is difficult. In this paper, we study out-of-dynamics imitation learning (OOD-IL), which relaxes the assumption to that the demonstrator and the imitator have the same state spaces but could have different action spaces and dynamics. OOD-IL enables imitation learning to utilize demonstrations from a wide range of demonstrators but introduces a new challenge: some demonstrations cannot be achieved by the imitator due to the different dynamics. Prior works try to filter out such demonstrations by feasibility measurements, but ignore the fact that the demonstrations exhibit a multimodal distribution since the different demonstrators may take different policies in different dynamics. We develop a better transferability measurement to tackle this newly-emerged challenge. We firstly design a novel sequence-based contrastive clustering algorithm to cluster demonstrations from the same mode to avoid the mutual interference of demonstrations from different modes, and then learn the transferability of each demonstration with an adversarial-learning based algorithm in each cluster. Experiment results on several MuJoCo environments, a driving environment, and a simulated robot environment show that the proposed transferability measurement more accurately finds and down-weights non-transferable demonstrations and outperforms prior works on the final imitation learning performance. We show the videos of our experiment results on our website.
Online Correlation Clustering for Dynamic Complete Signed Graphs
In the correlation clustering problem for complete signed graphs, the input is a complete signed graph with edges weighted as $+1$ (denote recommendation to put this pair in the same cluster) or $-1$ (recommending to put this pair of vertices in separate clusters) and the target is to cluster the set of vertices such that the number of disagreements with these recommendations is minimized. In this paper, we consider the problem of correlation clustering for dynamic complete signed graphs where (1) a vertex can be added or deleted, and (2) the sign of an edge can be flipped. In the proposed online scheme, the offline approximation algorithm in [CALM+21] for correlation clustering is used. Up to the author's knowledge, this is the first online algorithm for dynamic graphs which allows a full set of graph editing operations. The proposed approach is rigorously analyzed and compared with a baseline method, which runs the original offline algorithm on each time step. Our results show that the dynamic operations have local effects on the neighboring vertices and we employ this locality to reduce the dependency of the running time in the Baseline to the summation of the degree of all vertices in $G_t$, the graph after applying the graph edit operation at time step $t$, to the summation of the degree of the changing vertices (e.g. two endpoints of an edge) and the number of clusters in the previous time step. Moreover, the required working memory is reduced to the square of the summation of the degree of the modified edge endpoints rather than the total number of vertices in the graph.
Salvatore (Sal) Magnone on LinkedIn: Breaking it Down: K-Means Clustering
We are pleased to announce, thanks to the acquisition of Latham BioPharm Group, we are accelerating in the Life Sciences sector with the creation of a Life Science Consulting division. Latham BioPharm Group founded in 1996 by Peter Latham, is one of the top Life Science Consulting specialty firms supporting Non-Dilutive Funding, Product Development, and Strategic decision-making for a wide range of sectors and clients. "Pete and I have agreed on a global goal, supported by significant financial resources, that will allow us to grow organically at more than 20% per year...." says Matthieu Courtecuisse, Founder and CEO of Sia Partners.
A Pipeline for Business Intelligence and Data-Driven Root Cause Analysis on Categorical Data
Thakar, Shubham, Kalbande, Dhananjay
Business intelligence (BI) is any knowledge derived from existing data that may be strategically applied within a business. Data mining is a technique or method for extracting BI from data using statistical data modeling. Finding relationships or correlations between the various data items that have been collected can be used to boost business performance or at the very least better comprehend what is going on. Root cause analysis (RCA) is discovering the root causes of problems or events to identify appropriate solutions. RCA can show why an event occurred and this can help in avoiding occurrences of an issue in the future. This paper proposes a new clustering + association rule mining pipeline for getting business insights from data. The results of this pipeline are in the form of association rules having consequents, antecedents, and various metrics to evaluate these rules. The results of this pipeline can help in anchoring important business decisions and can also be used by data scientists for updating existing models or while developing new ones. The occurrence of any event is explained by its antecedents in the generated rules. Hence this output can also help in data-driven root cause analysis.
Inv-SENnet: Invariant Self Expression Network for clustering under biased data
Singh, Ashutosh, Singh, Ashish, Masoomi, Aria, Imbiriba, Tales, Learned-Miller, Erik, Erdogmus, Deniz
Subspace clustering algorithms are used for understanding the cluster structure that explains the dataset well. These methods are extensively used for data-exploration tasks in various areas of Natural Sciences. However, most of these methods fail to handle unwanted biases in datasets. For datasets where a data sample represents multiple attributes, naively applying any clustering approach can result in undesired output. To this end, we propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces. Assuming we have information about the bias, we regularize the clustering method by adversarially learning to minimize the mutual information between the data and the unwanted attributes. Our experimental result on synthetic and real-world datasets demonstrate the effectiveness of our approach.
Doubly Inhomogeneous Reinforcement Learning
Hu, Liyuan, Li, Mengbing, Shi, Chengchun, Wu, Zhenke, Fryzlewicz, Piotr
This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, which would result in sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the ``best data chunks" that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties when compared to applying the clustering algorithm per time or the change point detection algorithm per subject. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.
Non-parametric Clustering of Multivariate Populations with Arbitrary Sizes
Bakam, Yves Ismaël Ngounou, Pommeret, Denys
We propose a clustering procedure to group K populations into subgroups with the same dependence structure. The method is adapted to paired population and can be used with panel data. It relies on the differences between orthogonal projection coefficients of the K density copulas estimated from the K populations. Each cluster is then constituted by populations having significantly similar dependence structures. A recent test statistic from Ngounou-Bakam and Pommeret (2022) is used to construct automatically such clusters. The procedure is data driven and depends on the asymptotic level of the test. We illustrate our clustering algorithm via numerical studies and through two real datasets: a panel of financial datasets and insurance dataset of losses and allocated loss adjustment expense.