bicluster
Evaluating the statistical significance of biclusters
Jason D. Lee, Yuekai Sun, Jonathan E. Taylor
Biclustering (also known as submatrix localization) is a problem of high practical relevance in exploratory analysis of high-dimensional data. We develop a framework for performing statistical inference on biclusters found by score-based algorithms. Since the bicluster was selected in a data-dependent manner by a biclustering or localization algorithm, this is a form of selective inference. Our framework gives exact (non-asymptotic) confidence intervals and p-values for the significance of the selected biclusters.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
Biclustering Using Message Passing
Biclustering is the analog of clustering on a bipartite graph. Existing methods infer biclusters through local search strategies that find one cluster at a time; a common technique is to update the row memberships based on the current column memberships, and vice versa. We propose a biclustering algorithm that maximizes a global objective function using message passing. Our objective function closely approximates a general likelihood function, separating a cluster size penalty term into row- and column-count penalties. Because we use a global optimization framework, our approach excels at resolving the overlaps between biclusters, which are important features of biclusters in practice. Moreover, Expectation-Maximization can be used to learn the model parameters if they are unknown. In simulations, we find that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap. Applied to three gene expression datasets, our method finds coregulated gene clusters that have high quality in terms of cluster size and density.
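The local search baseline mentioned in this abstract (update row memberships from the current column memberships, and vice versa) can be illustrated with a minimal sketch. This is not the message-passing algorithm the authors propose, and it is not ISA or LAS either. The function name `alternating_bicluster` and the thresholding rule (keep a row or column whose mean over the current selection exceeds the overall matrix mean) are assumptions made purely for illustration.

```python
import numpy as np

def alternating_bicluster(X, n_iter=20):
    """Find one bicluster by alternating row and column membership updates.

    Hypothetical rule (an assumption, not taken from the paper): a row is
    kept when its mean over the selected columns exceeds the overall matrix
    mean, and a column is kept by the same rule over the selected rows.
    """
    threshold = X.mean()
    rows = np.zeros(X.shape[0], dtype=bool)
    cols = np.ones(X.shape[1], dtype=bool)  # start from all columns
    for _ in range(n_iter):
        # Update row memberships from the current column memberships.
        new_rows = X[:, cols].mean(axis=1) > threshold
        if not new_rows.any():
            break  # nothing passes the threshold; keep the previous state
        # Update column memberships from the new row memberships.
        new_cols = X[new_rows].mean(axis=0) > threshold
        if not new_cols.any():
            break
        converged = (np.array_equal(new_rows, rows)
                     and np.array_equal(new_cols, cols))
        rows, cols = new_rows, new_cols
        if converged:
            break
    return rows, cols

# Toy matrix with one planted elevated block in rows 0-2, columns 0-3.
X = np.zeros((6, 8))
X[:3, :4] = 5.0
rows, cols = alternating_bicluster(X)
print(np.flatnonzero(rows), np.flatnonzero(cols))
```

Because each call returns a single bicluster, a scheme like this handles overlapping biclusters poorly, which is the limitation the abstract's global objective is meant to address.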
Exact and Heuristic Algorithms for Constrained Biclustering
Biclustering, also known as co-clustering or two-way clustering, simultaneously partitions the rows and columns of a data matrix to reveal submatrices with coherent patterns. Incorporating background knowledge into clustering to enhance solution quality and interpretability has attracted growing interest in mathematical optimization and machine learning research. Extending this paradigm to biclustering enables prior information to guide the joint grouping of rows and columns. We study constrained biclustering with pairwise constraints, namely must-link and cannot-link constraints, which specify whether objects should belong to the same or different biclusters. As a model problem, we address the constrained version of the k-densest disjoint biclique problem, which aims to identify k disjoint complete bipartite subgraphs (called bicliques) in a weighted complete bipartite graph, maximizing the total density while satisfying pairwise constraints. We propose both exact and heuristic algorithms. The exact approach is a tailored branch-and-cut algorithm based on a low-dimensional semidefinite programming (SDP) relaxation, strengthened with valid inequalities and solved in a cutting-plane fashion. Exploiting integer programming tools, a rounding scheme converts SDP solutions into feasible biclusterings at each node. For large-scale instances, we introduce an efficient heuristic based on the low-rank factorization of the SDP. The resulting nonlinear optimization problem is tackled with an augmented Lagrangian method, where the subproblem is solved by decomposition through a block-coordinate projected gradient algorithm. Extensive experiments on synthetic and real-world datasets show that the exact method significantly outperforms general-purpose solvers, while the heuristic achieves high-quality solutions efficiently on large instances.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.92)
Orthogonal Factor-Based Biclustering Algorithm (BCBOF) for High-Dimensional Data and Its Application in Stock Trend Prediction
Biclustering is an effective technique in data mining and pattern recognition. Biclustering algorithms based on traditional clustering face two fundamental limitations when processing high-dimensional data: (1) the distance concentration phenomenon in high-dimensional spaces leads to data sparsity, rendering similarity measures ineffective; (2) mainstream linear dimensionality reduction methods disrupt critical local structural patterns. To apply biclustering to high-dimensional datasets, we propose an orthogonal factor-based biclustering algorithm (BCBOF). First, we constructed orthogonal factors in the vector space of the high-dimensional dataset. Then, we performed clustering using the coordinates of the original data in the orthogonal subspace as clustering targets. Finally, we obtained biclustering results of the original dataset. Since dimensionality reduction was applied before clustering, the proposed algorithm effectively mitigated the data sparsity problem caused by high dimensionality. Additionally, we applied this biclustering algorithm to stock technical indicator combinations and stock price trend prediction. Biclustering results were transformed into fuzzy rules, and we incorporated profit-preserving and stop-loss rules into the rule set, ultimately forming a fuzzy inference system for stock price trend predictions and trading signals. The results showed that our algorithm outperformed other biclustering techniques. To validate the effectiveness of the fuzzy inference system, we conducted virtual trading experiments using historical data from 10 A-share stocks. The experimental results showed that the generated trading strategies yielded higher returns for investors. Introduction: Since its initial proposal by Cheng and Church [1], biclustering has evolved into a sophisticated analytical approach.
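The two middle steps described in this abstract (build orthogonal factors, then cluster the data by its coordinates in the orthogonal subspace) can be sketched as follows. The abstract does not say how BCBOF constructs its factors, so this sketch swaps in the top right singular vectors of the centered data as a stand-in orthonormal basis, and uses plain k-means for the clustering step. Both choices, and the helper names `orthogonal_coordinates` and `kmeans`, are assumptions for illustration only. The final step that maps clusters back to biclusters is not shown.

```python
import numpy as np

def orthogonal_coordinates(X, n_factors):
    """Coordinates of the rows of X in an orthonormal basis.

    Stand-in construction (an assumption): the basis is the top right
    singular vectors of the centered data, not BCBOF's own factors.
    """
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    factors = vt[:n_factors].T  # (n_features, n_factors), orthonormal columns
    return Xc @ factors, factors

def kmeans(Z, k, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy data: two groups of rows in a 20-dimensional space.
X = np.vstack([np.zeros((5, 20)), np.full((5, 20), 10.0)])
Z, factors = orthogonal_coordinates(X, n_factors=2)
labels = kmeans(Z, k=2)
print(labels)
```

Clustering in the low-dimensional coordinates `Z` rather than in the original 20 columns is what the abstract credits with mitigating the sparsity caused by high dimensionality.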
- Asia > China > Shanghai > Shanghai (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > United States > New York (0.04)
RelDenClu: A Relative Density based Biclustering Method for identifying non-linear feature relations
Jain, Namita, Ghosh, Susmita, Murthy, C. A.
Existing biclustering algorithms for finding feature-relation-based biclusters often depend on assumptions such as monotonicity or linearity. Though a few algorithms overcome this problem by using density-based methods, they tend to miss many biclusters because they use global criteria for identifying dense regions. The proposed method, RelDenClu, uses the local variations in marginal and joint densities for each pair of features to find the subset of observations that forms the basis of the relation between them. It then finds the set of features connected by a common set of observations, resulting in a bicluster. To show the effectiveness of the proposed methodology, experimentation has been carried out on fifteen types of simulated datasets. Further, it has been applied to six real-life datasets. For three of these real-life datasets, the proposed method is used for unsupervised learning, while for the other three it is used as an aid to supervised learning. For all the datasets, the performance of the proposed method is compared with that of seven different state-of-the-art algorithms, and the proposed algorithm is seen to produce better results. The efficacy of the proposed algorithm is also seen in its use on a COVID-19 dataset for identifying some features (genetic, demographic and others) that are likely to affect the spread of COVID-19.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > India > West Bengal > Kolkata (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Research Report > Experimental Study (0.47)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
- Health & Medicine > Therapeutic Area > Immunology (1.00)
- Health & Medicine > Epidemiology (1.00)
Tensor Biclustering
Soheil Feizi, Hamid Javadi, David Tse
Consider a dataset where data is collected on multiple features of multiple individuals over multiple times. This type of data can be represented as a three-dimensional individual/feature/time tensor and has become increasingly prominent in various areas of science. The tensor biclustering problem seeks a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across different individuals or different features. We study the information-theoretic limit of this problem under a generative model. Moreover, we propose an efficient spectral algorithm to solve the tensor biclustering problem and analyze its achievability bound in an asymptotic regime. Finally, we show the efficiency of our proposed method in several synthetic and real datasets.
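The problem statement in this abstract can be made concrete with a small check. The sketch stores the data as an individual/feature/time array and measures the numerical rank of the time trajectories in a candidate bicluster. It only illustrates the definition (trajectories in a low-dimensional subspace, with per-pair scalings allowed) and is not the spectral algorithm the authors propose. The function name `trajectory_rank`, the tolerance, and the toy planted block are assumptions.

```python
import numpy as np

def trajectory_rank(T, individuals, features, tol=1e-8):
    """Numerical rank of the trajectories T[i, j, :] over the selected pairs."""
    block = T[np.ix_(individuals, features)]      # (|I|, |J|, n_times)
    trajectories = block.reshape(-1, T.shape[2])  # one row per (i, j) pair
    s = np.linalg.svd(trajectories, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > tol * s[0]))

# Toy tensor: 6 individuals x 5 features x 4 time points of noise, with a
# planted block whose trajectories are all scaled copies of one vector u.
rng = np.random.default_rng(0)
T = rng.normal(size=(6, 5, 4))
u = np.array([1.0, 2.0, 3.0, 4.0])
scales = rng.uniform(1.0, 2.0, size=(3, 2))  # a different scaling per pair
T[np.ix_([0, 1, 2], [0, 1])] = scales[:, :, None] * u

print(trajectory_rank(T, [0, 1, 2], [0, 1]))         # planted block
print(trajectory_rank(T, list(range(6)), list(range(5))))  # whole tensor
```

The planted block has trajectories spanning a one-dimensional subspace even though every (individual, feature) pair is scaled differently, which is the structure a tensor biclustering method is meant to recover.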
- North America > United States > California > Santa Clara County > Palo Alto (0.05)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
HBIC: A Biclustering Algorithm for Heterogeneous Datasets
José-García, Adán, Jacques, Julie, Chauvet, Clément, Sobanski, Vincent, Dhaenens, Clarisse
Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the initial stage, several candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigated the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. The evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery, leading to a better understanding of complex underlying data structures.
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.87)
UnPaSt: unsupervised patient stratification by differentially expressed biclusters in omics data
Hartung, Michael, Maier, Andreas, Delgado-Chaves, Fernando, Burankova, Yuliya, Isaeva, Olga I., Patroni, Fábio Malta de Sá, He, Daniel, Shannon, Casey, Kaufmann, Katharina, Lohmann, Jens, Savchik, Alexey, Hartebrodt, Anne, Chervontseva, Zoe, Firoozbakht, Farzaneh, Probul, Niklas, Zotova, Evgenia, Tsoy, Olga, Blumenthal, David B., Ester, Martin, Laske, Tanja, Baumbach, Jan, Zolotareva, Olga
Most complex diseases, including cancer and non-malignant diseases like asthma, have distinct molecular subtypes that require distinct clinical approaches. However, existing computational patient stratification methods have been benchmarked almost exclusively on cancer omics data and only perform well when mutually exclusive subtypes can be characterized by many biomarkers. Here, we contribute a large-scale evaluation, quantitatively exploring the power of 22 unsupervised patient stratification methods using both simulated and real transcriptome data. Building on this evaluation, we developed UnPaSt (https://apps.cosy.bio/unpast/), which optimizes unsupervised patient stratification and works even with only a limited number of subtype-predictive biomarkers. We evaluated all 23 methods on real-world breast cancer and asthma transcriptomics data. Although many methods reliably detected major breast cancer subtypes, only a few identified Th2-high asthma, and UnPaSt significantly outperformed its closest competitors in both test datasets. Essentially, we showed that UnPaSt can detect many biologically insightful and reproducible patterns in omics datasets.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.57)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.68)