AITopics

doi: 10.1109/TPAMI.2023.3279394

2311.15214

Country:

Asia > China > Shaanxi Province > Xi'an (0.05)
Asia > Middle East > Jordan (0.04)
South America > Paraguay > Asunción > Asunción (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Goschenhofer, Jann, Bischl, Bernd, Kira, Zsolt

ConstraintMatch for Semi-constrained Clustering

arXiv.org Machine Learning, Nov-26-2023

Constrained clustering allows the training of classification models using pairwise constraints only, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. While they perform well even in the absence of the true underlying class labels, constrained clustering models still require large amounts of binary constraint annotations for training. In this paper, we propose a semi-supervised context whereby a large amount of unconstrained data is available alongside a smaller set of constraints, and propose ConstraintMatch to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning using full labels, there are a number of challenges that prevent a naive application of the resulting methods in the constraint-based label setting. Therefore, we reason about and analyze these challenges, specifically 1) proposing a pseudo-constraining mechanism to overcome the confirmation bias, a major weakness of pseudo-labeling, 2) developing new methods for pseudo-labeling towards the selection of informative unconstrained samples, and 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses, which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks and provide analyses of its several components.
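As a rough illustration of what a pairwise loss and a pseudo-constraint could look like, the sketch below scores a pair of samples by the inner product of their soft cluster assignments and applies binary cross-entropy against a must-link or cannot-link label. The function names, the confidence threshold, and the thresholding rule itself are assumptions made for this example; they are not taken from the ConstraintMatch paper and may differ from its actual mechanism.

```python
import math


def pair_similarity(p_i, p_j):
    """Estimated probability that two samples share a cluster:
    the inner product of their soft cluster assignments."""
    return sum(a * b for a, b in zip(p_i, p_j))


def pairwise_constraint_loss(p_i, p_j, must_link, eps=1e-7):
    """Binary cross-entropy between the pair similarity and the constraint.

    must_link=True means the pair is annotated as "same cluster",
    must_link=False means "different clusters".
    """
    s = min(max(pair_similarity(p_i, p_j), eps), 1.0 - eps)
    return -math.log(s) if must_link else -math.log(1.0 - s)


def pseudo_constraint(p_i, p_j, threshold=0.9):
    """Hypothetical pseudo-constraining rule (an assumption, not the paper's):
    keep only confident pairs and return True (must-link),
    False (cannot-link), or None (pair is discarded)."""
    s = pair_similarity(p_i, p_j)
    if s >= threshold:
        return True
    if s <= 1.0 - threshold:
        return False
    return None


# Soft assignments over two clusters for three unconstrained samples.
p_a, p_b, p_c = [0.95, 0.05], [0.97, 0.03], [0.02, 0.98]
same = pseudo_constraint(p_a, p_b)       # confident must-link
different = pseudo_constraint(p_a, p_c)  # confident cannot-link
```

Scoring pairs rather than single samples means the loss never needs class labels, only the binary "same or different" relation that constraints provide.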

artificial intelligence, constraintmatch, machine learning

2311.15395

Country:

North America > United States (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Basu, Supratik, Choudhury, Jyotishka Ray, Paul, Debolina, Das, Swagatam

Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means

arXiv.org Machine Learning, Nov-26-2023

Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of clustering error, and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
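To make the robustness claim concrete, here is a minimal sketch of the generic Median-of-Means estimator for scalar data: shuffle the sample, split it into blocks, average each block, and take the median of the block averages. The function name, the block-splitting scheme, and the toy data are assumptions for illustration; the paper's clustering procedure builds on this idea but is not reproduced here.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Median of block-wise means over a random partition of the sample."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    # Strided slicing yields n_blocks disjoint blocks of near-equal size.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    return statistics.median([statistics.fmean(block) for block in blocks])


# 95 clean observations at 1.0 plus 5 gross outliers at 1000.0.
data = [1.0] * 95 + [1000.0] * 5
plain_mean = statistics.fmean(data)    # dragged up to 50.95 by the outliers
robust = median_of_means(data, 11)     # 1.0
```

With 11 blocks and only 5 outliers, at least 6 blocks are outlier-free, so the median of the block means is a clean block's mean regardless of how the shuffle falls. This is why the number of blocks is usually chosen to exceed twice the expected number of outliers.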

algorithm, artificial intelligence, machine learning

2311.15384

Country:

Europe > Austria > Vienna (0.14)
Asia > Middle East > Jordan (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Rajput, Vaishali, Mulay, Preeti, Raje, Rajeev

Speech-Based Blood Pressure Estimation with Enhanced Optimization and Incremental Clustering

Abstract: Blood Pressure (BP) estimation plays a pivotal role in diagnosing various health conditions, highlighting the need for innovative approaches to overcome conventional measurement challenges. Leveraging machine learning and speech signals, this study investigates accurate BP estimation with a focus on preprocessing, feature extraction, and real-time applications. An advanced clustering-based strategy, incorporating the k-means algorithm and the proposed Fact-Finding Instructor optimization algorithm, is introduced to enhance accuracy. The combined outcome of these clustering techniques enables robust BP estimation. Moreover, extending beyond these insights, this study delves into the dynamic realm of contemporary digital content consumption. Platforms like YouTube have emerged as influential spaces, presenting an array of videos that evoke diverse emotions. Within this context, this research investigates the interplay between YouTube videos and physiological responses, particularly Blood Pressure (BP) levels. By integrating advanced BP estimation techniques with the emotional dimensions of YouTube videos, this study enriches our understanding of how modern media environments intersect with health implications. Performance evaluation through metrics including Davies-Bouldin score, homogeneity, completeness, Jaccard similarity, Silhouette score, and Dunn's index demonstrates substantial enhancements, particularly with a 90% training percentage. This method offers promising potential for accurate BP estimation, contributing to the evolution of assessment methodologies and ultimately enhancing healthcare outcomes.

Introduction: According to Kaur et al. (2019), a human disease is a particular aberrant state that has a detrimental effect on an organism's overall structure or function but is not instantly caused by an external injury. According to Gautam et al. (2019), there are four main groups of diseases that affect humans: infectious diseases, deficiency disorders, hereditary diseases, and physiological diseases.

algorithm, artificial intelligence, machine learning

2311.15098

Country:

Asia > India (0.14)
North America > United States (0.14)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Energy > Oil & Gas (1.00)
Health & Medicine > Diagnostic Medicine > Vital Signs (0.85)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

A Hybrid SOM and K-means Model for Time Series Energy Consumption Clustering

Majidi, Farideh

Energy consumption analysis plays a pivotal role in addressing the challenges of sustainability and resource management. This paper introduces a novel approach to effectively cluster monthly energy consumption patterns by integrating two powerful techniques: Self-organizing maps and K-means clustering. The proposed method aims to exploit the benefits of both of these algorithms to enhance the accuracy and interpretability of clustering results for a dataset in which finding patterns is difficult. The main focus of this study is on a selection of time series energy consumption data from the Smart meters in London dataset. The data was preprocessed and reduced in dimensionality to capture essential temporal patterns while retaining their underlying structures. The SOM algorithm was utilized to extract the central representatives of the consumption patterns for each one of the houses over the course of each month, effectively reducing the dimensionality of the dataset and making it easier for analysis. Subsequently, the obtained SOM centroids were clustered using K-means, a popular centroid-based clustering technique. The experimental results demonstrated a significant silhouette score of 66%, indicating strong intra-cluster cohesion and inter-cluster separation which confirms the effectiveness of the proposed approach in the clustering task.
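For readers unfamiliar with the reported metric, the sketch below computes the mean silhouette coefficient from its standard definition. Each point scores (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The score lies in [-1, 1], so a value reported as 66% presumably corresponds to 0.66. The function name and toy points are assumptions for illustration and do not reproduce the paper's SOM and K-means pipeline.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself is excluded by len - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups score close to 1;
# labels that cut across the groups score below 0.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```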

algorithm, consumption pattern, time series data

2312.11475

Country:

Asia > Indonesia > Sulawesi > South Sulawesi > Makassar (0.05)
Asia > Middle East > Iran > Tehran Province > Tehran (0.04)
Asia > Vietnam (0.04)

Genre: Research Report > Promising Solution (0.66)

Industry: Energy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Wilkins-Caruana, Adrian, Bandara, Madhushi, Musial, Katarzyna, Catchpoole, Daniel, Kennedy, Paul J.

Inferring Actual Treatment Pathways from Patient Records

Treatment pathways are step-by-step plans outlining the recommended medical care for specific diseases; they get revised when different treatments are found to improve patient outcomes. Examining health records is an important part of this revision process, but inferring patients' actual treatments from health data is challenging due to complex event-coding schemes and the absence of pathway-related annotations. This study aims to infer the actual treatment steps for a particular patient group from administrative health records (AHR) - a common form of tabular healthcare data - and address several technique- and methodology-based gaps in treatment pathway-inference research. We introduce Defrag, a method for examining AHRs to infer the real-world treatment steps for a particular patient group. Defrag learns the semantic and temporal meaning of healthcare event sequences, allowing it to reliably infer treatment steps from complex healthcare data. To our knowledge, Defrag is the first pathway-inference method to utilise a neural network (NN), an approach made possible by a novel, self-supervised learning objective. We also developed a testing and validation framework for pathway inference, which we use to characterise and evaluate Defrag's pathway inference ability and compare against baselines. We demonstrate Defrag's effectiveness by identifying best-practice pathway fragments for breast cancer, lung cancer, and melanoma in public healthcare records. Additionally, we use synthetic data experiments to demonstrate the characteristics of the Defrag method, and to compare Defrag to several baselines where it significantly outperforms non-NN-based methods. Defrag significantly outperforms several existing pathway-inference methods and offers an innovative and effective approach for inferring treatment pathways from AHRs. Open-source code is provided to encourage further research in this area.

defrag, experiment, pathway

doi: 10.1016/j.jbi.2023.104554

2309.01897

Country:

North America > United States (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > Canada > Ontario (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Therapeutic Area > Oncology > Lung Cancer (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Leeney, William, McConville, Ryan

Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection

Graph Neural Networks (GNNs) have improved unsupervised community detection of clustered nodes due to their ability to encode the dual dimensionality of the connectivity and feature information spaces of graphs. Identifying the latent communities has many practical applications from social networks to genomics. Current benchmarks of real-world performance are confusing due to the variety of decisions influencing the evaluation of GNNs at this task. To address this, we propose a framework to establish a common evaluation protocol. We motivate and justify it by demonstrating the differences with and without the protocol. The W Randomness Coefficient is a metric proposed for assessing the consistency of algorithm rankings to quantify the reliability of results under the presence of randomness. We find that by ensuring the same evaluation criteria are followed, there may be significant differences from the reported performance of methods at this task, but a more complete evaluation and comparison of methods is possible.

algorithm, arxiv preprint arxiv, dgi dmon grace mvgrl selfgnn

2305.06026

Country:

North America > United States > Texas (0.05)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Nepal (0.04)

Genre: Research Report (0.40)

Industry:

Government > Regional Government (0.46)
Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)

Cao, Yueqi, Leung, Prudence, Monod, Anthea

$k$-Means Clustering for Persistent Homology

arXiv.org Machine Learning, Nov-25-2023

Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the $k$-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework. Additionally, we perform numerical experiments on various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that $k$-means clustering applied directly to persistence diagrams and measures outperforms clustering on their vectorized representations.

artificial intelligence, machine learning, persistence diagram

2210.10003

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan > Honshū > Chūgoku > Shimane Prefecture > Matsue (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.92)

Sun, Xiyang, Komaki, Fumiyasu

BHGNN-RT: Network embedding for directed heterogeneous graphs

arXiv.org Artificial Intelligence, Nov-24-2023

Networks are one of the most valuable data structures for modeling problems in the real world. However, the most recent node embedding strategies have focused on undirected graphs, with limited attention to directed graphs, especially directed heterogeneous graphs. In this study, we first investigated the network properties of directed heterogeneous graphs. Based on network analysis, we proposed an embedding method, a bidirectional heterogeneous graph neural network with random teleport (BHGNN-RT), for directed heterogeneous graphs, which leverages a bidirectional message-passing process and network heterogeneity. With the optimization of the teleport proportion, BHGNN-RT helps overcome the over-smoothing problem. Extensive experiments on various datasets were conducted to verify the efficacy and efficiency of BHGNN-RT. Furthermore, we investigated the effects of message components, model layer, and teleport proportion on model performance. The performance comparison with all other baselines illustrates that BHGNN-RT achieves state-of-the-art performance, outperforming the benchmark methods in both node classification and unsupervised clustering tasks.

bhgnn-rt, graph, heterogeneous graph

2311.14404

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Greece (0.04)
Europe > Germany > Bavaria > Lower Franconia > Würzburg (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.66)

arXiv.org Machine Learning, Nov-24-2023

DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder

Bing, Zhenshan, Meng, Yuan, Yun, Yuqi, Su, Hang, Su, Xiaojie, Huang, Kai, Knoll, Alois

Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva
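One way to see why a Dirichlet process prior removes the need to fix the number of clusters is its Chinese restaurant process view, sketched below: point number n joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter alpha. This shows only the prior over partitions. The function name and parameters are assumptions for illustration, and DIVA's memoized variational inference with birth and merge moves is not reproduced here.

```python
import random


def chinese_restaurant_process(n_points, alpha, seed=0):
    """Sample a partition of n_points from a CRP with concentration alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    assignments = []  # cluster index chosen for each point
    counts = []       # current size of each cluster
    for n in range(n_points):
        # Existing cluster k has weight counts[k]; a new cluster has weight
        # alpha. The weights sum to n + alpha.
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        for k, size in enumerate(counts):
            cumulative += size
            if r < cumulative:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            # r landed in the final alpha-sized slice: open a new cluster.
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments, counts


# The number of clusters is an outcome of the sampling, not an input.
assignments, counts = chinese_restaurant_process(200, alpha=1.0)
n_clusters = len(counts)
```

Larger values of alpha make new clusters more likely, so the expected number of clusters grows with both alpha and the amount of data.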

artificial intelligence, data mining, machine learning

2305.14067

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > North Macedonia > Skopje Statistical Region > Skopje Municipality > Skopje (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)