AITopics

2209.10423

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Scholbeck, Christian A., Funk, Henri, Casalicchio, Giuseppe

Algorithm-Agnostic Interpretations for Clustering

arXiv.org Artificial Intelligence Sep-21-2022

A clustering outcome for high-dimensional data is typically interpreted via post-processing, involving dimension reduction and subsequent visualization. This destroys the meaning of the data and obfuscates interpretations. We propose algorithm-agnostic interpretation methods to explain clustering outcomes in reduced dimensions while preserving the integrity of the data. The permutation feature importance for clustering represents a general framework based on shuffling feature values and measuring changes in cluster assignments through custom score functions. The individual conditional expectation for clustering indicates observation-wise changes in the cluster assignment due to changes in the data. The partial dependence for clustering evaluates average changes in cluster assignments for the entire feature space. All methods can be used with any clustering algorithm able to reassign instances through soft or hard labels. In contrast to common post-processing methods such as principal component analysis, the introduced methods maintain the original structure of the features.
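The shuffling idea behind permutation feature importance for clustering can be illustrated with a minimal sketch. It assumes a hard nearest-centroid reassignment rule and uses the fraction of changed labels as the score; both are illustrative choices, not the score functions defined in the paper. The names `assign` and `permutation_importance` are hypothetical.

```python
import random

def assign(points, centroids):
    """Hard-assign each point to its nearest centroid (squared Euclidean distance)."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def permutation_importance(points, centroids, feature, n_repeats=20, seed=0):
    """Mean fraction of points whose label changes when one feature is shuffled.

    Assumption: "fraction of changed labels" stands in for a custom score function.
    """
    rng = random.Random(seed)
    base = assign(points, centroids)
    changed = 0.0
    for _ in range(n_repeats):
        column = [p[feature] for p in points]
        rng.shuffle(column)  # break the link between this feature and the rest
        shuffled = [p[:feature] + (v,) + p[feature + 1:]
                    for p, v in zip(points, column)]
        new = assign(shuffled, centroids)
        changed += sum(a != b for a, b in zip(base, new)) / len(points)
    return changed / n_repeats

# Toy data: the two clusters differ along feature 0 only; feature 1 is noise.
points = [(0.0, 0.3), (0.2, 0.9), (0.1, 0.5), (5.0, 0.4), (5.2, 0.8), (4.9, 0.1)]
centroids = [(0.1, 0.5), (5.0, 0.5)]
imp0 = permutation_importance(points, centroids, feature=0)
imp1 = permutation_importance(points, centroids, feature=1)
```

In this toy setup, shuffling feature 1 cannot change any assignment because both centroids share the same value on that feature, so its importance is zero, while shuffling feature 0 moves points across clusters.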

artificial intelligence, data mining, machine learning, (18 more...)

2209.10578

Country:

North America > United States > Wisconsin (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.65)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

arXiv.org Artificial Intelligence Sep-21-2022

Attributed Network Embedding Model for Exposing COVID-19 Spread Trajectory Archetypes

Ma, Junwei, Li, Bo, Li, Qingchun, Fan, Chao, Mostafavi, Ali

The spread of COVID-19 revealed that transmission risk patterns are not homogeneous across different cities and communities, and various heterogeneous features can influence the spread trajectories. Hence, for predictive pandemic monitoring, it is essential to explore latent heterogeneous features in cities and communities that distinguish their specific pandemic spread trajectories. To this end, this study creates a network embedding model capturing cross-county visitation networks, as well as heterogeneous features to uncover clusters of counties in the United States based on their pandemic spread transmission trajectories. First, we collected and computed location intelligence features from 2,787 counties from March 3 to June 29, 2020 (initial wave). Second, we constructed a human visitation network, which incorporated county features as node attributes, and visits between counties as network edges. Our attributed network embeddings approach integrates both topological characteristics of the cross-county visitation network, as well as heterogeneous features. We conducted clustering analysis on the attributed network embeddings to reveal four archetypes of spread risk trajectories corresponding to four clusters of counties. Subsequently, we identified four features as important features underlying the distinctive transmission risk patterns among the archetypes. The attributed network embedding approach and the findings identify and explain the non-homogeneous pandemic risk trajectories across counties for predictive pandemic monitoring. The study also contributes to data-driven and deep learning-based approaches for pandemic analytics to complement the standard epidemiological models for policy analysis in pandemics.

archetype, data mining, machine learning, (19 more...)

2209.09448

Country:

North America > United States > Arkansas > Cross County (0.46)
North America > United States > Texas > Brazos County > College Station (0.14)
South America > Brazil (0.04)
(12 more...)

Genre: Research Report > New Finding (0.47)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

TECM: Transfer Learning-based Evidential C-Means Clustering

Jiao, Lianmeng, Wang, Feng, Liu, Zhun-ga, Pan, Quan

As a representative evidential clustering algorithm, evidential c-means (ECM) provides a deeper insight into the data by allowing an object to belong not only to a single class, but also to any subset of a collection of classes, which generalizes the hard, fuzzy, possibilistic, and rough partitions. However, compared with other partition-based algorithms, ECM must estimate numerous additional parameters, and thus insufficient or contaminated data will have a greater influence on its clustering performance. To solve this problem, in this study, a transfer learning-based ECM (TECM) algorithm is proposed by introducing the strategy of transfer learning into the process of evidential clustering. The TECM objective function is constructed by integrating the knowledge learned from the source domain with the data in the target domain to cluster the target data. Subsequently, an alternating optimization scheme is developed to solve the constrained objective function of the TECM algorithm. The proposed TECM algorithm is applicable to cases where the source and target domains have the same or different numbers of clusters. A series of experiments were conducted on both synthetic and real datasets, and the experimental results demonstrated the effectiveness of the proposed TECM algorithm compared to ECM and other representative multitask or transfer-clustering algorithms.

algorithm, artificial intelligence, machine learning, (17 more...)

doi: 10.1016/j.knosys.2022.109937

2112.10152

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
South America > Uruguay > Maldonado > Maldonado (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Tian, Jingguang, Hu, Xinhui, Xu, Xinkang

The Royalflush System for VoxCeleb Speaker Recognition Challenge 2022

In this technical report, we describe the Royalflush submissions for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our submissions cover track 1 (supervised speaker verification) and track 3 (semi-supervised speaker verification). For track 1, we develop a powerful U-Net-based speaker embedding extractor with a symmetric architecture. The proposed system achieves 2.06% in EER and 0.1293 in MinDCF on the validation set. Compared with the state-of-the-art ECAPA-TDNN, it obtains a relative improvement of 20.7% in EER and 22.70% in MinDCF. For track 3, we employ the joint training of source domain supervision and target domain self-supervision to obtain a speaker embedding extractor. The subsequent clustering process can obtain target domain pseudo-speaker labels. We adapt the speaker embedding extractor using all source and target domain data in a supervised manner, so that it can fully leverage information from both domains. Moreover, clustering and supervised domain adaptation can be repeated until the performance converges on the validation set. Our final submission is a fusion of 10 models and achieves 7.75% EER and 0.3517 MinDCF on the validation set.

artificial intelligence, machine learning, pattern recognition, (13 more...)

2209.09010

Country: Asia > China > Zhejiang Province (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Explainable Clustering via Exemplars: Complexity and Efficient Approximation Algorithms

Davidson, Ian, Livanos, Michael, Gourru, Antoine, Walker, Peter, Velcin, Julien, Ravi, S. S.

Explainable AI (XAI) is an important developing area but remains relatively understudied for clustering. We propose an explainable-by-design clustering approach that not only finds clusters but also exemplars to explain each cluster. The use of exemplars for understanding is supported by the exemplar-based school of concept definition in psychology. We show that finding a small set of exemplars to explain even a single cluster is computationally intractable; hence, the overall problem is challenging. We develop an approximation algorithm that provides provable performance guarantees with respect to clustering quality as well as the number of exemplars used. This basic algorithm explains all the instances in every cluster, whilst another approximation algorithm uses a bounded number of exemplars to allow simpler explanations and provably covers a large fraction of all the instances. Experimental results show that our work is useful in domains involving difficult-to-understand deep embeddings of images and text.
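To make the notion of covering a cluster with a small set of exemplars concrete, here is a generic greedy set-cover sketch. It is not the authors' algorithm: it assumes that an exemplar "explains" every point within a fixed radius, and the function name `greedy_exemplars` and the `radius` parameter are hypothetical.

```python
def greedy_exemplars(points, radius):
    """Pick exemplar indices greedily until every point lies within `radius` of one.

    Assumption: an exemplar explains all points within Euclidean distance `radius`.
    """
    def covers(e, p):
        return sum((a - b) ** 2 for a, b in zip(e, p)) <= radius ** 2

    uncovered = set(range(len(points)))
    exemplars = []
    while uncovered:
        # Choose the point that explains the most still-uncovered points.
        best = max(
            range(len(points)),
            key=lambda i: sum(1 for j in uncovered if covers(points[i], points[j])),
        )
        exemplars.append(best)
        uncovered -= {j for j in uncovered if covers(points[best], points[j])}
    return exemplars

# One-dimensional toy cluster with an outlier at 10.0.
cluster = [(0.0,), (1.0,), (2.0,), (10.0,)]
chosen = greedy_exemplars(cluster, radius=1.0)
```

The loop always terminates because every point covers itself. In the toy cluster the middle point explains its two neighbours, and the outlier needs its own exemplar.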

exemplar, machine learning, natural language, (16 more...)

2209.09670

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Yolo County > Davis (0.04)
(4 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Government > Regional Government > North America Government > United States Government (0.67)
Government > Regional Government > Europe Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures

Jeon, Hyeon, Aupetit, Michael, Shin, DongHwa, Cho, Aeri, Park, Seokhyeon, Seo, Jinwook

We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground truth clusters, based on the assumption that each class forms a single, clearly separated cluster. However, because this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can be used to quantify CLM within the same dataset to evaluate its different clusterings but are not designed to compare clusterings of different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes to generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating CLM of benchmark datasets before conducting external validation.
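For reference, the standard within-dataset Calinski-Harabasz index that the abstract builds on can be computed as the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. The sketch below implements only this classical index, not the between-dataset generalization proposed in the paper; the function name is illustrative, and it assumes at least two clusters and nonzero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """Classical Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)

    def mean(ps):
        return tuple(sum(p[d] for p in ps) / len(ps) for d in range(dim))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    # B: size-weighted squared distances of cluster centroids to the overall centroid.
    between = sum(len(ps) * sq_dist(mean(ps), overall) for ps in clusters.values())
    # W: squared distances of points to their own cluster centroid.
    within = sum(sq_dist(p, mean(ps)) for ps in clusters.values() for p in ps)
    return (between / (k - 1)) / (within / (n - k))

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(data, [0, 0, 1, 1])  # labels match the visible groups
bad = calinski_harabasz(data, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that matches the two well-separated groups scores far higher than one that cuts across them, which is the within-dataset behaviour the index is designed for.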

artificial intelligence, dataset, machine learning, (18 more...)

2209.10042

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.05)
Asia > South Korea > Seoul > Seoul (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Puga, Clara, Niemann, Uli, Schlee, Winfried, Spiliopoulou, Myra

A cost-based multi-layer network approach for the discovery of patient phenotypes

Clinical records frequently include assessments of the characteristics of patients, which may include the completion of various questionnaires. These questionnaires provide a variety of perspectives on a patient's current state of well-being. Not only is it critical to capture the heterogeneity given by these perspectives, but there is also a growing demand for developing cost-effective technologies for clinical phenotyping. Filling out many questionnaires may be a strain for the patients and therefore costly. In this work, we propose COBALT -- a cost-based layer selector model for detecting phenotypes using a community detection approach. Our goal is to minimize the number of features used to build these phenotypes while preserving their quality. We test our model using questionnaire data from chronic tinnitus patients and represent the data in a multi-layer network structure. The model is then evaluated by predicting post-treatment data using baseline features (age, gender, and pre-treatment data) as well as the identified phenotypes as a feature. For some post-treatment variables, predictors using phenotypes from COBALT as features outperformed those using phenotypes detected by traditional clustering methods. Moreover, using phenotype data to predict post-treatment data proved beneficial in comparison with predictors that were solely trained with baseline features.

data mining, machine learning, node, (19 more...)

2209.09032

Country:

Europe > Netherlands > South Holland > Leiden (0.05)
Europe > Germany > Bavaria > Regensburg (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area (0.93)
Health & Medicine > Health Care Technology > Medical Record (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding

Blum, Hermann, Müller, Marcus G., Gawel, Abel, Siegwart, Roland, Cadena, Cesar

In order to operate in human environments, a robot's semantic perception has to overcome open-world challenges such as novel objects and domain gaps. Autonomous deployment to such environments therefore requires robots to update their knowledge and learn without supervision. We investigate how a robot can autonomously discover novel semantic classes and improve accuracy on known classes when exploring an unknown environment. To this end, we develop a general framework for mapping and clustering that we then use to generate a self-supervised learning signal to update a semantic segmentation model. In particular, we show how clustering parameters can be optimized during deployment and that fusion of multiple observation modalities improves novel object discovery compared to prior work. Models, data, and implementations can be found at github.com/hermannsblum/scim.

machine learning, natural language, prediction, (17 more...)

2206.10670

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

arXiv.org Artificial Intelligence Sep-19-2022

DADApy: Distance-based Analysis of DAta-manifolds in Python

Glielmo, Aldo, Macocco, Iuri, Doimo, Diego, Carli, Matteo, Zeni, Claudio, Wild, Romina, d'Errico, Maria, Rodriguez, Alex, Laio, Alessandro

DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.

artificial intelligence, dadapy, machine learning, (17 more...)

doi: 10.1016/j.patter.2022.100589

2205.03373

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)