AITopics

2301.13066

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Iran > East Azerbaijan Province > Tabriz (0.04)
Asia > Azerbaijan (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Information Technology > Services (1.00)
Government > Regional Government > North America Government > United States Government (0.46)
Leisure & Entertainment > Sports > Soccer (0.35)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Yasir, Mustafa, Palowitch, John, Tsitsulin, Anton, Tran-Thanh, Long, Perozzi, Bryan

Examining the Effects of Degree Distribution and Homophily in Graph Learning Models

Despite a surge in interest in GNN development, homogeneity in benchmarking datasets still presents a fundamental issue to GNN research. GraphWorld is a recent solution which uses the Stochastic Block Model (SBM) to generate diverse populations of synthetic graphs for benchmarking any GNN task. Despite its success, the SBM imposed fundamental limitations on the kinds of graph structure GraphWorld could create. In this work we examine how two additional synthetic graph generators can improve GraphWorld's evaluation; LFR, a well-established model in the graph clustering literature and CABAM, a recent adaptation of the Barabasi-Albert model tailored for GNN benchmarking. By integrating these generators, we significantly expand the coverage of graph space within the GraphWorld framework while preserving key graph properties observed in real-world networks. To demonstrate their effectiveness, we generate 300,000 graphs to benchmark 11 GNN models on a node classification task. We find GNN performance variations in response to homophily, degree distribution and feature signal. Based on these findings, we classify models by their sensitivity to the new generators under these properties. Additionally, we release the extensions made to GraphWorld on the GitHub repository, offering further evaluation of GNN performance on new graphs.

artificial intelligence, graph, machine learning, (16 more...)

2307.08881

Country:

North America > United States > New York > Nassau County > Long Beach (0.05)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Elhussein, Ahmed, Gursoy, Gamze

Privacy-preserving patient clustering for personalized federated learning

Federated Learning (FL) is a machine learning framework that enables multiple organizations to train a model without sharing their data with a central server. However, it experiences significant performance degradation if the data is non-identically independently distributed (non-IID). This is a problem in medical settings, where variations in the patient population contribute significantly to distribution differences across hospitals. Personalized FL addresses this issue by accounting for site-specific distribution differences. Clustered FL, a Personalized FL variant, was used to address this problem by clustering patients into groups across hospitals and training separate models on each group. However, privacy concerns remained as a challenge as the clustering process requires exchange of patient-level information. This was previously solved by forming clusters using aggregated data, which led to inaccurate groups and performance degradation. In this study, we propose Privacy-preserving Community-Based Federated machine Learning (PCBFL), a novel Clustered FL framework that can cluster patients using patient-level data while protecting privacy. PCBFL uses Secure Multiparty Computation, a cryptographic technique, to securely calculate patient-level similarity scores across hospitals. We then evaluate PCBFL by training a federated mortality prediction model using 20 sites from the eICU dataset. We compare the performance gain from PCBFL against traditional and existing Clustered FL frameworks. Our results show that PCBFL successfully forms clinically meaningful cohorts of low, medium, and high-risk patients. PCBFL outperforms traditional and existing Clustered FL frameworks with an average AUC improvement of 4.3% and AUPRC improvement of 7.8%.

artificial intelligence, data mining, machine learning, (18 more...)

2307.08847

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Virginia (0.04)
North America > United States > New York > Richmond County > New York City (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Technology (0.93)
Health & Medicine > Diagnostic Medicine (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Data Science > Data Mining > Big Data (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

CohortFinder: an open-source tool for data-driven partitioning of biomedical image cohorts to yield robust machine learning models

Fan, Fan, Martinez, Georgia, Desilvio, Thomas, Shin, John, Chen, Yijiang, Wang, Bangchen, Ozeki, Takaya, Lafarge, Maxime W., Koelzer, Viktor H., Barisoni, Laura, Madabhushi, Anant, Viswanath, Satish E., Janowczyk, Andrew

Batch effects (BEs) refer to systematic technical differences in data collection unrelated to biological variations whose noise is shown to negatively impact machine learning (ML) model generalizability. Here we release CohortFinder, an open-source tool aimed at mitigating BEs via data-driven cohort partitioning. We demonstrate CohortFinder improves ML model performance in downstream medical image processing tasks. CohortFinder is freely available for download at cohortfinder.com.

artificial intelligence, cohortfinder, machine learning, (17 more...)

2307.08673

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > North Carolina > Durham County > Durham (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Beyond Intuition, a Framework for Applying GPs to Real-World Data

Tazi, Kenza, Lin, Jihao Andreas, Viljoen, Ross, Gardner, Alex, John, ST, Ge, Hong, Turner, Richard E.

Gaussian Processes (GPs) offer an attractive method for regression over small, structured and correlated datasets. However, their deployment is hindered by computational costs and limited guidelines on how to apply GPs beyond simple low-dimensional datasets. We propose a framework to identify the suitability of GPs to a given problem and how to set up a robust and well-specified GP model. The guidelines formalise the decisions of experienced GP practitioners, with an emphasis on kernel design and options for computational scalability. The framework is then applied to a case study of glacier elevation change yielding more accurate results at test time.

artificial intelligence, data mining, machine learning, (20 more...)

2307.03093

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
North America > Greenland (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
(4 more...)

Genre:

Workflow (0.52)
Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
(2 more...)

Ghaderi, Hamid, Foreman, Brandon, Nayebi, Amin, Tipirneni, Sindhu, Reddy, Chandan K., Subbian, Vignesh

Identifying TBI Physiological States by Clustering Multivariate Clinical Time-Series Data

Determining clinically relevant physiological states from multivariate time series data with missing values is essential for providing appropriate treatment for acute conditions such as Traumatic Brain Injury (TBI), respiratory failure, and heart failure. Utilizing non-temporal clustering or data imputation and aggregation techniques may lead to loss of valuable information and biased analyses. In our study, we apply the SLAC-Time algorithm, an innovative self-supervision-based approach that maintains data integrity by avoiding imputation or aggregation, offering a more useful representation of acute patient states. By using SLAC-Time to cluster data in a large research dataset, we identified three distinct TBI physiological states and their specific feature profiles. We employed various clustering evaluation metrics and incorporated input from a clinical domain expert to validate and interpret the identified physiological states. Further, we discovered how specific clinical events and interventions can influence patient states and state transitions.

data mining, machine learning, physiological state, (16 more...)

2303.13024

Country:

North America > United States > Arizona > Pima County > Tucson (0.14)
North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Ohio > Hamilton County > Cincinnati (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Data Science > Data Mining (0.88)
Information Technology > Data Science > Data Quality (0.68)

Langerbein, Janine, Massing, Till, Klenke, Jens, Reckmann, Natalie, Striewe, Michael, Goedicke, Michael, Hanck, Christoph

A Data Mining Approach for Detecting Collusion in Unproctored Online Exams

Due to the precautionary measures during the COVID-19 pandemic many universities offered unproctored take-home exams. We propose methods to detect potential collusion between students and apply our approach on event log data from take-home exams during the pandemic. We find groups of students with suspiciously similar exams. In addition, we compare our findings to a proctored control group. By this, we establish a rule of thumb for evaluating which cases are "outstandingly similar", i.e., suspicious cases.

artificial intelligence, data mining, machine learning, (18 more...)

2302.07014

Country:

Europe > Germany (0.14)
Europe > Austria > Vienna (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.86)

Industry:

Information Technology > Security & Privacy (0.68)
Education > Educational Setting (0.68)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.34)
Health & Medicine > Therapeutic Area > Immunology (0.34)

Technology:

Information Technology > Security & Privacy (0.68)
Information Technology > Data Science > Data Mining (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.30)

Marette, Thibault, Miettinen, Pauli, Neumann, Stefan

Visualizing Overlapping Biclusterings and Boolean Matrix Factorizations

Finding (bi-)clusters in bipartite graphs is a popular data analysis approach. Analysts typically want to visualize the clusters, which is simple as long as the clusters are disjoint. However, many modern algorithms find overlapping clusters, making visualization more complicated. In this paper, we study the problem of visualizing \emph{a given clustering} of overlapping clusters in bipartite graphs and the related problem of visualizing Boolean Matrix Factorizations. We conceptualize three different objectives that any good visualization should satisfy: (1) proximity of cluster elements, (2) large consecutive areas of elements from the same cluster, and (3) large uninterrupted areas in the visualization, regardless of the cluster membership. We provide objective functions that capture these goals and algorithms that optimize these objective functions. Interestingly, in experiments on real-world datasets, we find that the best trade-off between these competing goals is achieved by a novel heuristic, which locally aims to place rows and columns with similar cluster membership next to each other.

artificial intelligence, bicluster, machine learning, (17 more...)

2307.07396

Country:

North America > United States > Pennsylvania (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Finland > Northern Savo > Kuopio (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

A testing-based approach to assess the clusterability of categorical data

Hu, Lianyu, Dong, Junjie, Jiang, Mudi, Liu, Yan, He, Zengyou

The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly correlated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions based on existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.

artificial intelligence, categorical data, machine learning, (16 more...)

2307.07346

Country: Asia > China > Liaoning Province > Dalian (0.05)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Graziani, Mara, Mahony, Laura O', Nguyen, An-Phi, Müller, Henning, Andrearczyk, Vincent

Uncovering Unique Concept Vectors through Latent Space Decomposition

Interpreting the inner workings of deep learning models is crucial for establishing trust and ensuring model safety. Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates such as pixel saliency. However, defining the concepts for the interpretability analysis biases the explanations by the user's expectations on the concepts. To address this, we propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training. By decomposing the latent space of a layer in singular vectors and refining them by unsupervised clustering, we uncover concept vectors aligned with directions of high variance that are relevant to the model prediction, and that point to semantically distinct concepts. Our extensive experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand. Moreover, we showcase the practical utility of our method in dataset exploration, where our concept vectors successfully identify outlier training samples affected by various confounding factors. This novel exploration technique has remarkable versatility to data types and model architectures and it will facilitate the identification of biases and the discovery of sources of error within training data.

artificial intelligence, machine learning, vector, (20 more...)

2307.06913

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Ireland (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)