AITopics | subgroup discovery

subgroup discovery

Subgroup Discovery with the Cox Model

Izzo, Zachary, Melvin, Iain

arXiv.org Machine Learning, Dec-25-2025

We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. Our work is the first to study this particular subgroup problem, for which we make several contributions. Subgroup discovery methods generally require a "quality function" in order to sift through and select the most advantageous subgroups. We first examine why existing natural choices for quality functions are insufficient to solve the subgroup discovery problem for the Cox model. To address the shortcomings of existing metrics, we introduce two technical innovations: the *expected prediction entropy (EPE)*, a novel metric for evaluating survival models which predict a hazard function; and the *conditional rank statistics (CRS)*, a statistical object which quantifies the deviation of an individual point from the distribution of survival times in an existing subgroup. We study the EPE and CRS theoretically and show that they can solve many of the problems with existing metrics. We introduce a total of eight algorithms for the Cox subgroup discovery problem. The main algorithm is able to take advantage of both the EPE and the CRS, allowing us to give theoretical correctness results for this algorithm in a well-specified setting. We evaluate all of the proposed methods empirically on both synthetic and real data. The experiments confirm our theory, showing that our contributions allow for the recovery of a ground-truth subgroup in well-specified cases and lead to better model fit than naively fitting the Cox model to the whole dataset in practical settings. Lastly, we conduct a case study on jet engine simulation data from NASA. The discovered subgroups uncover known nonlinearities/homogeneity in the data and suggest design choices that have been mirrored in practice.
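
For orientation, the standard Cox proportional hazards model that the abstract builds on can be written as below. The notation is an assumption made here for illustration, not the paper's: $t_i$ is the observed time, $\delta_i$ the event indicator (1 if the event was observed, 0 if censored), $x_i$ the covariates, and $h_0$ the baseline hazard. The partial log-likelihood is stated for the case without tied event times. The EPE and CRS are defined in the paper and are not reproduced here.

$$
h(t \mid x) = h_0(t)\,\exp(\beta^\top x),
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1} \Big[ \beta^\top x_i - \log \sum_{j:\, t_j \ge t_i} \exp(\beta^\top x_j) \Big]
$$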

cox model, dataset, subgroup, (16 more...)

2512.20762

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom (0.04)
Europe > Belgium > Flanders > Antwerp Province > Antwerp (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
Government > Regional Government > North America Government > United States Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Learning Subgroups with Maximum Treatment Effects without Causal Heuristics

Yang, Lincen, Li, Zhong, van Leeuwen, Matthijs, Salehkaleybar, Saber

arXiv.org Artificial Intelligence, Nov-26-2025

Discovering subgroups with the maximum average treatment effect is crucial for targeted decision making in domains such as precision medicine, public policy, and education. While most prior work is formulated in the potential outcome framework, the corresponding structural causal model (SCM) for this task has been largely overlooked. In practice, two approaches dominate. The first estimates pointwise conditional treatment effects and then fits a tree on those estimates, effectively turning subgroup estimation into the harder problem of accurate pointwise estimation. The second constructs decision trees or rule sets with ad-hoc 'causal' heuristics, typically without rigorous justification for why a given heuristic may be used or whether such heuristics are necessary at all. We address these issues by studying the problem directly under the SCM framework. Under the assumption of a partition-based model, we show that optimal subgroup discovery reduces to recovering the data-generating models and hence to a standard supervised learning problem (regression or classification). This allows us to adopt any partition-based method to learn the subgroup from data. We instantiate the approach with CART, arguably one of the most widely used tree-based methods, to learn the subgroup with maximum treatment effect. Finally, on a large collection of synthetic and semi-synthetic datasets, we compare our method against a wide range of baselines and find that our approach, which avoids such causal heuristics, more accurately identifies subgroups with maximum treatment effect. Our source code is available at https://github.com/ylincen/causal-subgroup.
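
As a rough illustration of what "the subgroup with maximum treatment effect" means once a partition is fixed, the sketch below computes a difference-in-means effect estimate per cell and picks the largest. It is not the authors' CART-based procedure: the function name `best_cell`, the record layout, and the toy data are hypothetical, the partition is assumed to be given (for example, the leaves of a fitted tree), and treatment is assumed to be randomized within each cell.

```python
from collections import defaultdict

def best_cell(records):
    """Return (cell, effect) for the cell with the largest difference in
    mean outcome between treated and control units.

    records: iterable of (cell, treated, outcome) tuples (hypothetical layout).
    """
    # Per cell: [treated_sum, treated_n, control_sum, control_n]
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for cell, treated, outcome in records:
        s = sums[cell]
        if treated:
            s[0] += outcome
            s[1] += 1
        else:
            s[2] += outcome
            s[3] += 1
    effects = {}
    for cell, (ts, tn, cs, cn) in sums.items():
        if tn and cn:  # need both arms to estimate an effect
            effects[cell] = ts / tn - cs / cn
    return max(effects.items(), key=lambda kv: kv[1])

# Toy data: cell "A" has a large effect, cell "B" a small one.
data = [
    ("A", True, 3.0), ("A", True, 5.0), ("A", False, 1.0), ("A", False, 1.0),
    ("B", True, 2.0), ("B", True, 2.0), ("B", False, 2.0), ("B", False, 1.0),
]
result = best_cell(data)
```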

artificial intelligence, decision tree learning, machine learning, (17 more...)

2511.20189

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Causal Clustering for Conditional Average Treatment Effects Estimation and Subgroup Discovery

Wang, Zilong, Ayer, Turgay, Yang, Shihao

arXiv.org Machine Learning, Sep-18-2025

Estimating heterogeneous treatment effects is critical in domains such as personalized medicine, resource allocation, and policy evaluation. A central challenge lies in identifying subpopulations that respond differently to interventions, thereby enabling more targeted and effective decision-making. While clustering methods are well-studied in unsupervised learning, their integration with causal inference remains limited. We propose a novel framework that clusters individuals based on estimated treatment effects using a learned kernel derived from causal forests, revealing latent subgroup structures. Our approach consists of two main steps. First, we estimate debiased Conditional Average Treatment Effects (CATEs) using orthogonalized learners via the Robinson decomposition, yielding a kernel matrix that encodes sample-level similarities in treatment responsiveness. Second, we apply kernelized clustering to this matrix to uncover distinct, treatment-sensitive subpopulations and compute cluster-level average CATEs. We present this kernelized clustering step as a form of regularization within the residual-on-residual regression framework. Through extensive experiments on semi-synthetic and real-world datasets, supported by ablation studies and exploratory analyses, we demonstrate the effectiveness of our method in capturing meaningful treatment effect heterogeneity.
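
For readers unfamiliar with the Robinson decomposition mentioned above, its standard form is sketched below. The symbols are assumptions chosen here, not taken from the paper: $Y_i$ is the outcome, $W_i$ the treatment, $X_i$ the covariates, $m(x) = \mathbb{E}[Y \mid X = x]$ the conditional mean outcome, $e(x) = \mathbb{E}[W \mid X = x]$ the propensity score, and $\tau(x)$ the CATE. The second line is the usual residual-on-residual objective with plug-in estimates $\hat m$ and $\hat e$; the paper's kernelized clustering regularizer is not reproduced.

$$
Y_i - m(X_i) = \tau(X_i)\,\big(W_i - e(X_i)\big) + \varepsilon_i,
\qquad
\hat\tau = \arg\min_{\tau} \sum_{i=1}^{n} \Big[ \big(Y_i - \hat m(X_i)\big) - \tau(X_i)\,\big(W_i - \hat e(X_i)\big) \Big]^2
$$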

dataset, kernel, treatment effect, (16 more...)

2509.05775

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.68)
Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Subgroup Performance Analysis in Hidden Stratifications

Bissoto, Alceu, Hoang, Trung-Dung, Flühmann, Tim, Sun, Susu, Baumgartner, Christian F., Koch, Lisa M.

arXiv.org Artificial Intelligence, Mar-13-2025

Machine learning (ML) models may suffer from significant performance disparities between patient groups. Identifying such disparities by monitoring performance at a granular level is crucial for safely deploying ML to each patient. Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main reasons for performance variability, which is not common. Subgroup discovery techniques that identify cohesive subgroups based on learned feature representations appear as a potential solution: They could expose hidden stratifications and provide more granular subgroup performance reports. However, subgroup discovery is challenging to evaluate even as a standalone task, as ground truth stratification labels do not exist in real data. Subgroup discovery has thus neither been applied nor evaluated for the application of subgroup performance monitoring. Here, we apply subgroup discovery for performance monitoring in chest x-ray and skin lesion classification. We propose novel evaluation strategies and show that a simplified subgroup discovery method without access to classification labels or metadata can expose larger performance disparities than traditional metadata-based subgroup analysis. We provide the first compelling evidence that subgroup discovery can serve as an important tool for comprehensive performance validation and monitoring of trustworthy AI in medicine.
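
The kind of subgroup performance report described above can be sketched in a few lines: compute accuracy per subgroup and report the largest gap. This is only an illustration under stated assumptions. The subgroup assignments are taken as given (in the paper they would come from a discovery method applied to learned features), and the function names and toy data are hypothetical.

```python
from collections import defaultdict

def subgroup_accuracy(labels, preds, groups):
    """Accuracy of `preds` against `labels`, computed separately per subgroup id."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        counts[g] += 1
        hits[g] += int(y == p)
    return {g: hits[g] / counts[g] for g in counts}

def max_gap(accuracies):
    """Largest accuracy difference between any two subgroups."""
    return max(accuracies.values()) - min(accuracies.values())

# Toy data: subgroup 0 is classified better than subgroup 1.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds  = [1, 0, 1, 0, 0, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

acc = subgroup_accuracy(labels, preds, groups)
gap = max_gap(acc)
```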

performance gap, subgroup, subgroup discovery, (11 more...)

2503.10382

Country:

Europe > Switzerland > Lucerne > Lucerne (0.04)
Europe > Switzerland > Bern > Bern (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Dermatology (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Fairpriori: Improving Biased Subgroup Discovery for Deep Neural Network Fairness

Zhou, Kacy, Wen, Jiawen, Yang, Nan, Yuan, Dong, Lu, Qinghua, Chen, Huaming

arXiv.org Artificial Intelligence, Jun-24-2024

While deep learning has become a core functional module of most software systems, concerns regarding the fairness of ML predictions have emerged as a significant issue that affects prediction results due to discrimination. Intersectional bias, which disproportionately affects members of subgroups, is a prime example of this. For instance, a machine learning model might exhibit bias against darker-skinned women, while not showing bias against individuals with darker skin or women. This problem calls for effective fairness testing before the deployment of such deep learning models in real-world scenarios. However, research into detecting such bias is currently limited compared to research on individual and group fairness. Existing tools to investigate intersectional bias lack important features such as support for multiple fairness metrics, fast and efficient computation, and user-friendly interpretation. This paper introduces Fairpriori, a novel biased subgroup discovery method, which aims to address these limitations. Fairpriori incorporates the frequent itemset generation algorithm to facilitate effective and efficient investigation of intersectional bias by producing fast fairness metric calculations on subgroups of a dataset. Through comparison with the state-of-the-art methods (e.g., Themis, FairFictPlay, and TestSGD) under similar conditions, Fairpriori demonstrates superior effectiveness and efficiency when identifying intersectional bias. Specifically, Fairpriori is easier to use and interpret, supports a wider range of use cases by accommodating multiple fairness metrics, and exhibits higher efficiency in computing fairness metrics. These findings showcase Fairpriori's potential for effectively uncovering subgroups affected by intersectional bias, supported by its open-source tooling at https://anonymous.4open.science/r/Fairpriori-0320.
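
To make the idea of combining frequent itemset generation with a fairness metric concrete, here is a minimal sketch. It is not the Fairpriori implementation: it uses a plain Apriori-style level-wise search, a single metric (the difference in positive-prediction rate between a subgroup and everyone else), and hypothetical names and toy data. The data mirror the abstract's example, where the intersection is disadvantaged far more than either attribute alone.

```python
from itertools import combinations

def matches(row, items):
    """True if the row has every (attribute, value) pair in `items`."""
    return all(row[a] == v for a, v in items)

def support(rows, items):
    return sum(matches(r, items) for r in rows) / len(rows)

def frequent_subgroups(rows, attrs, min_support):
    """Level-wise (Apriori-style) search for frequent (attribute, value) sets."""
    singles = sorted({(a, r[a]) for r in rows for a in attrs})
    level = [frozenset([it]) for it in singles if support(rows, [it]) >= min_support]
    found = []
    while level:
        found.extend(level)
        candidates = set()
        for s, t in combinations(level, 2):
            u = s | t
            # Grow by exactly one item, and use each attribute at most once.
            if len(u) == len(s) + 1 and len({a for a, _ in u}) == len(u):
                candidates.add(u)
        level = [c for c in sorted(candidates, key=sorted)
                 if support(rows, c) >= min_support]
    return found

def parity_gap(rows, preds, items):
    """Positive-prediction rate inside the subgroup minus the rate outside it."""
    inside = [p for r, p in zip(rows, preds) if matches(r, items)]
    outside = [p for r, p in zip(rows, preds) if not matches(r, items)]
    if not inside or not outside:
        return 0.0
    return sum(inside) / len(inside) - sum(outside) / len(outside)

# Toy data: two rows per (skin, gender) combination.
rows = [
    {"skin": "dark", "gender": "f"}, {"skin": "dark", "gender": "f"},
    {"skin": "dark", "gender": "m"}, {"skin": "dark", "gender": "m"},
    {"skin": "light", "gender": "f"}, {"skin": "light", "gender": "f"},
    {"skin": "light", "gender": "m"}, {"skin": "light", "gender": "m"},
]
preds = [0, 0, 1, 1, 1, 1, 1, 0]

subgroups = frequent_subgroups(rows, ["skin", "gender"], min_support=0.25)
worst = min(subgroups, key=lambda s: parity_gap(rows, preds, s))
```

The minimum-support threshold prunes rare subgroups early, which is what keeps the level-wise search from enumerating every attribute combination.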

fairness metric, fairpriori, subgroup, (16 more...)

2407.01595

Country:

North America > United States (0.28)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Using Constraints to Discover Sparse and Alternative Subgroup Descriptions

Bach, Jakob

arXiv.org Artificial Intelligence, Jun-3-2024

Subgroup-discovery methods allow users to obtain simple descriptions of interesting regions in a dataset. Using constraints in subgroup discovery can enhance interpretability even further. In this article, we focus on two types of constraints: First, we limit the number of features used in subgroup descriptions, making the latter sparse. Second, we propose the novel optimization problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features. We describe how to integrate both constraint types into heuristic subgroup-discovery methods. Further, we propose a novel Satisfiability Modulo Theories (SMT) formulation of subgroup discovery as a white-box optimization problem, which allows solver-based search for subgroups and is open to a variety of constraint types. Additionally, we prove that both constraint types lead to an NP-hard optimization problem. Finally, we employ 27 binary-classification datasets to compare heuristic and solver-based search for unconstrained and constrained subgroup discovery. We observe that heuristic search methods often yield high-quality subgroups within a short runtime, also in scenarios with constraints.
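
The sparsity constraint (a cap on the number of features in a subgroup description) can be illustrated with a small exhaustive search. This is a sketch under assumptions, not the article's heuristic or SMT-based methods: it uses weighted relative accuracy (WRAcc), a common subgroup quality function, restricts descriptions to conjunctions of equality conditions on categorical features, and uses hypothetical names and toy data.

```python
from itertools import combinations, product

def wracc(rows, target, cond):
    """Weighted relative accuracy: coverage * (subgroup positive rate - overall rate)."""
    covered = [t for r, t in zip(rows, target) if all(r[f] == v for f, v in cond)]
    if not covered:
        return 0.0
    n = len(rows)
    return len(covered) / n * (sum(covered) / len(covered) - sum(target) / n)

def best_sparse_subgroup(rows, target, features, k):
    """Best conjunction of (feature == value) conditions using at most k features."""
    best = ((), 0.0)
    for size in range(1, k + 1):
        for feats in combinations(features, size):
            values = [sorted({r[f] for r in rows}) for f in feats]
            for combo in product(*values):
                cond = tuple(zip(feats, combo))
                quality = wracc(rows, target, cond)
                if quality > best[1]:
                    best = (cond, quality)
    return best

# Toy data: the target is positive only when a == 1 and b == 1.
rows = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1) for _ in range(2)]
target = [int(r["a"] == 1 and r["b"] == 1) for r in rows]

one_feature = best_sparse_subgroup(rows, target, ["a", "b"], k=1)
two_features = best_sparse_subgroup(rows, target, ["a", "b"], k=2)
```

With `k=1` the search has to settle for a coarser description of lower quality; raising the cap to `k=2` recovers the exact subgroup at the cost of a longer description.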

constraint, subgroup, subgroup description, (16 more...)

2406.01411

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > Slovenia > Upper Carniola > Municipality of Bled > Bled (0.04)
Europe > Germany > Berlin (0.04)
(31 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Learning Exceptional Subgroups by End-to-End Maximizing KL-divergence

Xu, Sascha, Walter, Nils Philipp, Kalofolias, Janis, Vreeken, Jilles

arXiv.org Artificial Intelligence, Feb-20-2024

Finding and describing sub-populations that are exceptional regarding a target property has important applications in many scientific disciplines, from identifying disadvantaged demographic groups in census data to finding conductive molecules within gold nanoparticles. Current approaches to finding such subgroups require pre-discretized predictive variables, do not permit non-trivial target distributions, do not scale to large datasets, and struggle to find diverse results. To address these limitations, we propose Syflow, an end-to-end optimizable approach in which we leverage normalizing flows to model arbitrary target distributions, and introduce a novel neural layer that results in easily interpretable subgroup descriptions. We demonstrate on synthetic and real-world data, including a case study, that Syflow reliably finds highly exceptional subgroups accompanied by insightful descriptions.
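
The objective named in the title, the KL divergence between the target distribution inside a subgroup and in the whole population, can be illustrated on discrete data. This is a simplified sketch: Syflow models arbitrary target distributions with normalizing flows, whereas the code below assumes a categorical target and plain histograms, and its function names and toy data are hypothetical.

```python
import math
from collections import Counter

def histogram(values, bins):
    """Empirical probabilities of `values` over a fixed list of `bins`."""
    counts = Counter(values)
    n = len(values)
    return [counts[b] / n for b in bins]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data: the subgroup contains only "hi" targets, which are rare overall.
bins = ["lo", "hi"]
population = ["lo", "lo", "lo", "hi"] * 5
subgroup = ["hi"] * 5

p = histogram(subgroup, bins)
q = histogram(population, bins)
exceptionality = kl_divergence(p, q)
```

Because the subgroup is drawn from the population, every target value seen in the subgroup also has nonzero probability in the population, so the divergence is finite.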

learning exceptional subgroup, subgroup, target distribution, (11 more...)

2402.1293

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia (0.04)

Genre: Research Report (0.82)

Industry: Automobiles & Trucks (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Subgroup Discovery in MOOCs: A Big Data Application for Describing Different Types of Learners

Luna, J. M., Fardoun, H. M., Padillo, F., Romero, C., Ventura, S.

arXiv.org Artificial Intelligence, Feb-10-2024

The aim of this paper is to categorize and describe different types of learners in massive open online courses (MOOCs) by means of a subgroup discovery approach based on MapReduce. The final objective is to discover IF-THEN rules that appear in different MOOCs. The proposed subgroup discovery approach, which is an extension of the well-known FP-Growth algorithm, considers emerging parallel methodologies like MapReduce to be able to cope with extremely large datasets. As an additional feature, the proposal includes a threshold value to denote the number of courses that each discovered rule should satisfy. A post-processing step is also included so that redundant subgroups can be removed. The experimental stage is carried out by considering de-identified data from the first year of 16 MITx and HarvardX courses on the edX platform. Experimental results demonstrate that the proposed MapReduce approach outperforms traditional sequential subgroup discovery approaches, achieving a runtime that is almost constant for different courses. Additionally, thanks to the final post-processing step, only interesting and non-redundant rules are discovered, hence reducing the number of subgroups by one or two orders of magnitude. Finally, the discovered subgroups are easily used by course instructors not only for descriptive purposes but also for additional tasks such as recommendation or personalization.
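
The course-count threshold described above fits the MapReduce pattern naturally: a map step emits each rule with the course it was found in, and a reduce step keeps rules seen in enough courses. The sketch below simulates both steps sequentially in one process. It does not mine the rules themselves (the paper extends FP-Growth for that), and the function names, rule strings, and course ids are hypothetical.

```python
from collections import defaultdict

def map_course(course_id, rules):
    """Map step: emit (rule, course_id) once per distinct rule found in a course."""
    for rule in set(rules):
        yield rule, course_id

def reduce_rules(pairs, min_courses):
    """Reduce step: keep rules that appear in at least `min_courses` courses."""
    courses = defaultdict(set)
    for rule, course_id in pairs:
        courses[rule].add(course_id)
    return {rule: sorted(ids) for rule, ids in courses.items()
            if len(ids) >= min_courses}

# Toy input: rules assumed to have been mined per course already.
mined = {
    "course_1": ["IF viewed=yes THEN certified=yes", "IF age<25 THEN explored=yes"],
    "course_2": ["IF viewed=yes THEN certified=yes"],
    "course_3": ["IF viewed=yes THEN certified=yes", "IF age<25 THEN explored=yes"],
}

pairs = [pair for cid, rules in mined.items() for pair in map_course(cid, rules)]
shared = reduce_rules(pairs, min_courses=3)
```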

learner, student, subgroup, (16 more...)

doi: 10.1007/s10115-022-01674-9

2403.05555

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Spain > Andalusia > Córdoba Province > Córdoba (0.04)
Asia > Middle East > Saudi Arabia (0.04)

Genre:

Research Report (1.00)
Instructional Material > Online (1.00)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

Tree-based Subgroup Discovery In Electronic Health Records: Heterogeneity of Treatment Effects for DTG-containing Therapies

Yang, Jiabei, Mwangi, Ann W., Kantor, Rami, Dahabreh, Issa J., Nyambura, Monicah, Delong, Allison, Hogan, Joseph W., Steingrimsson, Jon A.

arXiv.org Machine Learning, Aug-30-2022

Estimating treatment effects using EHR data poses several challenges, including time-varying confounding, repeated and temporally non-aligned measurements of covariates, treatment assignments and outcomes, and loss to follow-up due to dropout. Here, we develop the Subgroup Discovery for Longitudinal Data (SDLD) algorithm, a tree-based algorithm for discovering subgroups with heterogeneous treatment effects using longitudinal data by combining the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus (HIV) who are at higher risk of weight gain when receiving dolutegravir-containing antiretroviral therapies (ARTs) versus when receiving non-dolutegravir-containing ARTs. Key words: Causal Inference; Dolutegravir; Electronic health record; Heterogeneity of treatment effects; Longitudinal targeted maximum likelihood estimation; Machine learning; Recursive partitioning; Subgroup discovery.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2208.14329

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Africa > Kenya > Western Province (0.04)
Africa > Cameroon (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)
Research Report > Strength High (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology > HIV (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Subgroup Discovery in Unstructured Data

Arab, Ali, Arora, Dev, Lu, Jialin, Ester, Martin

arXiv.org Artificial Intelligence, Jul-15-2022

Subgroup discovery is a descriptive and exploratory data mining technique to identify subgroups in a population that exhibit interesting behavior with respect to a variable of interest. Subgroup discovery has numerous applications in knowledge discovery and hypothesis generation, yet it remains inapplicable to unstructured, high-dimensional data such as images. This is because subgroup discovery algorithms rely on defining descriptive rules based on (attribute, value) pairs; in unstructured data, however, an attribute is not well defined. Even in cases where the notion of an attribute intuitively exists in the data, such as a pixel in an image, the high dimensionality of the data means that these attributes are not informative enough to be used in a rule. In this paper, we introduce the subgroup-aware variational autoencoder, a novel variational autoencoder that learns a representation of unstructured data which leads to subgroups with higher quality. Our experimental results demonstrate the effectiveness of the method at learning subgroups with high quality while supporting the interpretability of the concepts.

artificial intelligence, data mining, machine learning, (21 more...)

2207.07781

Country:

North America > Canada (0.05)
North America > United States > New York (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area (0.46)
Materials > Metals & Mining (0.34)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)