Biecek, Przemysław
MASCOTS: Model-Agnostic Symbolic COunterfactual explanations for Time Series
Płudowski, Dawid, Spinnato, Francesco, Wilczyński, Piotr, Kotowski, Krzysztof, Ntagiou, Evridiki Vasileia, Guidotti, Riccardo, Biecek, Przemysław
Counterfactual explanations provide an intuitive way to understand model decisions by identifying minimal changes required to alter an outcome. However, applying counterfactual methods to time series models remains challenging due to temporal dependencies, high dimensionality, and the lack of an intuitive human-interpretable representation. We introduce MASCOTS, a method that leverages the Bag-of-Receptive-Fields representation alongside symbolic transformations inspired by Symbolic Aggregate Approximation. By operating in a symbolic feature space, it enhances interpretability while preserving fidelity to the original data and model. Unlike existing approaches that either depend on model structure or autoencoder-based sampling, MASCOTS directly generates meaningful and diverse counterfactual observations in a model-agnostic manner, operating on both univariate and multivariate data. We evaluate MASCOTS on univariate and multivariate benchmark datasets, demonstrating comparable validity, proximity, and plausibility to state-of-the-art methods, while significantly improving interpretability and sparsity. Its symbolic nature allows for explanations that can be expressed visually, in natural language, or through semantic representations, making counterfactual reasoning more accessible and actionable.
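A minimal sketch of the kind of symbolic step the abstract alludes to: a SAX-style discretization that z-normalizes a series, averages it over segments (PAA), and maps segment means to letters via Gaussian breakpoints. This is generic SAX, not the MASCOTS pipeline itself; the alphabet size and segment count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def sax_transform(series, n_segments=8, alphabet_size=4):
    """SAX-style symbolic discretization: z-normalize, PAA, then map to letters."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)        # z-normalization
    segments = np.array_split(x, n_segments)      # piecewise aggregate approximation
    paa = np.array([seg.mean() for seg in segments])
    # Breakpoints that split the standard normal into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    letters = "abcdefghijklmnopqrstuvwxyz"[:alphabet_size]
    return "".join(letters[np.searchsorted(breakpoints, v)] for v in paa)

# Example: a noisy sine wave becomes a short symbolic word
t = np.linspace(0, 2 * np.pi, 128)
print(sax_transform(np.sin(t) + 0.1 * np.random.randn(128)))  # e.g. "cddcbaab"
```

In a symbolic space like this, a counterfactual can be phrased as "change the third symbol from 'a' to 'c'", which is what makes the explanations easy to express visually or in natural language.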
The Role of Hyperparameters in Predictive Multiplicity
Cavus, Mustafa, Woźnica, Katarzyna, Biecek, Przemysław
This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
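One way to quantify the prediction discrepancy discussed above (a hedged sketch, not the paper's exact protocol) is to train the same algorithm under different hyperparameter settings and measure the fraction of test points on which the resulting classifiers disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a benchmark dataset (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The same algorithm under different gamma settings
gammas = [0.001, 0.01, 0.1, 1.0]
preds = np.column_stack([
    SVC(gamma=g).fit(X_tr, y_tr).predict(X_te) for g in gammas
])

# Discrepancy: share of test points on which any pair of models disagrees
disagree = (preds.min(axis=1) != preds.max(axis=1)).mean()
print(f"discrepancy across gamma settings: {disagree:.3f}")
```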
Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models
Mieleszczenko-Kowszewicz, Wiktoria, Bajcar, Beata, Babiak, Jolanta, Dyczek, Berenika, Świstak, Jakub, Biecek, Przemysław
Be careful what you ask for, you just might get it. This saying fits the way large language models (LLMs) are trained: instead of being rewarded for correctness, they are increasingly rewarded for pleasing the recipient. As a result, they are increasingly effective at persuading us that their answers are valuable. But what tricks do they use in this persuasion? In this study, we examine the psycholinguistic features of responses generated by twelve different language models. By grouping response content according to rational or emotional prompts and exploring the social influence principles employed by LLMs, we ask whether and how we can mitigate the risks of LLM-driven mass misinformation. We position this study within the broader discourse on human-centred AI, emphasizing the need for interdisciplinary approaches to mitigate the cognitive and societal risks posed by persuasive AI responses.

The Dark Patterns of Personalized Persuasion in Large Language Models: Exposing Persuasive Linguistic Features for Big Five Personality Traits in LLMs Responses
Mieleszczenko-Kowszewicz, Wiktoria, Płudowski, Dawid, Kołodziejczyk, Filip, Świstak, Jakub, Sienkiewicz, Julian, Biecek, Przemysław
This study explores how Large Language Models (LLMs) adjust linguistic features to create personalized persuasive outputs. While prior research has shown that LLMs personalize outputs, a gap remains in understanding the linguistic features behind their persuasive capabilities. We identified 13 linguistic features crucial for influencing personalities across different levels of the Big Five model of personality. We analyzed how prompts containing personality trait information influenced the output of 19 LLMs across five model families. The findings show that models use more anxiety-related words for neuroticism, more achievement-related words for conscientiousness, and fewer cognitive-process words for openness to experience. Some model families excel at adapting language for openness to experience, others for conscientiousness, while only one model adapts language for neuroticism. Our findings show how LLMs tailor responses based on personality cues in prompts, indicating their potential to create persuasive content that affects the minds and well-being of recipients.
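A sketch of the kind of feature extraction such an analysis relies on: counting how often words from a category lexicon occur in a model response. The tiny anxiety word list below is made up for illustration; the study's actual dictionaries are not reproduced here.

```python
import re

# Tiny illustrative lexicon; the study's real word-category dictionaries are
# an assumption here and not reproduced.
ANXIETY_WORDS = {"worry", "afraid", "nervous", "fear", "risk", "uncertain"}

def category_rate(text: str, lexicon: set) -> float:
    """Share of tokens in `text` that belong to the given word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)

response = "I worry this plan carries risk; the outcome is uncertain and I am afraid it may fail."
print(f"anxiety-word rate: {category_rate(response, ANXIETY_WORDS):.3f}")
```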
An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification
Cavus, Mustafa, Biecek, Przemysław
Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to poor performance in predicting the minority class. To address this issue, balancing or resampling methods are critical pre-processing steps in the modeling process. However, the usefulness of these methods has been debated and questioned in recent years. In particular, many candidate models may exhibit very similar predictive performance, a phenomenon called the Rashomon effect, during model selection. Selecting one of them without considering predictive multiplicity, that is, the case in which the models yield conflicting predictions for the same samples, may mean forgoing the benefits of an alternative model. In this study, in addition to the existing debates, we examine the impact of balancing methods on predictive multiplicity through the Rashomon effect. This matters because blindly selecting a model from a set of approximately equally accurate models is risky and may lead to serious problems in model selection, validation, and explanation. To tackle this matter, we conducted experiments on real datasets to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect. Our findings show that balancing methods inflate predictive multiplicity and yield varying results. To monitor the trade-off between performance and predictive multiplicity and conduct the modeling process responsibly, we propose using the extended performance-gain plot for the Rashomon effect.
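A hedged sketch of the underlying measurement, not the paper's exact procedure: collect models whose accuracy is within a small epsilon of the best one (a Rashomon set) and count how often their predictions conflict on the same samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data; balancing methods themselves are omitted here.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Candidate models: same algorithm, different random seeds
models = [RandomForestClassifier(n_estimators=100, random_state=s).fit(X_tr, y_tr)
          for s in range(10)]
accs = np.array([m.score(X_te, y_te) for m in models])
preds = np.column_stack([m.predict(X_te) for m in models])

# Rashomon set: models within epsilon of the best accuracy
eps = 0.01
in_set = accs >= accs.max() - eps
rashomon_preds = preds[:, in_set]

# Predictive multiplicity: share of samples with conflicting predictions
conflict = (rashomon_preds.min(axis=1) != rashomon_preds.max(axis=1)).mean()
print(f"{in_set.sum()} models in the Rashomon set, conflict rate {conflict:.3f}")
```

Repeating such a measurement with and without a balancing method applied to the training data is one way to see whether resampling inflates the conflict rate.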
Position: Do Not Explain Vision Models Without Context
Tomaszewska, Paulina, Biecek, Przemysław
Does the stethoscope in the picture make the adjacent person a doctor or a patient? This, of course, depends on the contextual relationship of the two objects. If it's obvious, why don't explanation methods for vision models use contextual information? In this paper, we (1) review the most popular methods of explaining computer vision models by pointing out that they do not take into account context information, (2) show examples of failures of popular XAI methods, (3) provide examples of real-world use cases where spatial context plays a significant role, (4) propose new research directions that may lead to better use of context information in explaining computer vision models, and (5) argue that a change in approach to explanations is needed from 'where' to 'how'.
Global Counterfactual Directions
Sobieski, Bartlomiej, Biecek, Przemysław
Despite increasing progress in the development of methods for generating visual counterfactual explanations, especially with the recent rise of Denoising Diffusion Probabilistic Models, previous works treat them as an entirely local technique. In this work, we take the first step toward globalizing them. Specifically, we discover that the latent space of Diffusion Autoencoders encodes the inference process of a given classifier in the form of global directions. We propose a novel proxy-based approach that discovers two types of these directions using only a single image, in an entirely black-box manner. Precisely, g-directions allow for flipping the decision of a given classifier on an entire dataset of images, while h-directions further increase the diversity of explanations. We refer to them collectively as Global Counterfactual Directions (GCDs). Moreover, we show that GCDs can be naturally combined with Latent Integrated Gradients, resulting in a new black-box attribution method, while simultaneously enhancing the understanding of counterfactual explanations. We validate our approach on existing benchmarks and show that it generalizes to real-world use cases.
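The core mechanic can be illustrated with a generic latent-direction walk: move an image's latent code along a fixed direction until the classifier's decision flips. This is a sketch under hypothetical encoder, decoder, and classifier interfaces, not the GCD discovery procedure itself.

```python
import torch

def flip_with_direction(x, encoder, decoder, classifier, direction,
                        step=0.1, max_steps=50):
    """Walk a latent code along a global direction until the classifier flips.

    `encoder`, `decoder`, and `classifier` are hypothetical callables standing in
    for a diffusion-autoencoder latent space and a black-box model. Assumes a
    single image with a batch dimension of one.
    """
    z = encoder(x)
    original = classifier(x).argmax(dim=-1)
    d = direction / direction.norm()
    for k in range(1, max_steps + 1):
        x_cf = decoder(z + k * step * d)             # decode the shifted latent
        if classifier(x_cf).argmax(dim=-1) != original:
            return x_cf, k * step                    # counterfactual and shift size
    return None, None                                # no flip within the budget
```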
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Hryniewska-Guzik, Weronika, Longo, Luca, Biecek, Przemysław
Deep learning models, despite their unprecedented success [1, 2], lack full transparency and interpretability in their decision-making processes [3, 4]. This has led to growing concerns about the use of "black box" models and the need for explanations to better understand their inferential process [5]. Using examples of specific cases from a dataset, generated explanations might reveal which elements are most important in a model's prediction [6, 7, 8]. Currently, explanations generated for a trained deep learning model are often presented as individual insights that need to be investigated separately and then compared [9]. Each explanation provides a limited view of the model's decision, as it tends to focus on specific aspects, making it challenging for a human to obtain a comprehensive understanding. This approach hinders the ability to discern the reasons behind a model's predictions. There has been an emerging trend of explanation ensembling, derived from model ensembling, which combines multiple predictive models to reduce the variance of predictions and often leads to higher overall performance; examples of such predictive techniques are random forests [10] and gradient boosting [11]. This trend suggests that individual explanations possess unique pieces of information that, when combined, might form a more comprehensive and accurate understanding of a model's inferential process.
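As a point of contrast for the ensembling idea described above, a naive baseline (not the CNN-based approach of the paper) simply normalizes several attribution maps to a common scale and averages them.

```python
import numpy as np

def ensemble_attributions(maps):
    """Naively ensemble attribution maps by min-max normalizing and averaging.

    `maps` is a list of 2D arrays produced by different explanation methods
    (e.g. saliency, Grad-CAM, LRP) for the same image and model; this simple
    mean is only a baseline, not the CNN-based ensembling the paper proposes.
    """
    normalized = []
    for m in maps:
        m = np.asarray(m, dtype=float)
        span = m.max() - m.min()
        normalized.append((m - m.min()) / span if span > 0 else np.zeros_like(m))
    return np.mean(normalized, axis=0)

# Example with random stand-ins for three explanation maps
rng = np.random.default_rng(0)
combined = ensemble_attributions([rng.random((7, 7)) for _ in range(3)])
print(combined.shape)  # (7, 7)
```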
A comparative analysis of deep learning models for lung segmentation on X-ray images
Hryniewska-Guzik, Weronika, Bilski, Jakub, Chrostowski, Bartosz, Sbahi, Jakub Drak, Biecek, Przemysław
In the field of medical imaging, accurate segmentation of lungs on X-rays is important in many applications [6], from early disease detection to treatment planning and patient monitoring. As healthcare evolves, the need for fast and accurate tools grows, motivating support for physicians with deep learning approaches [5]. In particular, solutions such as U-Net demonstrate the potential to automate the task of lung segmentation, offering promising advances in accuracy [8]. However, despite these advances, the inevitable diversity of X-ray images makes it difficult for some modern segmentation methods to handle them. Although many solutions show high performance in simple and typical cases, their performance degrades when confronted with complex ones. Moreover, applying pre-trained models to images with different characteristics may have negative consequences in real-world deployments [3]. Recognizing these challenges, our objective is to analyze existing solutions for lung segmentation and systematically evaluate their performance across a dataset of varying characteristics. In this study, we analyze and compare three prominent methods, Lung VAE, TransResUNet, and CE-Net, using five image modifications. The ultimate goal is to determine the most accurate method for lung segmentation in diverse scenarios.
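Segmentation quality in comparisons like this is typically summarized with overlap metrics; the sketch below shows the Dice coefficient, a standard choice, though the exact evaluation protocol used in the study is an assumption here.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-8):
    """Dice overlap between a predicted and a reference binary lung mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

# Example: two toy 4x4 masks that mostly agree
pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]])
true = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]])
print(f"Dice: {dice_coefficient(pred, true):.3f}")  # ~0.909
```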
Interpretable Machine Learning for Survival Analysis
Langbein, Sophie Hanna, Krzyziński, Mateusz, Spytek, Mikołaj, Baniecki, Hubert, Biecek, Przemysław, Wright, Marvin N.
With the spread and rapid advancement of black box machine learning models, the field of interpretable machine learning (IML) or explainable artificial intelligence (XAI) has become increasingly important over the last decade. This is particularly relevant for survival analysis, where the adoption of IML techniques promotes transparency, accountability and fairness in sensitive areas such as clinical decision-making processes, the development of targeted therapies and interventions, and other medical or healthcare-related contexts. More specifically, explainability can uncover a survival model's potential biases and limitations and provide more mathematically sound ways to understand how and which features are influential for prediction or constitute risk factors. However, the lack of readily available IML methods may have deterred medical practitioners and policy makers in public health from leveraging the full potential of machine learning for predicting time-to-event data. We present a comprehensive review of the limited existing work on IML methods for survival analysis within the context of the general IML taxonomy. In addition, we formally detail how commonly used IML methods, such as individual conditional expectation (ICE), partial dependence plots (PDP), accumulated local effects (ALE), different feature importance measures, and Friedman's H-interaction statistics, can be adapted to survival outcomes. An application of several IML methods to real data on under-5 mortality of Ghanaian children from the Demographic and Health Surveys (DHS) Program serves as a tutorial or guide for researchers on how to utilize the techniques in practice to facilitate understanding of model decisions or predictions.
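As an illustration of how a PDP can be adapted to survival outcomes (a hedged sketch assuming scikit-survival and synthetic data, not the review's implementations or the DHS application): replace a chosen feature with each value on a grid for all observations and average the model's predicted survival probability at a fixed time horizon.

```python
import numpy as np
import pandas as pd
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Synthetic survival data where older subjects tend to have shorter times
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({"age": rng.uniform(30, 80, n), "biomarker": rng.normal(0, 1, n)})
time = rng.exponential(scale=np.exp(3 - 0.03 * X["age"].to_numpy()))
event = rng.random(n) < 0.7
model = RandomSurvivalForest(n_estimators=50, random_state=0).fit(
    X, Surv.from_arrays(event, time))

def survival_pdp(model, X, feature, grid, horizon):
    """Average predicted survival probability at `horizon` as `feature` varies."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[feature] = v                      # set the feature to v for every row
        surv_fns = model.predict_survival_function(X_mod)
        values.append(np.mean([fn(horizon) for fn in surv_fns]))
    return np.array(values)

grid = np.linspace(35, 75, 5)
print(survival_pdp(model, X, "age", grid, horizon=5.0))
```

The resulting curve shows how the model's survival prediction at the chosen horizon depends on the feature, which is the survival analogue of a standard PDP over a scalar prediction.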