 sepal length


Feature Importance and Explainability in Quantum Machine Learning

arXiv.org Machine Learning

Many Machine Learning (ML) models are referred to as black box models, providing no real insight into why a prediction is made. Feature importance and explainability are important for increasing transparency and trust in ML models, particularly in settings such as healthcare and finance. Quantum computing's unique capabilities, such as leveraging quantum mechanical phenomena like superposition, can be combined with ML techniques to create the field of Quantum Machine Learning (QML), and feature importance and explainability techniques may also be applied to QML models. This article explores feature importance and explainability insights in QML compared to classical ML models. Using the widely recognized Iris dataset, classical ML algorithms such as SVM and Random Forests are compared against hybrid quantum counterparts implemented via IBM's Qiskit platform: the Variational Quantum Classifier (VQC) and Quantum Support Vector Classifier (QSVC). The article aims to compare the insights generated by permutation and leave-one-out feature importance methods, alongside ALE (Accumulated Local Effects) and SHAP (SHapley Additive exPlanations) explainers.
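
To make the permutation method mentioned in this abstract concrete, here is a minimal standard-library sketch. It is not the paper's code: the toy data, the `toy_model` stand-in classifier, and the helper names are all hypothetical. The idea is to shuffle one feature column at a time and record how much accuracy drops.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    # Stand-in for a trained classifier; it only looks at feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, X, y)
```

Because the stand-in model ignores feature 1, shuffling that column leaves accuracy unchanged, while shuffling feature 0 costs a large share of it.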


Decision Predicate Graphs: Enhancing Interpretability in Tree Ensembles

arXiv.org Artificial Intelligence

Understanding the decisions of tree-based ensembles and their relationships is pivotal for machine learning model interpretation. Recent attempts to mitigate the human-in-the-loop interpretation challenge have explored the extraction of the decision structure underlying the model taking advantage of graph simplification and path emphasis. However, while these efforts enhance the visualisation experience, they may either result in a visually complex representation or compromise the interpretability of the original ensemble model. In addressing this challenge, especially in complex scenarios, we introduce the Decision Predicate Graph (DPG) as a model-agnostic tool to provide a global interpretation of the model. DPG is a graph structure that captures the tree-based ensemble model and learned dataset details, preserving the relations among features, logical decisions, and predictions towards emphasising insightful points. Leveraging well-known graph theory concepts, such as the notions of centrality and community, DPG offers additional quantitative insights into the model, complementing visualisation techniques, expanding the problem space descriptions, and offering diverse possibilities for extensions. Empirical experiments demonstrate the potential of DPG in addressing traditional benchmarks and complex classification scenarios.


Machine Learning with Python: Logistic Regression for Binary Classification - Pierian Training

#artificialintelligence

Logistic Regression is a statistical method used for binary classification problems, where the goal is to predict the probability of an event occurring or not. It is a popular algorithm in machine learning, particularly in the field of supervised learning. In this blog post, we will explore the fundamentals of logistic regression and how it can be used to solve binary classification problems. We will also provide Python code examples to help you understand and implement this powerful algorithm in your own projects. Whether you're new to machine learning or an experienced practitioner, this post will provide valuable insights into logistic regression and its applications. For example, a logistic regression model could be built using patient data such as age, gender, family history, and lifestyle factors to predict whether or not a patient is at high risk for developing heart disease.
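
As a rough illustration of the idea in this snippet, the following from-scratch sketch fits a one-feature logistic regression with plain gradient descent on the log loss. It is not the blog post's code; the data values, learning rate, and function names are made up for the example.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, row):
    """Predicted probability of the positive class for one row."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Full-batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, label in zip(X, y):
            err = predict_proba(w, b, row) - label  # derivative of log loss w.r.t. the score
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up data: one feature, label 1 for the larger values.
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(X, y)
```

Calling `predict_proba(w, b, [4.0])` then gives a probability above 0.5, and `predict_proba(w, b, [1.0])` one below it.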


An Introductory Look on NumPy and Pandas

#artificialintelligence

NumPy and Pandas are two very popular Python modules. Both are main components of machine learning and neural network work. This article summarizes their features. Python was developed by Guido van Rossum and first released at the beginning of the 1990s as an open source programming language. With the increasing interest in Python, users contributed their work to the community.


How to Do Hierarchical Clustering in Python ? 5 Easy Steps Only

#artificialintelligence

Hierarchical clustering groups data points based on the distance between neighboring points. Each data point is linked to its nearest neighbors. There are two ways to do hierarchical clustering: agglomerative, a bottom-up approach, and divisive, a top-down approach. In this tutorial, I will use the popular agglomerative approach. To find the number of subgroups in the dataset, you use a dendrogram, a tree graph that lets you see linkages and relatedness. Use cases for this type of clustering include DNA sequencing, sentiment analysis, and tracking virus diseases; other popular use cases are hospital resource management, business process management, and social network analysis. Here we are importing dendrogram, linkage, cluster, and cophenet from scipy.cluster.hierarchy
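
The bottom-up (agglomerative) procedure described here can be sketched without any library. The version below is a from-scratch single-linkage illustration, not the scipy-based code the tutorial uses; the points and the `single_linkage` helper are hypothetical.

```python
import math

def single_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

# Made-up 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
clusters = single_linkage(points, n_clusters=2)
```

A dendrogram records the order and distance of these same merges; stopping at two clusters corresponds to cutting the tree just below its final merge.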


Significance Tests: t-Test, F-Statistic, ANOVA and More -- with Python

#artificialintelligence

This phenomenon is more prevalent in research results where the decision is based solely on the observed data. Observed data alone is not useful or reliable unless the sampling procedure is carefully designed and strict precautions are taken to avoid sampling biases, which might creep into the data and bias the result. You can find more details on the statistical biases here. In order to derive a scientific conclusion based on the data, we should equip ourselves with significance testing, a.k.a. hypothesis testing. Hypothesis testing helps you determine whether the difference between two groups is due to random chance.
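
As one concrete instance of a significance test, the sketch below computes Welch's two-sample t statistic and its approximate degrees of freedom using only the standard library. The two groups are made-up numbers and `welch_t` is a hypothetical helper; turning the statistic into a p-value would additionally need the t distribution, which is left out here.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [6.0, 6.2, 5.9, 6.1, 5.8]
t_stat, dof = welch_t(group_a, group_b)
```

A large absolute t value relative to the group sizes suggests that the difference in means is unlikely to come from random chance alone.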


Getting Started in Manipulating Data with R

#artificialintelligence

To display all the descriptive statistics without typing each command, simply use the summary() command; R will then show each of those stats, including the 25% and 75% quantiles, for every variable. This is similar to the describe() command from Pandas in Python. Other simple and interesting commands for statistical analysis are cor() and cov(), which show the correlation and covariance matrices between the variables.
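
For readers coming from Python, the same kind of statistics can be sketched with the standard library alone. This is an illustration of the quantities involved, not R's or Pandas' implementation; the helper names and the two small columns are made up.

```python
import statistics

def summary(values):
    """Min, quartiles, mean, and max for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Made-up columns; ys is exactly twice xs, so the correlation is 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
stats_x = summary(xs)
cov_xy = covariance(xs, ys)
corr_xy = correlation(xs, ys)
```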


Learning The TensorFlow Way of Linear Regression

#artificialintelligence

We will loop through batches of data points and let TensorFlow update the slope and y-intercept. Instead of generated data, we will use the iris dataset that is built into scikit-learn. Specifically, we will find an optimal line through data points where the x-value is the petal width and the y-value is the sepal length. We choose these two because there appears to be a linear relationship between them, as we will see in the graphs at the end. We will also talk more about the effects of different loss functions in the next section, but for now we will use the L2 loss function.
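
The batch loop described here can be sketched in plain Python to show what the optimizer is doing. This is not the TensorFlow code from the article: the gradients of the L2 loss are written out by hand, the data is synthetic and noise-free rather than the iris measurements, and `fit_line` and its parameters are hypothetical.

```python
import random

def fit_line(xs, ys, lr=0.05, epochs=3000, batch_size=4, seed=0):
    """Mini-batch gradient descent on the mean squared (L2) error."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the points in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_slope = grad_intercept = 0.0
            for i in batch:
                error = slope * xs[i] + intercept - ys[i]
                grad_slope += 2 * error * xs[i]   # d(error^2)/d(slope)
                grad_intercept += 2 * error       # d(error^2)/d(intercept)
            slope -= lr * grad_slope / len(batch)
            intercept -= lr * grad_intercept / len(batch)
    return slope, intercept

# Synthetic stand-in for (petal width, sepal length): an exact line, invented values.
xs = [0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.1, 2.5]
ys = [0.9 * x + 4.8 for x in xs]
slope, intercept = fit_line(xs, ys)
```

Because the synthetic points lie exactly on a line, the loop recovers the slope and intercept used to generate them; with real, noisy measurements the estimates would only approximate the least-squares fit.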


Manual Feature Engineering

#artificialintelligence

There is also a complementary Domino project available. Many data scientists deliver value to their organizations by mapping, developing, and deploying an appropriate ML solution to address a business problem. Feature engineering is useful for data scientists when assessing tradeoff decisions regarding the impact of their ML models. It is a framework for approaching ML as well as providing techniques for extracting features from raw data that can be used within the models. As Domino seeks to help data scientists accelerate their work, we reached out to AWP Pearson for permission to excerpt the chapter "Manual Feature Engineering: Manipulating Data for Fun and Profit" from the book, Machine Learning with Python for Everyone by Mark E. Fenner. Many thanks to AWP Pearson for providing the permissions to excerpt the work and enabling us to provide a complementary publicly viewable Domino project. We are going to turn our attention away from expanding our catalog of models [as mentioned previously in the book] and instead take a closer look at the data. Feature engineering refers to manipulation--addition, deletion, combination, mutation--of the features. Remember that features are attribute-value pairs, so we could add or remove columns from our data table and modify values within columns. Feature engineering can be used in a broad sense and in a narrow sense. I'm going to use it in a broad, inclusive sense and point out some gotchas along the way. Two drivers of feature engineering are (1) background knowledge from the domain of the task and (2) inspection of the data values. The first case includes a doctor's knowledge of important blood pressure thresholds or an accountant's knowledge of tax bracket levels. Another example is the use of body mass index (BMI) by medical providers and insurance companies.
While it has limitations, BMI is quickly calculated from body weight and height and serves as a surrogate for a characteristic that is very hard to accurately measure: proportion of lean body mass. Inspecting the values of a feature means looking at a histogram of its distribution. For distribution-based feature engineering, we might see multimodal distributions--histograms with multiple humps--and decide to break the humps into bins. A major distinction we can make in feature engineering is when it occurs. Our primary question here is whether the feature engineering is performed inside the cross-validation loop or not.
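
The question of whether feature engineering happens inside the cross-validation loop can be illustrated with a small standard-library sketch. It is not code from the book: the helper names and the six values are hypothetical. Here the scaling statistics are recomputed from the training portion of each fold, so the held-out rows never influence them.

```python
import statistics

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous folds."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        stop = n_samples if k == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test

def standardize_within_folds(values, n_folds=3):
    """Standardize each test fold using statistics from its training fold only."""
    results = []
    for train, test in k_fold_indices(len(values), n_folds):
        train_values = [values[i] for i in train]
        mean = statistics.mean(train_values)
        std = statistics.stdev(train_values)
        scaled_test = [(values[i] - mean) / std for i in test]
        results.append({"train_mean": mean, "train_std": std, "scaled_test": scaled_test})
    return results

# Made-up feature column.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = standardize_within_folds(values, n_folds=3)
```

Each fold ends up with its own training mean, which differs from the mean of the full column in two of the three folds. Computing the statistics once on all rows, outside the loop, would let information from the held-out rows leak into the features.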


ML in KSQL

#artificialintelligence

At HomeAway, we use Apache Kafka as the backbone for our streaming architecture. We also like to deploy machine learning models to make real-time predictions on our data streams. Confluent KSQL provides an easy-to-use, interactive SQL interface for performing stream processing on Kafka. Below we show how to build a model in Python and use the model in KSQL to make predictions based on a stream of data in Kafka. We use Predictive Model Markup Language (PMML) so that we can train the model with the Python library Scikit-learn but perform model inference in Java-based KSQL.