 Ensemble Learning


Online Machine Learning Techniques for Coq: A Comparison

arXiv.org Artificial Intelligence

We present a comparison of several online machine learning techniques for tactical learning and proving in the Coq proof assistant. This work builds on top of Tactician, a plugin for Coq that learns from proofs written by the user to synthesize new proofs. This learning happens in an online manner -- meaning that Tactician's machine learning model is updated immediately every time the user performs a step in an interactive proof. This has important advantages compared to the more studied offline learning systems: (1) it provides the user with a seamless, interactive experience with Tactician and (2) it takes advantage of locality of proof similarity, which means that proofs similar to the current proof are likely to be found close by. We implement two online methods, namely approximate $k$-nearest neighbors based on locality sensitive hashing forests and random decision forests. Additionally, we conduct experiments with gradient boosted trees in an offline setting using XGBoost. We compare the relative performance of Tactician using these three learning methods on Coq's standard library.
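
To make the idea of online learning concrete, here is a minimal sketch of a learner that is updated after every proof step and can be queried immediately. It uses exact k-nearest neighbors with Jaccard similarity, not the locality sensitive hashing forests described in the abstract, and the class name, feature sets, and tactic strings are all hypothetical illustrations rather than Tactician's actual interface.

```python
from collections import Counter


class OnlineKNN:
    """Exact k-NN over feature sets; a stand-in for an approximate LSH-based index."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # (feature_set, tactic) pairs, newest last

    def learn(self, features, tactic):
        # Online update: the new example is usable by the very next prediction.
        self.examples.append((frozenset(features), tactic))

    def predict(self, features):
        query = frozenset(features)

        def jaccard(a, b):
            union = a | b
            return len(a & b) / len(union) if union else 0.0

        # Rank by similarity; ties go to the most recent example (locality).
        ranked = sorted(
            enumerate(self.examples),
            key=lambda item: (jaccard(query, item[1][0]), item[0]),
            reverse=True,
        )
        votes = Counter(tactic for _, (_, tactic) in ranked[: self.k])
        return votes.most_common(1)[0][0] if votes else None


model = OnlineKNN(k=1)
model.learn({"nat", "plus", "eq"}, "induction n")   # hypothetical proof-state features
model.learn({"list", "app", "eq"}, "induction l")
first_guess = model.predict({"nat", "plus"})

model.learn({"nat", "plus"}, "lia")                 # the user performs one more step
second_guess = model.predict({"nat", "plus"})       # the model has already adapted
print(first_guess, "->", second_guess)
```

The second prediction changes as soon as the new step is recorded, which is the property that distinguishes the online setting from a model trained once offline.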


Individual Explanations in Machine Learning Models: A Case Study on Poverty Estimation

arXiv.org Artificial Intelligence

A. Relevance of Model Explanations in Real-World Contexts. Complex estimation and decision-making tasks have traditionally been analyzed and judged by human experts. Hence, decisions could typically be complemented with human-interpretable justifications when needed, as experts can normally explain the line of thought that led to their own decision-making. However, in the past two decades, algorithmic decision-making has spread increasingly to many relevant societal contexts. Despite the notable enthusiasm for the potential benefit that this type of technology can bring, the underlying methods used are typically not inherently transparent, in the sense that they do not readily provide human-interpretable justifications for their decisions [1]. Moreover, in recent years the most successful algorithms, particularly in complex tasks like machine vision and natural language processing, have tended to rely on highly complex models, which has further increased the tension between accuracy and interpretability [2]. Relevant societal contexts where algorithmic decision systems have gained substantial traction include medical diagnosis and treatment [3], counter-terrorism [4], criminal justice [5], and risk assessments for credits and insurance [6]. In such impactful contexts, there is a legitimate need for providing human-interpretable explanations along with the estimations and decisions made. Indeed, lack of interpretability has become a barrier to the adoption of machine learning-based systems in many institutions and companies. Hence the value of complementing ML models with human-interpretable accounts of the statistical rationales behind their estimations, so that human decision-makers can more easily understand machine estimations, and even integrate their statistical rationales with qualitative information and human expert judgements.


Utilizing XGBoost training reports to improve your models

#artificialintelligence

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature.


Random forest regressor sklearn : Step By Step Implementation

#artificialintelligence

The RandomForestRegressor class has various hyperparameters, each with a default value, such as n_estimators=100, criterion='mse', max_depth=None, and min_samples_split=2. We can choose their optimal values using hyperparameter tuning techniques such as GridSearchCV and RandomizedSearchCV. Most importantly, in this article we will demonstrate an end-to-end implementation of the random forest regressor in sklearn. First, you import the package using the import statement. Second, we create the random forest regressor object.
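
The steps described above can be sketched as follows. This is a minimal illustration, not the article's own code: the synthetic dataset from make_regression, the small parameter grid, and the variable names are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption; substitute your own dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Create the regressor; hyperparameters not listed here keep their defaults.
forest = RandomForestRegressor(random_state=0)

# A deliberately small, illustrative grid searched with 3-fold cross-validation.
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 3]}
search = GridSearchCV(forest, param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
predictions = search.best_estimator_.predict(X_test)
test_r2 = search.best_estimator_.score(X_test, y_test)  # R^2 on held-out data
print(best_params, round(test_r2, 3))
```

GridSearchCV tries every combination in the grid, so it suits small grids like this one; RandomizedSearchCV samples a fixed number of combinations and tends to be the more practical choice when the search space is large.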


XGBoost Algorithm: Long May She Reign!

#artificialintelligence

Decision Tree: Every hiring manager has a set of criteria such as education level, number of years of experience, and interview performance. A decision tree is analogous to a hiring manager interviewing candidates based on his or her own criteria. Bagging: Now imagine that instead of a single interviewer there is an interview panel where each interviewer has a vote. Bagging, or bootstrap aggregating, involves combining inputs from all interviewers for the final decision through a democratic voting process. Random Forest: It is a bagging-based algorithm with a key difference: only a subset of features is selected at random.
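
The interview-panel analogy can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: each "interviewer" is a one-split decision stump rather than a full tree, the candidate data (years of experience, interview score) is invented, and all function names are hypothetical. Because a stump has only one split, choosing the random feature subset once per tree matches the per-split selection that random forests use.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """One interviewer: the single (feature, threshold) split with fewest errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            left_vote = Counter(left).most_common(1)[0][0]
            right_vote = Counter(right).most_common(1)[0][0] if right else left_vote
            errors = sum(l != left_vote for l in left)
            errors += sum(l != right_vote for l in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_vote, right_vote)
    _, f, threshold, left_vote, right_vote = best
    return lambda row: left_vote if row[f] <= threshold else right_vote


def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    trees = []
    for _ in range(n_trees):
        # Bagging: each interviewer sees a bootstrap sample of the candidates.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random forest twist: each interviewer may only use a random feature subset.
        features = rng.sample(all_features, n_features)
        trees.append(train_stump(sample_rows, sample_labels, features))
    return trees


def predict(trees, row):
    # Democratic vote across the whole panel.
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]


# Invented candidates: (years_of_experience, interview_score).
rows = [(1, 3), (2, 4), (3, 8), (5, 5), (6, 9), (8, 7), (9, 4), (10, 9)]
labels = ["reject", "reject", "hire", "reject", "hire", "hire", "reject", "hire"]
trees = train_forest(rows, labels)
print(predict(trees, (7, 9)))
```

Setting n_features to the total number of features turns the sketch back into plain bagging, since every interviewer can then consider every criterion.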


Decision Trees, Random Forests & Gradient Boosting in R

#artificialintelligence

Would you like to build predictive models using machine learning? That's precisely what you will learn in this course "Decision Trees, Random Forests and Gradient Boosting in R." My name is Carlos Martínez, I have a Ph.D. in Management from the University of St. Gallen in Switzerland. I have presented my research at some of the most prestigious academic conferences and doctoral colloquiums at the University of Tel Aviv, Politecnico di Milano, University of Halmstad, and MIT. Furthermore, I have co-authored more than 25 teaching cases, some of them included in the case bases of Harvard and Michigan. This is a very comprehensive course that includes presentations, tutorials, and assignments. The course has a practical approach based on the learning-by-doing method in which you will learn decision trees and ensemble methods based on decision trees using a real dataset.


MOAI: A methodology for evaluating the impact of indoor airflow in the transmission of COVID-19

arXiv.org Machine Learning

Epidemiology models play a key role in understanding and responding to the COVID-19 pandemic. In order to build those models, scientists need to understand contributing factors and their relative importance. A large strand of literature has identified the importance of airflow to mitigate droplets and far-field aerosol transmission risks. However, the specific factors contributing to higher or lower contamination in various settings have not been clearly defined and quantified. As part of the MOAI project (https://moaiapp.com), we are developing a privacy-preserving test and trace app to enable infection cluster investigators to get in touch with patients without having to know their identity. This approach allows involving users in the fight against the pandemic by contributing additional information in the form of anonymous research questionnaires. We first describe how the questionnaire was designed, and the synthetic data was generated based on a review we carried out on the latest available literature. We then present a model to evaluate the risk exposure of a user for a given setting. We finally propose a temporal addition to the model to evaluate the risk exposure over time for a given user.


Classifying the Unstructured IT Service Desk Tickets Using Ensemble of Classifiers

arXiv.org Artificial Intelligence

Manual classification of IT service desk tickets may result in routing of the tickets to the wrong resolution group. Incorrect assignment of IT service desk tickets leads to reassignment of tickets, unnecessary resource utilization, and delayed resolution. Traditional machine learning algorithms can be used to automatically classify the IT service desk tickets. Service desk ticket classifier models can be trained by mining the historical unstructured ticket descriptions and the corresponding labels. The model can then be used to classify a new service desk ticket based on the ticket description. The performance of the traditional classifier systems can be further improved by using various ensemble classification techniques. This paper presents the three most popular ensemble methods, i.e., Bagging, Boosting, and Voting ensembles, for combining the predictions from different models to further improve the accuracy of the ticket classifier system. The performance of the ensemble classifier system is checked against the individual base classifiers using various performance metrics. Ensembles of classifiers performed well in comparison with the corresponding base classifiers. The advantages of building such an automated ticket classifier system are a simplified user interface, faster resolution time, improved productivity, customer satisfaction, and growth in business. Real-world service desk ticket data from a large enterprise IT infrastructure is used for our research.


Individually Fair Gradient Boosting

arXiv.org Machine Learning

We consider the task of enforcing individual fairness in gradient boosting. Gradient boosting is a popular method for machine learning from tabular data, which arise often in applications where algorithmic fairness is a concern. At a high level, our approach is a functional gradient descent on a (distributionally) robust loss function that encodes our intuition of algorithmic fairness for the ML task at hand. Unlike prior approaches to individual fairness that only work with smooth ML models, our approach also works with non-smooth models such as decision trees. We show that our algorithm converges globally and generalizes. We also demonstrate the efficacy of our algorithm on three ML problems susceptible to algorithmic bias.


Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest

arXiv.org Machine Learning

Due to their long-standing reputation as excellent off-the-shelf predictors, random forests remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently, little was known about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged -- one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Despite the fact that default constructions of random forests use near full depth trees in most popular software packages, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that random forests with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in random forests by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.