Collaborating Authors

 hammoudeh


Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees

Brophy, Jonathan, Hammoudeh, Zayd, Lowd, Daniel

arXiv.org Artificial Intelligence

Influence estimation analyzes how changes to the training data can lead to different model predictions; this analysis can help us better understand these predictions, the models making those predictions, and the data sets they're trained on. However, most influence-estimation techniques are designed for deep learning models with continuous parameters. Gradient-boosted decision trees (GBDTs) are a powerful and widely used class of models; however, these models are black boxes with opaque decision-making processes. In the pursuit of better understanding GBDT predictions and generally improving these models, we adapt recent and popular influence-estimation methods designed for deep learning models to GBDTs. Specifically, we adapt representer-point methods and TracIn, denoting our new methods TREX and BoostIn, respectively; source code is available at https://github.com/jjbrophy47/tree_influence. We compare these methods to LeafInfluence and other baselines using 5 different evaluation measures on 22 real-world data sets with 4 popular GBDT implementations. These experiments give us a comprehensive overview of how different approaches to influence estimation work in GBDT models. We find BoostIn is an efficient influence-estimation method for GBDTs that performs as well as or better than existing work while being four orders of magnitude faster. Our evaluation also suggests the gold-standard approach of leave-one-out (LOO) retraining consistently identifies the single most influential training example but performs poorly at finding the most influential set of training examples for a given target prediction.
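The leave-one-out (LOO) retraining baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code and does not use the tree_influence package: the names `fit_line` and `loo_influence` are hypothetical, a one-dimensional least-squares line stands in for a GBDT, and influence is assumed here to mean the change in squared test loss when one training example is removed and the model is retrained.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]  # (feature, label)


def fit_line(data: Sequence[Example]) -> Callable[[float], float]:
    """Least-squares fit of y = a + b*x (assumed stand-in for any retrainable model)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = cov_xy / var_x if var_x > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def loo_influence(
    train: Sequence[Example],
    test: Example,
    fit: Callable[[Sequence[Example]], Callable[[float], float]] = fit_line,
) -> List[float]:
    """Return one score per training example: (test loss without it) - (test loss with it).

    A negative score means removing the example lowers the test loss,
    i.e. the example was hurting this particular prediction.
    """
    x_test, y_test = test
    base_loss = (fit(train)(x_test) - y_test) ** 2
    scores = []
    for i in range(len(train)):
        # Retrain from scratch on the training set minus example i.
        reduced = list(train[:i]) + list(train[i + 1:])
        loss_without = (fit(reduced)(x_test) - y_test) ** 2
        scores.append(loss_without - base_loss)
    return scores


# Toy data (made up for illustration): three points on y = x plus one outlier.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 10.0)]
test = (4.0, 4.0)
scores = loo_influence(train, test)
most_harmful = min(range(len(train)), key=scores.__getitem__)
print(most_harmful, scores)
```

Each score requires a full retrain, so the cost grows linearly with the training set size. That cost is why faster approximations of influence are of interest in the first place.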


A Reflection on Learning from Data: Epistemology Issues and Limitations

Hammoudeh, Ahmad, Tedmori, Sara, Obeid, Nadim

arXiv.org Artificial Intelligence

Although learning from data is effective and has achieved significant milestones, it faces many challenges and limitations. Learning from data starts from observations and then proceeds to broader generalizations. This framework is controversial in science, yet it has achieved remarkable engineering successes. This paper reflects on some epistemological issues and some of the limitations of the knowledge discovered in data. From both theoretical and practical perspectives, the paper examines the common perception that more data is the key to better machine learning models. The paper sheds some light on the shortcomings of using generic mathematical theories to describe the learning process. It further highlights the need for theories specialized in learning from data. While more data generally improves the performance of machine learning models, the relation in practice is shown to be logarithmic at best; beyond a certain point, more data leaves performance unchanged or even degrades it. Recent work in reinforcement learning shows a trend away from data-oriented approaches and toward greater reliance on algorithms. The paper concludes that learning from data is hindered by many limitations. Hence, an approach with an intensional orientation is needed.