Regression
Better Technical Debt Detection via SURVEYing
Fahid, Fahmid M., Yu, Zhe, Menzies, Tim
Software analytics can be improved by surveying; i.e. rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort. This paper proposes SURVEY0, an incremental Logistic Regression estimation method that implements cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimator of how many examples remain. This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software projects mature, they can accumulate "technical debt", i.e. developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e. it is self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source JAVA projects, we can find 83% of the technical debt via SURVEY0 using just 16% of the comments (and if higher levels of recall are required, SURVEY0 can adjust towards that with some additional effort).
Accelerated Discovery of Sustainable Building Materials
Ge, Xiou, Goodwin, Richard T., Gregory, Jeremy R., Kirchain, Randolph E., Maria, Joana, Varshney, Lav R.
Concrete is the most widely used engineered material in the world with more than 10 billion tons produced annually. Unfortunately, with that scale comes a significant burden in terms of energy, water, and release of greenhouse gases and other pollutants. As such, there is interest in creating concrete formulas that minimize this environmental burden, while satisfying engineering performance requirements. Recent advances in artificial intelligence have enabled machines to generate highly plausible artifacts, such as images of realistic looking faces. Semi-supervised generative models allow generation of artifacts with specific, desired characteristics. In this work, we use Conditional Variational Autoencoders (CVAE), a type of semi-supervised generative model, to discover concrete formulas with desired properties. Our model is trained using open data from the UCI Machine Learning Repository joined with environmental impact data computed using a web-based tool. We demonstrate CVAEs can design concrete formulas with lower emissions and natural resource usage while meeting design requirements. To ensure fair comparison between extant and generated formulas, we also train regression models to predict the environmental impacts and strength of discovered formulas. With these results, a construction engineer may create a formula that meets structural needs and best addresses local environmental concerns.
Digital Medicine: A Primer on Measurement
Technology is changing how we practice medicine. Sensors and wearables are getting smaller and cheaper, and algorithms are becoming powerful enough to predict medical outcomes. Yet despite rapid advances, healthcare lags behind other industries in truly putting these technologies to use. A major barrier to entry is the cross-disciplinary approach required to create such tools, requiring knowledge from many people across many fields. We aim to drive the field forward by unpacking that barrier, providing a brief introduction to core concepts and terms that define digital medicine. Specifically, we contrast "clinical research" versus routine "clinical care," outlining the security, ethical, regulatory, and legal issues developers must consider as digital medicine products go to market. We classify types of digital measurements and how to use and validate these measures in different settings. To make this resource engaging and accessible, we have included illustrations and figures ...
Prediction of Construction Cost for Field Canals Improvement Projects in Egypt
Field canals improvement projects (FCIPs) are among the ambitious projects constructed to save fresh water. To finance these projects, conceptual cost models are important for accurately predicting preliminary costs at the early stages of a project. The first step in developing a conceptual cost model is to identify the key cost drivers affecting the project. Input variable selection is therefore an important part of model development, as poor variable selection can decrease model precision. The study identified the most important cost drivers of FCIPs using both a qualitative approach and a quantitative approach. Subsequently, the study developed a parametric cost model based on machine learning methods such as regression methods, artificial neural networks, fuzzy models and case-based reasoning.
Sparse Transfer Learning via Winning Lottery Tickets
The recently proposed Lottery Ticket Hypothesis of Frankle and Carbin (2019) suggests that the performance of over-parameterized deep networks is due to the random initialization seeding the network with a small fraction of favorable weights. These weights retain their dominant status throughout training -- in a very real sense, this sub-network "won the lottery" during initialization. The authors find sub-networks via unstructured magnitude pruning with 85-95% of parameters removed that train to the same accuracy as the original network at a similar speed, which they call winning tickets. In this paper, we extend the Lottery Ticket Hypothesis to a variety of transfer learning tasks. We show that sparse sub-networks with approximately 90-95% of weights removed achieve (and often exceed) the accuracy of the original dense network in several realistic settings. We experimentally validate this by transferring the sparse representation found via pruning on CIFAR-10 to SmallNORB and FashionMNIST for object recognition tasks.
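As a minimal sketch of the unstructured magnitude pruning mentioned above: the function below zeroes the smallest-magnitude fraction of a flat list of weights and returns the binary mask that defines the sparse sub-network. The name `magnitude_prune`, the flat-list representation, and pruning in a single global pass are illustrative assumptions, not the procedure used in the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration: `weights` is a flat list of floats
    and `sparsity` is the fraction to remove (0.9 removes 90%).
    Returns (pruned_weights, mask), where mask[i] is 1 for a kept weight.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.05, -1.2, 0.3, -0.01, 0.8, 0.002, -0.4, 0.09, 1.5, -0.07]
pruned, mask = magnitude_prune(weights, sparsity=0.8)
# Only the two largest-magnitude weights (-1.2 and 1.5) survive.
```

In a transfer setting, one would presumably reuse `mask` when training on the target task, so that the pruned positions stay at zero.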
Enterprise AI: Diving into Machine Learning
Data in the real world, of course, isn't as simple as it is in the previous example. There are always complexities and nuances to data. To stick with our housing market example, the value of houses might also be influenced by dwelling type, lot size, recent upgrades, proximity to a neighborhood park and intangible variables like curbside appeal. And, in the real world, houses wouldn't all be in the same neighborhood, so your machine learning model must also consider the ZIP code for the property. To consider this wider range of variables, we need to dig deeper into the data scientist's toolbox and pull out some more sophisticated machine learning methods, including random forests and gradient boosting.
Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees
Devlin, Summer, Singh, Chandan, Murdoch, W. James, Yu, Bin
Tree ensembles, such as random forests and AdaBoost, are ubiquitous machine learning models known for achieving strong predictive performance across a wide variety of domains. However, this strong performance comes at the cost of interpretability (i.e. users are unable to understand the relationships a trained random forest has learned and why it is making its predictions). In particular, it is challenging to understand how the contribution of a particular feature, or group of features, varies as their value changes. To address this, we introduce Disentangled Attribution Curves (DAC), a method to provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, DAC plots the importance of the variable(s) as their values change. We validate DAC on real data by showing that the curves can be used to increase the accuracy of logistic regression while maintaining interpretability, by including DAC as an additional feature. In simulation studies, DAC is shown to outperform competing methods in the recovery of conditional expectations. Finally, through a case study on the bike-sharing dataset, we demonstrate the use of DAC to uncover novel insights into a dataset.
Gradient tree boosting with random output projections for multi-label classification and multi-output regression
Joly, Arnaud, Wehenkel, Louis, Geurts, Pierre
Multi-output supervised learning aims to model input-output relationships from observations of input-output pairs whenever the output space is a vector of random variables. Multi-output classification and regression tasks have numerous applications in domains ranging from biology to multimedia, and recent applications in this area correspond to very high dimensional output spaces (Agrawal et al, 2013; Dekel and Shamir, 2010). Classification and regression trees (Breiman et al, 1984) are popular supervised learning methods that provide state-of-the-art performance when exploited in the context of ensemble methods, namely Random forests (Breiman, 2001; Geurts et al, 2006) and Boosting (Freund and Schapire, 1997; Friedman, 2001). Classification and regression trees can obviously be exploited to handle multi-output problems. The most straightforward way to address multi-output tasks is to apply standard single output methods separately and independently on each output. Although simple, this method, called binary relevance (Tsoumakas et al, 2009) in multi-label classification or single target (Spyromitros-Xioufis et al, 2012) in multi-output regression, is often suboptimal as it does not exploit potential correlations that might exist between the outputs. Tree ensemble methods have however been explicitly extended by several authors to the joint prediction of multiple outputs (e.g., Segal, 1992; Blockeel et al, 2000). These extensions build a single tree to predict all outputs at once. They adapt the score measure used to assess splits during the tree growth to take into account all outputs and label each tree leaf with a vector of values, one for each output.
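The idea of a split score that accounts for all outputs can be sketched as follows. This assumes one common choice, the sum of per-output variances as the impurity, which is not necessarily the measure used in the cited works; the function names are hypothetical.

```python
def column_means(rows):
    """Mean of each output column; also the vector used to label a leaf."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def total_variance(rows):
    """Sum over outputs of the (population) variance of that output."""
    n = len(rows)
    means = column_means(rows)
    return sum(
        sum((v - m) ** 2 for v in col) / n
        for col, m in zip(zip(*rows), means)
    )


def split_score(x, Y, threshold):
    """Impurity reduction from splitting on x <= threshold, over all outputs."""
    left = [y for xi, y in zip(x, Y) if xi <= threshold]
    right = [y for xi, y in zip(x, Y) if xi > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(Y)
    return (
        total_variance(Y)
        - len(left) / n * total_variance(left)
        - len(right) / n * total_variance(right)
    )


x = [1, 2, 3, 4]
Y = [[0, 10], [0, 10], [1, 20], [1, 20]]  # two outputs per example
score = split_score(x, Y, threshold=2.5)
left_leaf = column_means(Y[:2])   # vector label for the left leaf
right_leaf = column_means(Y[2:])  # vector label for the right leaf
```

Here a single split separates both outputs at once, which is the kind of shared structure that fitting each output independently would have to rediscover.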
Horseshoe priors
Regularization is a fascinating topic that has puzzled me for a long time. First introduced in a machine learning course as a given, it always raised the question of why it works. Then I started to uncover a connection between regularization and the statistical properties of the underlying model. Indeed, if we consider a linear regression model, it is easy to show that L2 regularization is equivalent to adding Gaussian noise to the input. In fact, the latter is preferred if we consider feature interactions (or we have to use a non-trivial Tikhonov Matrix, e.g.
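The claimed link between L2 regularization and input noise can be made explicit with a short derivation. This is a sketch under stated assumptions that are not spelled out in the excerpt: a linear model $w^\top x$ with squared loss, and independent zero-mean noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I)$ added to each input. The equivalence then holds in expectation over the noise.

```latex
\begin{aligned}
\mathbb{E}_{\varepsilon_i}\!\left[\bigl(y_i - w^\top (x_i + \varepsilon_i)\bigr)^2\right]
  &= (y_i - w^\top x_i)^2
     - 2\,(y_i - w^\top x_i)\, w^\top \mathbb{E}[\varepsilon_i]
     + w^\top \mathbb{E}[\varepsilon_i \varepsilon_i^\top]\, w \\
  &= (y_i - w^\top x_i)^2 + \sigma^2 \lVert w \rVert_2^2 , \\[4pt]
\sum_{i=1}^{n} \mathbb{E}_{\varepsilon_i}\!\left[\bigl(y_i - w^\top (x_i + \varepsilon_i)\bigr)^2\right]
  &= \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + n\,\sigma^2 \lVert w \rVert_2^2 .
\end{aligned}
```

The right-hand side is the ridge objective with penalty $\lambda = n\sigma^2$. The cross term vanishes because the noise has zero mean, and the quadratic term uses $\mathbb{E}[\varepsilon_i \varepsilon_i^\top] = \sigma^2 I$.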
Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects
Guan, Zoe, Parmigiani, Giovanni, Patil, Prasad
A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) cross-study learning, which involves training a separate learner on each dataset and combining the resulting predictions. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than cross-study learning when the predictor-outcome relationships are relatively homogeneous across studies. However, as heterogeneity increases, there exists a transition point beyond which cross-study learning outperforms merging. We provide analytic expressions for the transition point in various scenarios and study asymptotic properties.
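The two strategies compared above can be sketched for a one-feature linear model without intercept. The equal-weight averaging of per-study predictions and the function names are illustrative assumptions; the paper may use other learners and weighting schemes.

```python
def ols_slope(xs, ys):
    """Least-squares slope for y = b * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def merged_predict(studies, x_new):
    """Pool all studies into one dataset and fit a single learner."""
    all_x = [x for xs, _ in studies for x in xs]
    all_y = [y for _, ys in studies for y in ys]
    return ols_slope(all_x, all_y) * x_new


def ensemble_predict(studies, x_new):
    """Fit one learner per study and average their predictions equally."""
    preds = [ols_slope(xs, ys) * x_new for xs, ys in studies]
    return sum(preds) / len(preds)


# Two hypothetical studies with different predictor-outcome slopes (1 and 3).
studies = [
    ([1.0, 2.0], [1.0, 2.0]),
    ([10.0], [30.0]),
]
merged = merged_predict(studies, x_new=1.0)      # dominated by the large-x study
ensembled = ensemble_predict(studies, x_new=1.0)  # each study counts equally
```

With heterogeneous slopes the two predictions differ noticeably; when every study shares the same slope and the data are noise-free, both recover it exactly.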