 Regression


Next-Year Bankruptcy Prediction from Textual Data: Benchmark and Baselines

arXiv.org Artificial Intelligence

Models for bankruptcy prediction are useful in several real-world scenarios, and multiple research contributions have been devoted to the task, based on structured (numerical) as well as unstructured (textual) data. However, the lack of a common benchmark dataset and evaluation strategy impedes the objective comparison between models. This paper introduces such a benchmark for the unstructured data scenario, based on novel and established datasets, in order to stimulate further research into the task. We describe and evaluate several classical and neural baseline models, and discuss benefits and flaws of different strategies. In particular, we find that a lightweight bag-of-words model based on static in-domain word representations obtains surprisingly good results, especially when taking textual data from several years into account. These results are critically assessed, and discussed in light of particular aspects of the data and the task. All code to replicate the data and experimental results will be released.


Automatic detection of faults in race walking from a smartphone camera: a comparison of an Olympic medalist and university athletes

arXiv.org Artificial Intelligence

Automatic fault detection is a major challenge in many sports. In race walking, referees visually judge faults according to the rules. Hence, ensuring objectivity and fairness while judging is important. To address this issue, some studies have attempted to use sensors and machine learning to automatically detect faults. However, there are problems associated with sensor attachments and equipment such as high-speed cameras, which conflict with the visual judgement of referees, and with the interpretability of the fault detection models. In this study, we propose a fault detection system based on non-contact measurement. We use pose estimation and machine learning models trained on the judgements of multiple qualified referees to realize fair fault judgement. We verify them using smartphone videos of normal race walking and walking with intentional faults by several athletes, including a medalist of the Tokyo Olympics. The validation results show that the proposed system detects faults with an average accuracy of over 90%. We also show that the machine learning model detects faults according to the rules of race walking. In addition, the intentional faulty walking movement of the medalist was different from that of the university walkers. This finding informs the realization of a more general fault detection model. The code and data are available at https://github.com/SZucchini/racewalk-aijudge.


An ensemble meta-estimator to predict source code testability

arXiv.org Artificial Intelligence

Unlike most other software quality attributes, testability cannot be evaluated solely based on the characteristics of the source code. The effectiveness of the test suite and the budget assigned to the test highly impact the testability of the code under test. The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. Therefore, testability can be measured based on the coverage and number of test cases provided by a test suite, considering the test budget. This paper offers a new equation to estimate testability from the size and coverage of a given test suite. The equation has been used to label 23,000 classes belonging to 110 Java projects with their testability measure. The labeled classes were vectorized using 262 metrics. The labeled vectors were fed into a family of supervised regression algorithms to predict testability in terms of the source code metrics. Regression models predicted testability with an $R^2$ of 0.68 and a mean squared error of 0.03, which is suitable in practice. Fifteen software metrics highly affecting testability prediction were identified using a feature importance analysis technique on the learned model. The proposed models have improved mean absolute error by 38% due to utilizing new criteria, metrics, and data compared with the relevant study on predicting branch coverage as a test criterion. As an application of testability prediction, it is demonstrated that automated refactoring of 42 smelly Java classes targeted at improving the 15 influential software metrics could elevate their testability by an average of 86.87%.


Identifying and Overcoming Transformation Bias in Forecasting Models

arXiv.org Artificial Intelligence

Log and square root transformations of the target variable are routinely used in forecasting models to predict future sales. These transformations often lead to better performing models. However, they also introduce a systematic negative bias (under-forecasting). In this paper, we demonstrate the existence of this bias, dive deep into its root cause, and introduce two methods to correct for the bias. We conclude that the proposed bias correction methods improve model performance (by up to 50%) and make a case for incorporating bias correction in the modeling workflow. We also experiment with the 'Tweedie' family of cost functions, which circumvents the transformation bias issue by modeling directly on sales. We conclude that Tweedie regression gives the best performance so far when modeling on sales, making it a strong alternative to working with a transformed target variable.
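The under-forecasting described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it assumes hypothetical log-normally distributed sales, takes the "forecast" to be the mean of the log-transformed values, and uses the textbook log-normal correction factor exp(variance / 2), which may differ from the two corrections the authors propose. The function name and parameter values are illustrative only.

```python
import math
import random


def simulate_log_bias(n=100_000, mu=3.0, sigma=1.0, seed=0):
    """Compare a naive back-transformed forecast with the true mean.

    Assumption: sales are log-normal, so log(sales) ~ Normal(mu, sigma).
    """
    rng = random.Random(seed)
    sales = [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]
    logs = [math.log(s) for s in sales]

    # Mean of the untransformed target: what a forecast should match.
    true_mean = sum(sales) / n

    # Naive forecast: average in log space, then back-transform.
    mean_log = sum(logs) / n
    naive = math.exp(mean_log)

    # Textbook log-normal correction (an assumption, not the paper's method):
    # E[exp(Y)] = exp(mean + variance / 2) when Y is normally distributed.
    var_log = sum((y - mean_log) ** 2 for y in logs) / n
    corrected = naive * math.exp(var_log / 2)

    return true_mean, naive, corrected


if __name__ == "__main__":
    true_mean, naive, corrected = simulate_log_bias()
    print(f"true mean: {true_mean:.2f}")
    print(f"naive:     {naive:.2f}")
    print(f"corrected: {corrected:.2f}")
```

The naive value is the geometric mean of the sales, which can never exceed their arithmetic mean, so the back-transformed forecast is biased downward whenever the data vary at all.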


We Are in This Together: Quantifying Community Subjective Wellbeing and Resilience

arXiv.org Artificial Intelligence

The COVID-19 pandemic disrupted everyone's life across the world. In this work, we characterize the subjective wellbeing patterns of 112 cities across the United States during the pandemic prior to vaccine availability, as exhibited in subreddits corresponding to the cities. We quantify subjective wellbeing using positive and negative affect. We then measure the pandemic's impact by comparing a community's observed wellbeing with its expected wellbeing, as forecasted by time series models derived from prior to the pandemic. We show that general community traits reflected in language can be predictive of community resilience. We predict how the pandemic would impact the wellbeing of each community based on linguistic and interaction features from normal times before the pandemic. We find that communities with interaction characteristics corresponding to more closely connected users and higher engagement were less likely to be significantly impacted. Notably, we find that communities that talked more about social ties normally experienced in-person, such as friends, family, and affiliations, were actually more likely to be impacted. Additionally, we use the same features to also predict how quickly each community would recover after the initial onset of the pandemic. We similarly find that communities that talked more about family, affiliations, and identifying as part of a group had a slower recovery.


What can we Learn by Predicting Accuracy?

arXiv.org Artificial Intelligence

This paper seeks to answer the following question: "What can we learn by predicting accuracy?". Indeed, classification is one of the most popular tasks in machine learning, and many loss functions have been developed to maximize this non-differentiable objective function. Unlike past work on loss function design, which was guided mainly by intuition and theory before being validated by experimentation, here we propose to approach this problem in the opposite way: we seek to extract knowledge by experimentation. This data-driven approach is similar to that used in physics to discover general laws from data. We used a symbolic regression method to automatically find a mathematical expression highly correlated with a linear classifier's accuracy. The formula discovered on more than 260 datasets of embeddings has a Pearson's correlation of 0.96 and an $r^2$ of 0.93. More interestingly, this formula is highly explainable and confirms insights from various previous papers on loss design. We hope this work will open new perspectives in the search for new heuristics leading to a deeper understanding of machine learning theory.


Survival Mixture Density Networks

arXiv.org Artificial Intelligence

Survival analysis, the art of time-to-event modeling, plays an important role in clinical treatment decisions. Recently, continuous time models built from neural ODEs have been proposed for survival analysis. However, the training of neural ODEs is slow due to the high computational complexity of neural ODE solvers. Here, we propose an efficient alternative for flexible continuous time models, called Survival Mixture Density Networks (Survival MDNs). Survival MDN applies an invertible positive function to the output of Mixture Density Networks (MDNs). While MDNs produce flexible real-valued distributions, the invertible positive function maps the model into the time domain while preserving a tractable density. Using four datasets, we show that Survival MDN performs better than, or similarly to, continuous and discrete time baselines on concordance, integrated Brier score, and integrated binomial log-likelihood. Meanwhile, Survival MDNs are also faster than ODE-based models and circumvent binning issues in discrete models.
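The idea of mapping a real-valued mixture into the time domain while keeping a tractable density can be sketched with the change-of-variables formula. The example below is only an illustration under stated assumptions: it uses exp as the invertible positive function (the abstract does not say which function the authors use), a fixed Gaussian mixture in place of a neural network's output, and hypothetical function names and parameter values.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(y, mu, sigma):
    """Cumulative distribution of Normal(mu, sigma) at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def time_density(t, weights, mus, sigmas):
    """Density of T = exp(Y), where Y follows a Gaussian mixture.

    Change of variables: f_T(t) = f_Y(log t) * |d(log t)/dt| = f_Y(log t) / t.
    """
    y = math.log(t)
    mix = sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))
    return mix / t


def survival(t, weights, mus, sigmas):
    """Survival function S(t) = P(T > t) = 1 - F_Y(log t)."""
    y = math.log(t)
    return 1.0 - sum(
        w * normal_cdf(y, m, s) for w, m, s in zip(weights, mus, sigmas)
    )


if __name__ == "__main__":
    # Hypothetical mixture parameters; in a trained model these would be
    # produced by the network for each individual.
    weights, mus, sigmas = [0.3, 0.7], [0.0, 1.5], [0.5, 0.8]
    for t in (0.5, 1.0, 2.0, 5.0):
        print(t, time_density(t, weights, mus, sigmas),
              survival(t, weights, mus, sigmas))
```

Because both the density and the survival function are available in closed form, a likelihood for censored and uncensored observations can be evaluated without an ODE solver or time binning.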


20 Most Asked Interview Questions of Machine Learning - Analytics Vidhya

#artificialintelligence

This article was published as a part of the Data Science Blogathon. Companies are trying to disrupt technology and business markets by adopting new-age technologies such as artificial intelligence and machine learning to introduce smart new products and techniques. Every organization is searching for talented, experienced people who can meet its demands. Today, data scientists, data analysts, machine learning engineers, and computer vision engineers are among the most in-demand roles. If you wish to land a job in the tech domain, it is crucial to know the common machine learning interview questions that recruiters ask. This article covers some popular machine learning interview questions that will push you to think one step beyond your current knowledge and help you land your dream job.


Scalable Hybrid Classification-Regression Solution for High-Frequency Nonintrusive Load Monitoring

arXiv.org Artificial Intelligence

Residential buildings with the ability to monitor and control their net-load (sum of load and generation) can provide valuable flexibility to power grid operators. We present a novel multiclass nonintrusive load monitoring (NILM) approach that enables effective net-load monitoring capabilities at high frequency with minimal additional equipment and cost. The proposed machine learning based solution provides accurate multiclass state predictions while operating at a faster timescale (able to provide a prediction for each 60-Hz ac cycle used in the US power grid) without relying on event-detection techniques. We also introduce an innovative hybrid classification-regression method that allows for the prediction of not only load on/off states via classification but also individual load operating power levels via regression. A test bed with eight residential appliances is used for validating the NILM approach. Results show that the overall method has high accuracy and good scaling and generalization properties. Furthermore, the method is shown to have sufficient response time (within 160 ms, corresponding to 10 ac cycles) to support building grid-interactive control at fast timescales relevant to the provision of grid frequency support services.


Anomaly Attribution with Likelihood Compensation

arXiv.org Artificial Intelligence

This paper addresses the task of explaining anomalous predictions of a black-box regression model. When using a black-box model, such as one to predict building energy consumption from many sensor measurements, we often have a situation where some observed samples may significantly deviate from their prediction. It may be due to a sub-optimal black-box model, or simply because those samples are outliers. In either case, one would ideally want to compute a "responsibility score" indicative of the extent to which an input variable is responsible for the anomalous output. In this work, we formalize this task as a statistical inverse problem: given model deviation from the expected value, infer the responsibility score of each of the input variables. We propose a new method called likelihood compensation (LC), which is founded on the likelihood principle and computes a correction to each input variable. To the best of our knowledge, this is the first principled framework that computes a responsibility score for real-valued anomalous model deviations. We apply our approach to a real-world building energy prediction task and confirm its utility based on expert feedback.