 antipattern


Data Quality Antipatterns for Software Analytics

arXiv.org Artificial Intelligence

Background: Data quality is vital in software analytics, particularly for machine learning (ML) applications like software defect prediction (SDP). Despite the widespread use of ML in software engineering, the effect of data quality antipatterns on these models remains underexplored. Objective: This study develops a taxonomy of ML-specific data quality antipatterns and assesses their impact on software analytics models' performance and interpretation. Methods: We identified eight types and 14 sub-types of ML-specific data quality antipatterns through a literature review. We conducted experiments to determine the prevalence of these antipatterns in SDP data (RQ1), assess how cleaning order affects model performance (RQ2), evaluate the impact of antipattern removal on performance (RQ3), and examine the consistency of interpretation from models built with different antipatterns (RQ4). Results: In our SDP case study, we identified nine antipatterns. Over 90% of these overlapped at both row and column levels, complicating cleaning prioritization and risking excessive data removal. The order of cleaning significantly impacts ML model performance, with neural networks being more resilient to cleaning order changes than simpler models like logistic regression. Antipatterns such as Tailed Distributions and Class Overlap show a statistically significant correlation with performance metrics when other antipatterns are cleaned. Models built with different antipatterns showed moderate consistency in interpretation results. Conclusion: The cleaning order of different antipatterns impacts ML model performance. Five antipatterns have a statistically significant correlation with model performance when others are cleaned. Additionally, model interpretation is moderately affected by different data quality antipatterns.
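The claim that cleaning order matters can be illustrated with a toy sketch. It is not the study's procedure: the two cleaning steps (dropping duplicate values and trimming a distribution tail with a z-score rule), the function names, the threshold `z_max`, and the sample data are all assumptions chosen for illustration. The point is only that two cleaning steps need not commute, because the statistics one step relies on change once the other has run.

```python
from statistics import mean, pstdev

def drop_duplicates(values):
    """Remove repeated values, keeping first occurrences in order (hypothetical step)."""
    return list(dict.fromkeys(values))

def trim_tail(values, z_max=2.0):
    """Drop values whose z-score exceeds z_max (hypothetical stand-in for tail cleaning)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if (v - mu) / sigma <= z_max]

# Made-up metric values with many duplicates and one large value.
loc = [10, 10, 10, 10, 10, 10, 12, 11, 30]

# Order A: trim the tail first, then remove duplicates.
trim_then_dedupe = drop_duplicates(trim_tail(loc))

# Order B: remove duplicates first, then trim the tail.
dedupe_then_trim = trim_tail(drop_duplicates(loc))

print(trim_then_dedupe)
print(dedupe_then_trim)
```

With the duplicates present, the value 30 lies far from a tightly clustered mean and is trimmed; once the duplicates are removed first, the spread widens and the same value passes the threshold. The two orders therefore leave different data for a downstream model.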


Antipatterns in Software Classification Taxonomies

arXiv.org Artificial Intelligence

Empirical results in software engineering have long shown that findings are unlikely to apply to all software systems or to any domain: results need to be evaluated in specified contexts and limited to the type of systems they were extracted from. This is a known issue, and it requires establishing a classification of software types. This paper makes two contributions: the first is to evaluate the quality of the current landscape of software classifications. The second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts are doomed to fail because of one or more issues, which we named the 'antipatterns' of software classification tasks. We collected 7 of these antipatterns, which emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.


Design patterns in machine learning - KDnuggets

#artificialintelligence

According to its definition, a design pattern is a reusable solution to a commonly occurring problem. In software engineering, the concept dates back to 1987 when Beck and Cunningham started to apply it to programming. By the 2000s, design patterns -- especially the SOLID design principles for OOP -- were considered common knowledge to programmers. Fast forward 15 years and we arrive at the era of Software 2.0: machine learning models start to replace classical functions in more and more places of code. Today, we look at software as a fusion of traditional code, machine learning models and the underlying data.


Using AntiPatterns to avoid MLOps Mistakes

#artificialintelligence

Hyper-parameter values are often significant drivers of model performance; they are expensive to tune and mostly task-specific. Hyper-parameters play such a crucial role in modeling architectures that entire research efforts are devoted to developing efficient hyper-parameter search strategies (Bergstra et al., 2013; Nguyen et al., 2019; Henderson et al., 2018; Van Rijn and Hutter, 2018; Probst et al., 2019). The set of hyper-parameters differs across learning algorithms. For instance, even a simple classification model like the decision tree classifier has hyper-parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the criterion used to evaluate a split: either the impurity at a node (gini) or the information gain (entropy). Ensemble models like random forest classifiers and gradient boosting machines have additional hyper-parameters governing the number of estimators (trees) to include in the model.
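To make the two split criteria mentioned above concrete, here is a minimal sketch in plain Python using the standard textbook definitions of Gini impurity and entropy-based information gain. The function names and the sample labels are illustrative assumptions, not taken from the article or from any particular library.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels at a node, and a candidate split that separates them perfectly.
parent = ["bug", "bug", "ok", "ok"]
left, right = ["bug", "bug"], ["ok", "ok"]

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A decision tree learner evaluates candidate splits with one of these measures, and the choice between them is itself a hyper-parameter, alongside the depth and minimum-sample limits.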


Using Machine Learning in Testing and Maintenance

#artificialintelligence

With machine learning, we can reduce maintenance efforts and improve the quality of products. It can be used in various stages of the software testing life-cycle, including bug management, which is an important part of the chain. We can analyze large amounts of data for classifying, triaging, and prioritizing bugs in a more efficient way by means of machine learning algorithms. Mesut Durukal, a test automation engineer at Rapyuta Robotics, spoke at Aginext 2021 about using machine learning in testing. Durukal uses machine learning to classify and cluster bugs.


Characterizing Technical Debt and Antipatterns in AI-Based Systems: A Systematic Mapping Study

arXiv.org Artificial Intelligence

Background: With the rising popularity of Artificial Intelligence (AI), there is a growing need to build large and complex AI-based systems in a cost-effective and manageable way. Like with traditional software, Technical Debt (TD) will emerge naturally over time in these systems, therefore leading to challenges and risks if not managed appropriately. The influence of data science and the stochastic nature of AI-based systems may also lead to new types of TD or antipatterns, which are not yet fully understood by researchers and practitioners. Objective: The goal of our study is to provide a clear overview and characterization of the types of TD (both established and new ones) that appear in AI-based systems, as well as the antipatterns and related solutions that have been proposed. Method: Following the process of a systematic mapping study, 21 primary studies are identified and analyzed. Results: Our results show that (i) established TD types, variations of them, and four new TD types (data, model, configuration, and ethics debt) are present in AI-based systems, (ii) 72 antipatterns are discussed in the literature, the majority related to data and model deficiencies, and (iii) 46 solutions have been proposed, either to address specific TD types, antipatterns, or TD in general. Conclusions: Our results can support AI professionals with reasoning about and communicating aspects of TD present in their systems. Additionally, they can serve as a foundation for future research to further our understanding of TD in AI-based systems.


Language Models for Lexical Inference in Context

arXiv.org Artificial Intelligence

Lexical inference (LI) denotes the task of deciding whether or not an entailment relation holds between two lexical items. It is therefore related to the detection of other lexical relations like hyponymy between nouns (Hearst, 1992), e.g., dog → animal, or troponymy between verbs (Fellbaum and Miller, 1990), e.g., to traipse → to walk. Lexical inference in context (LIiC) adds the problem of disambiguating the pair of lexical items in a given context before reasoning about the inference question.

Recently, transfer learning has become ubiquitous in NLP; Transformer (Vaswani et al., 2017) language models (LMs) pretrained on large amounts of textual data (Devlin et al., 2019a; Liu et al., 2019) form the basis of a lot of current state-of-the-art models. Besides zero- and few-shot capabilities (Radford et al., 2019; Brown et al., 2020), pretrained LMs have also been found to acquire factual and relational knowledge during pretraining.


"Dumb intelligence" or getting wrong results with machine learning

#artificialintelligence

It's unusual to hear cats mentioned in a presentation about machine learning. But they actually have more in common than you would think. Ismail Elouafiq, a Data Scientist at SVT, has drawn an ingenious association between cats and machine learning systems. Ismail's talk at the Nordic Data Science and Machine Learning Summit is a great overview of a common problem that occurs in machine learning: machine learning antipatterns. "Imagine you are working in a cat hospital", starts Ismail, and you admitted 132 cats that are victims of jumping out of the window.