Performance Analysis
On Wasted Contributions: Understanding the Dynamics of Contributor-Abandoned Pull Requests
Khatoonabadi, SayedHassan, Costa, Diego Elias, Abdalkareem, Rabe, Shihab, Emad
Pull-based development has enabled numerous volunteers to contribute to open-source projects with fewer barriers. Nevertheless, a considerable amount of pull requests (PRs) with valid contributions are abandoned by their contributors, wasting the effort and time put in by both the contributors and maintainers. To better understand the underlying dynamics of contributor-abandoned PRs, we conduct a mixed-methods study using both quantitative and qualitative methods. We curate a dataset consisting of 265,325 PRs including 4,450 abandoned ones from ten popular and mature GitHub projects and measure 16 features characterizing PRs, contributors, review processes, and projects. Using statistical and machine learning techniques, we find that complex PRs, novice contributors, and lengthy reviews have a higher probability of abandonment and the rate of PR abandonment fluctuates alongside the projects' maturity or workload. To identify why contributors abandon their PRs, we also manually examine a random sample of 354 abandoned PRs. We observe that the most frequent abandonment reasons are related to the obstacles faced by contributors, followed by the hurdles imposed by maintainers during the review process. Finally, we survey the top core maintainers of the studied projects to understand their perspectives on dealing with PR abandonment and on our findings.
Elon Musk's Twitter Bot Problem Is Fake News
With his professed concern about fake accounts on Twitter, Elon Musk appears to be grasping at legal straws in an attempt to back out of his commitment to buy the social networking company for $54.20 a share, or at least to pay less for it. But his gambit has shined a light on a real scourge of online companies and their users. Counting the autonomous accounts that mimic real people is just as slippery as valuing companies. A 2020 study by Adrian Rauchfleisch and Jonas Kaiser looking at thousands of Twitter accounts, including hundreds of verified politicians as well as "obvious" bots, found Botometer, the industry-standard learning algorithm trained to calculate the likelihood an account is a bot, yields imprecise scores leading to both false negatives and false positives.
Classification Auto-Encoder based Detector against Diverse Data Poisoning Attacks
Poisoning attacks are a category of adversarial machine learning threats in which an adversary attempts to subvert the outcome of the machine learning systems by injecting crafted data into training data set, thus increasing the machine learning model's test error. The adversary can tamper with the data feature space, data labels, or both, each leading to a different attack strategy with different strengths. Various detection approaches have recently emerged, each focusing on one attack strategy. The Achilles heel of many of these detection approaches is their dependence on having access to a clean, untampered data set. In this paper, we propose CAE, a Classification Auto-Encoder based detector against diverse poisoned data. CAE can detect all forms of poisoning attacks using a combination of reconstruction and classification errors without having any prior knowledge of the attack strategy. We show that an enhanced version of CAE (called CAE+) does not have to employ a clean data set to train the defense model. Our experimental results on three real datasets MNIST, Fashion-MNIST and CIFAR demonstrate that our proposed method can maintain its functionality under up to 30% contaminated data and help the defended SVM classifier to regain its best accuracy.
Decision Making for Hierarchical Multi-label Classification with Multidimensional Local Precision Rate
Ye, Yuting, Ho, Christine, Jiang, Ci-Ren, Lee, Wayne Tai, Huang, Haiyan
Hierarchical multi-label classification (HMC) has drawn increasing attention in the past few decades. It is applicable when hierarchical relationships among classes are available and need to be incorporated along with the multi-label classification whereby each object is assigned to one or more classes. There are two key challenges in HMC: i) optimizing the classification accuracy, and meanwhile ii) ensuring the given class hierarchy. To address these challenges, in this article, we introduce a new statistic called the multidimensional local precision rate (mLPR) for each object in each class. We show that classification decisions made by simply sorting objects across classes in descending order of their true mLPRs can, in theory, ensure the class hierarchy and lead to the maximization of CATCH, an objective function we introduce that is related to the area under a hit curve. This approach is the first of its kind that handles both challenges in one objective function without additional constraints, thanks to the desirable statistical properties of CATCH and mLPR. In practice, however, true mLPRs are not available. In response, we introduce HierRank, a new algorithm that maximizes an empirical version of CATCH using estimated mLPRs while respecting the hierarchy. The performance of this approach was evaluated on a synthetic data set and two real data sets; ours was found to be superior to several comparison methods on evaluation criteria based on metrics such as precision, recall, and $F_1$ score.
Quantitative Discourse Cohesion Analysis of Scientific Scholarly Texts using Multilayer Networks
Bhatnagar, Vasudha, Duari, Swagata, Gupta, S. K.
Discourse cohesion facilitates text comprehension and helps the reader form a coherent narrative. In this study, we aim to computationally analyze the discourse cohesion in scientific scholarly texts using multilayer network representation and quantify the writing quality of the document. Exploiting the hierarchical structure of scientific scholarly texts, we design section-level and document-level metrics to assess the extent of lexical cohesion in text. We use a publicly available dataset along with a curated set of contrasting examples to validate the proposed metrics by comparing them against select indices computed using existing cohesion analysis tools. We observe that the proposed metrics correlate as expected with the existing cohesion indices. We also present an analytical framework, CHIAA (CHeck It Again, Author), to provide pointers to the author for potential improvements in the manuscript with the help of the section-level and document-level metrics. The proposed CHIAA framework furnishes a clear and precise prescription to the author for improving writing by localizing regions in text with cohesion gaps. We demonstrate the efficacy of CHIAA framework using succinct examples from cohesion-deficient text excerpts in the experimental dataset.
Causal discovery for observational sciences using supervised machine learning
Petersen, Anne Helby, Ramsey, Joseph, Ekstrøm, Claus Thorn, Spirtes, Peter
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.
The "Hello World" of Tensorflow - KDnuggets
Tensorflow is an open-source end-to-end machine learning framework that makes it easy to train and deploy the model. It consists of two words - tensor and flow. A tensor is a vector or a multidimensional array that is a standard way of representing the data in deep learning models. Flow implies how the data moves through a graph by undergoing the operations called nodes. It is used for numerical computation and large-scale machine learning by bundling various algorithms together.
Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks
García-Jara, Germán, Protopapas, Pavlos, Estévez, Pablo A.
Due to the latest advances in technology, telescopes with significant sky coverage will produce millions of astronomical alerts per night that must be classified both rapidly and automatically. Currently, classification consists of supervised machine learning algorithms whose performance is limited by the number of existing annotations of astronomical objects and their highly imbalanced class distributions. In this work, we propose a data augmentation methodology based on Generative Adversarial Networks (GANs) to generate a variety of synthetic light curves from variable stars. Our novel contributions, consisting of a resampling technique and an evaluation metric, can assess the quality of generative models in unbalanced datasets and identify GAN-overfitting cases that the Fr\'echet Inception Distance does not reveal. We applied our proposed model to two datasets taken from the Catalina and Zwicky Transient Facility surveys. The classification accuracy of variable stars is improved significantly when training with synthetic data and testing with real data with respect to the case of using only real data.
Machine Learning to Support Triage of Children at Risk for Epileptic Seizures in the Pediatric Intensive Care Unit
Azriel, Raphael, Hahn, Cecil D., De Cooman, Thomas, Van Huffel, Sabine, Payne, Eric T., McBain, Kristin L., Eytan, Danny, Behar, Joachim A.
Objective: Epileptic seizures are relatively common in critically-ill children admitted to the pediatric intensive care unit (PICU) and thus serve as an important target for identification and treatment. Most of these seizures have no discernible clinical manifestation but still have a significant impact on morbidity and mortality. Children that are deemed at risk for seizures within the PICU are monitored using continuous-electroencephalogram (cEEG). cEEG monitoring cost is considerable and as the number of available machines is always limited, clinicians need to resort to triaging patients according to perceived risk in order to allocate resources. This research aims to develop a computer aided tool to improve seizures risk assessment in critically-ill children, using an ubiquitously recorded signal in the PICU, namely the electrocardiogram (ECG). Approach: A novel data-driven model was developed at a patient-level approach, based on features extracted from the first hour of ECG recording and the clinical data of the patient. Main results: The most predictive features were the age of the patient, the brain injury as coma etiology and the QRS area. For patients without any prior clinical data, using one hour of ECG recording, the classification performance of the random forest classifier reached an area under the receiver operating characteristic curve (AUROC) score of 0.84. When combining ECG features with the patients clinical history, the AUROC reached 0.87. Significance: Taking a real clinical scenario, we estimated that our clinical decision support triage tool can improve the positive predictive value by more than 59% over the clinical standard.
Assessing Streamline Plausibility Through Randomized Iterative Spherical-Deconvolution Informed Tractogram Filtering
Hain, Antonia, Jörgens, Daniel, Moreno, Rodrigo
Tractography has become an indispensable part of brain connectivity studies. However, it is currently facing problems with reliability. In particular, a substantial amount of nerve fiber reconstructions (streamlines) in tractograms produced by state-of-the-art tractography methods are anatomically implausible. To address this problem, tractogram filtering methods have been developed to remove faulty connections in a postprocessing step. This study takes a closer look at one such method, \textit{Spherical-deconvolution Informed Filtering of Tractograms} (SIFT), which uses a global optimization approach to improve the agreement between the remaining streamlines after filtering and the underlying diffusion magnetic resonance imaging data. SIFT is not suitable to judge the plausibility of individual streamlines since its results depend on the size and composition of the surrounding tractogram. To tackle this problem, we propose applying SIFT to randomly selected tractogram subsets in order to retrieve multiple assessments for each streamline. This approach makes it possible to identify streamlines with very consistent filtering results, which were used as pseudo ground truths for training classifiers. The trained classifier is able to distinguish the obtained groups of plausible and implausible streamlines with accuracy above 80%. The software code used in the paper and pretrained weights of the classifier are distributed freely via the Github repository https://github.com/djoerch/randomised_filtering.