Performance Analysis
On the Limitations of Continual Learning for Malware Classification
Rahman, Mohammad Saidur, Coull, Scott E., Wright, Matthew
Malicious software (malware) classification offers a unique challenge for continual learning (CL) regimes due to the volume of new samples received on a daily basis and the evolution of malware to exploit new vulnerabilities. On a typical day, antivirus vendors receive hundreds of thousands of unique pieces of software, both malicious and benign, and over the course of the lifetime of a malware classifier, more than a billion samples can easily accumulate. Given the scale of the problem, sequential training using continual learning techniques could provide substantial benefits in reducing training and storage overhead. To date, however, there has been no exploration of CL applied to malware classification tasks. In this paper, we study 11 CL techniques applied to three malware tasks covering common incremental learning scenarios, including task, class, and domain incremental learning (IL). Specifically, using two realistic, large-scale malware datasets, we evaluate the performance of the CL methods on both binary malware classification (Domain-IL) and multi-class malware family classification (Task-IL and Class-IL) tasks. To our surprise, continual learning methods significantly underperformed naive Joint replay of the training data in nearly all settings -- in some cases reducing accuracy by more than 70 percentage points. A simple approach of selectively replaying 20% of the stored data achieves better performance, with 50% of the training time compared to Joint replay. Finally, we discuss potential reasons for the unexpectedly poor performance of the CL techniques, with the hope that it spurs further research on developing techniques that are more effective in the malware classification domain.
NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction
Ding, Yi, Rao, Avinash, Song, Hyebin, Willett, Rebecca, Hoffmann, Henry
Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes when all its tasks finish, so stragglers -- rare, yet extremely slow tasks -- are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. While much prior work applies machine learning to predict computer system performance, these approaches rely on complete labels -- i.e., sufficient examples of all possible behaviors, including straggling and non-straggling -- or strong assumptions about the underlying latency distributions -- e.g., whether Gaussian or not. Within a running job, however, none of this information is available until stragglers have revealed themselves when they have already delayed the job. To predict stragglers accurately and early without labeled positive examples or assumptions on latency distributions, this paper presents NURD, a novel Negative-Unlabeled learning approach with Reweighting and Distribution-compensation that only trains on negative and unlabeled streaming data. The key idea is to train a predictor using finished tasks of non-stragglers to predict latency for unlabeled running tasks, and then reweight each unlabeled task's prediction based on a weighting function of its feature space. We evaluate NURD on two production traces from Google and Alibaba, and find that compared to the best baseline approach, NURD produces 2--11 percentage point increases in the F1 score in terms of prediction accuracy, and 2.0--8.8 percentage point improvements in job completion time.
Succinct Differentiation of Disparate Boosting Ensemble Learning Methods for Prognostication of Polycystic Ovary Syndrome Diagnosis
Gupta, Abhishek, Shetty, Sannidhi, Joshi, Raunak, Laban, Ronald Melwin
The most common gynecological disorder affecting women globally is known as polycystic ovary syndrome (PCOS). The symptoms of PCOS include irregular periods, hirsutism, thinning hair and hair loss over head, oily skin or acne and weight gain. PCOS can lead to risk in later life with a lifelong situation that causes a person's blood sugar levels to promote type-II diabetes. High blood pressure and high cholesterol which can lead to heart stroke, overweight ladies may expand sleep apnoea, a situation that causes interrupted breathing at some stage in sleep. Around 10 - 15% of reproductive age (15 to 49 years) of women suffer from this. The monetary expenses of this disease and its comorbidities need the development of instruments and techniques so one can permit for early and precise identification. To cope with this problem this paper proposes a system for the early detection and prediction of PCOS from the most reliable and minimal and promising scientific and metabolic parameters, which is early detection for these diseases. Machine Learning[Shinde and Shah, 2018] can be leveraged to perform prognostication of PCOS that exigently extracts factual records from the given statistics considering the fact that machine learning is better known as glorified statistics. A specific type of machine learning algorithm that seeks to improve the overall performance by combining the predictions from more than one model which is a trendy meta method is known as an Ensemble Learning Approach.
SNGuess: A method for the selection of young extragalactic transients
Miranda, N., Freytag, J. C., Nordin, J., Biswas, R., Brinnel, V., Fremling, C., Kowalski, M., Mahabal, A., Reusch, S., van Santen, J.
With a rapidly rising number of transients detected in astronomy, classification methods based on machine learning are increasingly being employed. Their goals are typically to obtain a definitive classification of transients, and for good performance they usually require the presence of a large set of observations. However, well-designed, targeted models can reach their classification goals with fewer computing resources. This paper presents SNGuess, a model designed to find young extragalactic nearby transients with high purity. SNGuess works with a set of features that can be efficiently calculated from astronomical alert data. Some of these features are static and associated with the alert metadata, while others must be calculated from the photometric observations contained in the alert. Most of the features are simple enough to be obtained or to be calculated already at the early stages in the lifetime of a transient after its detection. We calculate these features for a set of labeled public alert data obtained over a time span of 15 months from the Zwicky Transient Facility (ZTF). The core model of SNGuess consists of an ensemble of decision trees, which are trained via gradient boosting. Approximately 88% of the candidates suggested by SNGuess from a set of alerts from ZTF spanning from April 2020 to August 2021 were found to be true relevant supernovae (SNe). For alerts with bright detections, this number ranges between 92% and 98%. Since April 2020, transients identified by SNGuess as potential young SNe in the ZTF alert stream are being published to the Transient Name Server (TNS) under the AMPEL_ZTF_NEW group identifier. SNGuess scores for any transient observed by ZTF can be accessed via a web service. The source code of SNGuess is publicly available.
Gradient Estimation for Binary Latent Variables via Gradient Variance Clipping
Kunes, Russell Z., Yin, Mingzhang, Land, Max, Haviv, Doron, Pe'er, Dana, Tavarรฉ, Simon
Gradient estimation is often necessary for fitting generative models with discrete latent variables, in contexts such as reinforcement learning and variational autoencoder (VAE) training. The DisARM estimator (Yin et al. 2020; Dong, Mnih, and Tucker 2020) achieves state of the art gradient variance for Bernoulli latent variable models in many contexts. However, DisARM and other estimators have potentially exploding variance near the boundary of the parameter space, where solutions tend to lie. To ameliorate this issue, we propose a new gradient estimator \textit{bitflip}-1 that has lower variance at the boundaries of the parameter space. As bitflip-1 has complementary properties to existing estimators, we introduce an aggregated estimator, \textit{unbiased gradient variance clipping} (UGC) that uses either a bitflip-1 or a DisARM gradient update for each coordinate. We theoretically prove that UGC has uniformly lower variance than DisARM. Empirically, we observe that UGC achieves the optimal value of the optimization objectives in toy experiments, discrete VAE training, and in a best subset selection problem.
Predicting Electricity Infrastructure Induced Wildfire Risk in California
Yao, Mengqi, Bharadwaj, Meghana, Zhang, Zheng, Jin, Baihong, Callaway, Duncan S.
This paper examines the use of risk models to predict the timing and location of wildfires caused by electricity infrastructure. Our data include historical ignition and wire-down points triggered by grid infrastructure collected between 2015 to 2019 in Pacific Gas & Electricity territory along with various weather, vegetation, and very high resolution data on grid infrastructure including location, age, materials. With these data we explore a range of machine learning methods and strategies to manage training data imbalance. The best area under the receiver operating characteristic we obtain is 0.776 for distribution feeder ignitions and 0.824 for transmission line wire-down events, both using the histogram-based gradient boosting tree algorithm (HGB) with under-sampling. We then use these models to identify which information provides the most predictive value. After line length, we find that weather and vegetation features dominate the list of top important features for ignition or wire-down risk. Distribution ignition models show more dependence on slow-varying vegetation variables such as burn index, energy release content, and tree height, whereas transmission wire-down models rely more on primary weather variables such as wind speed and precipitation. These results point to the importance of improved vegetation modeling for feeder ignition risk models, and improved weather forecasting for transmission wire-down models. We observe that infrastructure features make small but meaningful improvements to risk model predictive power.
Top Gear or Black Mirror: Inferring Political Leaning From Non-Political Content
Polarization and echo chambers are often studied in the context of explicitly political events such as elections, and little scholarship has examined the mixing of political groups in non-political contexts. A major obstacle to studying political polarization in non-political contexts is that political leaning (i.e., left vs right orientation) is often unknown. Nonetheless, political leaning is known to correlate (sometimes quite strongly) with many lifestyle choices leading to stereotypes such as the "latte-drinking liberal." We develop a machine learning classifier to infer political leaning from non-political text and, optionally, the accounts a user follows on social media. We use Voter Advice Application results shared on Twitter as our groundtruth and train and test our classifier on a Twitter dataset comprising the 3,200 most recent tweets of each user after removing any tweets with political text. We correctly classify the political leaning of most users (F1 scores range from 0.70 to 0.85 depending on coverage). We find no relationship between the level of political activity and our classification results. We apply our classifier to a case study of news sharing in the UK and discover that, in general, the sharing of political news exhibits a distinctive left-right divide while sports news does not.
Seeing your sleep stage: cross-modal distillation from EEG to infrared video
Han, Jianan, Zhang, Shaoxing, Men, Aidong, Liu, Yang, Yao, Ziming, Yan, Yan, Chen, Qingchao
It is inevitably crucial to classify sleep stage for the diagnosis of various diseases. However, existing automated diagnosis methods mostly adopt the "gold-standard" lectroencephalogram (EEG) or other uni-modal sensing signal of the PolySomnoGraphy (PSG) machine in hospital, that are expensive, importable and therefore unsuitable for point-of-care monitoring at home. To enable the sleep stage monitoring at home, in this paper, we analyze the relationship between infrared videos and the EEG signal and propose a new task: to classify the sleep stage using infrared videos by distilling useful knowledge from EEG signals to the visual ones. To establish a solid cross-modal benchmark for this application, we develop a new dataset termed as Seeing your Sleep Stage via Infrared Video and EEG ($S^3VE$). $S^3VE$ is a large-scale dataset including synchronized infrared video and EEG signal for sleep stage classification, including 105 subjects and 154,573 video clips that is more than 1100 hours long. Our contributions are not limited to datasets but also about a novel cross-modal distillation baseline model namely the structure-aware contrastive distillation (SACD) to distill the EEG knowledge to infrared video features. The SACD achieved the state-of-the-art performances on both our $S^3VE$ and the existing cross-modal distillation benchmark. Both the benchmark and the baseline methods will be released to the community. We expect to raise more attentions and promote more developments in the sleep stage classification and more importantly the cross-modal distillation from clinical signal/media to the conventional media.
Goodness of Fit Metrics for Multi-class Predictor
The multi-class prediction had gained popularity over recent years. Thus measuring fit goodness becomes a cardinal question that researchers often have to deal with. Several metrics are commonly used for this task. However, when one has to decide about the right measurement, he must consider that different use-cases impose different constraints that govern this decision. A leading constraint at least in \emph{real world} multi-class problems is imbalanced data: Multi categorical problems hardly provide symmetrical data. Hence, when we observe common KPIs (key performance indicators), e.g., Precision-Sensitivity or Accuracy, one can seldom interpret the obtained numbers into the model's actual needs. We suggest generalizing Matthew's correlation coefficient into multi-dimensions. This generalization is based on a geometrical interpretation of the generalized confusion matrix.
On deceiving malware classification with section injection
da Silva, Adeilson Antonio, Segundo, Mauricio Pamplona
We investigate how to modify executable files to deceive malware classification systems. This work's main contribution is a methodology to inject bytes across a malware file randomly and use it both as an attack to decrease classification accuracy but also as a defensive method, augmenting the data available for training. It respects the operating system file format to make sure the malware will still execute after our injection and will not change its behavior. We reproduced five state-of-the-art malware classification approaches to evaluate our injection scheme: one based on GIST+KNN, three CNN variations and one Gated CNN. We performed our experiments on a public dataset with 9,339 malware samples from 25 different families. Our results show that a mere increase of 7% in the malware size causes an accuracy drop between 25% and 40% for malware family classification. They show that a automatic malware classification system may not be as trustworthy as initially reported in the literature. We also evaluate using modified malwares alongside the original ones to increase networks robustness against mentioned attacks. Results show that a combination of reordering malware sections and injecting random data can improve overall performance of the classification. Code available at https://github.com/adeilsonsilva/malware-injection.