Goto

Collaborating Authors

 Performance Analysis


Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security

arXiv.org Machine Learning

Despite the potential of Machine learning (ML) to learn the behavior of malware, detect novel malware samples, and significantly improve information security (InfoSec) we see few, if any, high-impact ML techniques in deployed systems, notwithstanding multiple reported successes in open literature. We hypothesize that the failure of ML in making high-impacts in InfoSec are rooted in a disconnect between the two communities as evidenced by a semantic gap---a difference in how executables are described (e.g. the data and features extracted from the data). Specifically, current datasets and representations used by ML are not suitable for learning the behaviors of an executable and differ significantly from those used by the InfoSec community. In this paper, we survey existing datasets used for classifying malware by ML algorithms and the features that are extracted from the data. We observe that: 1) the current set of extracted features are primarily syntactic, not behavioral, 2) datasets generally contain extreme exemplars producing a dataset in which it is easy to discriminate classes, and 3) the datasets provide significantly different representations of the data encountered in real-world systems. For ML to make more of an impact in the InfoSec community requires a change in the data (including the features and labels) that is used to bridge the current semantic gap. As a first step in enabling more behavioral analyses, we label existing malware datasets with behavioral features using open-source threat reports associated with malware families. This behavioral labeling alters the analysis from identifying intent (e.g. good vs bad) or malware family membership to an analysis of which behaviors are exhibited by an executable. We offer the annotations with the hope of inspiring future improvements in the data that will further bridge the semantic gap between the ML and InfoSec communities.


Ensemble Learning of Coarse-Grained Molecular Dynamics Force Fields with a Kernel Approach

arXiv.org Machine Learning

Gradient-domain machine learning (GDML) is an accurate and efficient approach to learn a molecular potential and associated force field based on the kernel ridge regression algorithm. Here, we demonstrate its application to learn an effective coarse-grained (CG) model from all-atom simulation data in a sample efficient manner. The coarse-grained force field is learned by following the thermodynamic consistency principle, here by minimizing the error between the predicted coarse-grained force and the all-atom mean force in the coarse-grained coordinates. Solving this problem by GDML directly is impossible because coarse-graining requires averaging over many training data points, resulting in impractical memory requirements for storing the kernel matrices. In this work, we propose a data-efficient and memory-saving alternative. Using ensemble learning and stratified sampling, we propose a 2-layer training scheme that enables GDML to learn an effective coarse-grained model. We illustrate our method on a simple biomolecular system, alanine dipeptide, by reconstructing the free energy landscape of a coarse-grained variant of this molecule. Our novel GDML training scheme yields a smaller free energy error than neural networks when the training set is small, and a comparably high accuracy when the training set is sufficiently large.


Tests in recovered patients found false positives, not reinfections, experts say

National Geographic

South Korea's infectious disease experts said Thursday that dead virus fragments were the likely cause of over 260 people here testing positive again for the novel coronavirus days and even weeks after marking full recoveries. Oh Myoung-don, who leads the central clinical committee for emerging disease control, said the committee members found little reason to believe that those cases could be COVID-19 reinfections or reactivations, which would have made global efforts to contain the virus much more daunting. "The tests detected the ribonucleic acid of the dead virus," said Oh, a Seoul National University hospital doctor, at a press conference Thursday held at the National Medical Center. He went on to explain that in PCR tests, or polymerase chain reaction tests, used for COVID-19 diagnosis, genetic materials of the virus amplify during testing, whether it is from a live virus or just from fragments of dead virus cells that can take months to clear from recovered patients. The PCR tests cannot distinguish whether the virus is alive or dead, he added, and this can lead to false positives.


Propensity Modelling - Using h2o and DALEX to Estimate the Likelihood of Purchasing a Financial Product - Part 3 of 3 - Optimise Profit With the Expected Value Framework ยท

#artificialintelligence

In this day and age, a business that leverages data to understand the drivers of its customers' behaviour has a true competitive advantage. Organisations can dramatically improve their performance in the market by analysing customer level data in an effective way and focus their efforts towards those that are more likely to engage. One trialled and tested approach to tease out this type of insight is Propensity Modelling, which combines information such as a customers' demographics (age, race, religion, gender, family size, ethnicity, income, education level), psycho-graphic (social class, lifestyle and personality characteristics), engagement (emails opened, emails clicked, searches on mobile app, webpage dwell time, etc.), user experience (customer service phone and email wait times, number of refunds, average shipping times), and user behaviour (purchase value on different time-scales, number of days since most recent purchase, time between offer and conversion, etc.) to estimate the likelihood of a certain customer profile to performing a certain type of behaviour (e.g. the purchase of a product). Once you understand the probability of a certain customer to interact with your brand, buy a product or a sign up for a service, you can use this information to create scenarios, be it minimising marketing expenditure, maximising acquisition targets, and optimise email send frequency or depth of discount. In this project I'm analysing the results of a bank direct marketing campaign to sell term deposits in order to identify what type of customer is more likely to respond. The marketing campaigns were based on phone calls and more than one contact to the same client was required at times.


Stochastic Neighbor Embedding of Multimodal Relational Data for Image-Text Simultaneous Visualization

arXiv.org Machine Learning

Multimodal relational data analysis has become of increasing importance in recent years, for exploring across different domains of data, such as images and their text tags obtained from social networking services (e.g., Flickr). A variety of data analysis methods have been developed for visualization; to give an example, t-Stochastic Neighbor Embedding (t-SNE) computes low-dimensional feature vectors so that their similarities keep those of the observed data vectors. However, t-SNE is designed only for a single domain of data but not for multimodal data; this paper aims at visualizing multimodal relational data consisting of data vectors in multiple domains with relations across these vectors. By extending t-SNE, we herein propose Multimodal Relational Stochastic Neighbor Embedding (MR-SNE), that (1) first computes augmented relations, where we observe the relations across domains and compute those within each of domains via the observed data vectors, and (2) jointly embeds the augmented relations to a low-dimensional space. Through visualization of Flickr and Animal with Attributes 2 datasets, proposed MR-SNE is compared with other graph embedding-based approaches; MR-SNE demonstrates the promising performance.


Thresholded Adaptive Validation: Tuning the Graphical Lasso for Graph Recovery

arXiv.org Machine Learning

The graphical lasso is the most popular estimator in Gaussian graphical models, but its performance hinges on a regularization parameter that needs to be calibrated to each application at hand. In this paper, we propose a novel calibration scheme for this parameter. The scheme is equipped with theoretical guarantees and motivates a thresholding pipeline that can improve graph recovery. Moreover, requiring at most one line search over the regularization path of the graphical lasso, the calibration scheme is computationally more efficient than competing schemes that are based on resampling. Finally, we show in simulations that our approach can improve on the graph recovery of other approaches considerably.


Automatic Catalog of RRLyrae from $\sim$ 14 million VVV Light Curves: How far can we go with traditional machine-learning?

arXiv.org Machine Learning

The creation of a 3D map of the bulge using RRLyrae (RRL) is one of the main goals of the VVV(X) surveys. The overwhelming number of sources under analysis request the use of automatic procedures. In this context, previous works introduced the use of Machine Learning (ML) methods for the variable star classification. Our goal is the development and analysis of an automatic procedure, based on ML, for the identification of RRLs in the VVV Survey. This procedure will be use to generate reliable catalogs integrated over several tiles in the survey. After the reconstruction of light-curves, we extract a set of period and intensity-based features. We use for the first time a new subset of pseudo color features. We discuss all the appropriate steps needed to define our automatic pipeline: selection of quality measures; sampling procedures; classifier setup and model selection. As final result, we construct an ensemble classifier with an average Recall of 0.48 and average Precision of 0.86 over 15 tiles. We also make available our processed datasets and a catalog of candidate RRLs. Perhaps most interestingly, from a classification perspective based on photometric broad-band data, is that our results indicate that Color is an informative feature type of the RRL that should be considered for automatic classification methods via ML. We also argue that Recall and Precision in both tables and curves are high quality metrics for this highly imbalanced problem. Furthermore, we show for our VVV data-set that to have good estimates it is important to use the original distribution more than reduced samples with an artificial balance. Finally, we show that the use of ensemble classifiers helps resolve the crucial model selection step, and that most errors in the identification of RRLs are related to low quality observations of some sources or to the difficulty to resolve the RRL-C type given the date.


An Imitation Game for Learning Semantic Parsers from User Interaction

arXiv.org Artificial Intelligence

Despite the widely successful applications, bootstrapping and fine-tuning semantic parsers are still a tedious process with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so it also gets to imitate the user behavior and continue improving itself autonomously with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.


Learning to Complement Humans

arXiv.org Artificial Intelligence

A rising vision for AI in the open world centers on the development of systems that can complement humans for perceptual, diagnostic, and reasoning tasks. To date, systems aimed at complementing the skills of people have employed models trained to be as accurate as possible in isolation. We demonstrate how an end-to-end learning strategy can be harnessed to optimize the combined performance of human-machine teams by considering the distinct abilities of people and machines. The goal is to focus machine learning on problem instances that are difficult for humans, while recognizing instances that are difficult for the machine and seeking human input on them. We demonstrate in two real-world domains (scientific discovery and medical diagnosis) that human-machine teams built via these methods outperform the individual performance of machines and people. We then analyze conditions under which this complementarity is strongest, and which training methods amplify it. Taken together, our work provides the first systematic investigation of how machine learning systems can be trained to complement human reasoning.


The False Positives, False Negatives, and Positive Negatives of the Coronavirus

The New Yorker

As the coronavirus began to creep into our lives--but before it came to define them entirely--e-mails from across the world included the cheery phrase "Crazy times!" Messages from friends here in Los Angeles tended to favor something locally sourced, courtesy of Jim Morrison and the Doors: "Strange Days." Strange days, indeed, as we waited for the result of my wife's test, hoping that it would be positive. Can you think of any other illness for which a positive result might be eagerly anticipated? Unable to do anything but lie in bed, she experienced symptoms that were severe by any usual standard but most welcome by the newly enhanced metrics of affliction ushered in by the virus.