
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization


Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads, improving performance or scalability.
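The count featurization idea the abstract refers to can be sketched compactly: each high-cardinality categorical value is replaced by counts of how often it co-occurs with each label, so a model can train on small count vectors instead of raw identifiers. A minimal sketch follows; the function name and toy data are illustrative and not taken from the paper.

```python
from collections import defaultdict

def count_featurize(categories, labels):
    """Replace each categorical value with its per-label co-occurrence counts.

    labels are assumed binary (0 or 1); counts[c][y] counts how often
    category c appeared with label y across the training data.
    """
    counts = defaultdict(lambda: [0, 0])
    for c, y in zip(categories, labels):
        counts[c][y] += 1
    # Each row becomes a 2-element count vector for its category.
    return [counts[c][:] for c in categories]

feats = count_featurize(["a", "a", "b"], [1, 0, 1])
```

In a real pipeline the counts would be computed on a training split only, and often smoothed or normalized, but this shows why the method shrinks the data a model needs to see.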

Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning Machine Learning

Due to its high computational speed and accuracy compared to ab-initio quantum chemistry and forcefield modeling, the prediction of molecular properties using machine learning has received great attention in the fields of materials design and drug discovery. A main ingredient required for machine learning is a training dataset consisting of molecular features (for example, fingerprint bits, chemical descriptors, etc.) that adequately characterize the corresponding molecules. However, choosing features for any application is highly non-trivial. No "universal" method for feature selection exists. In this work, we propose a data fusion framework that uses Independent Vector Analysis to exploit underlying complementary information contained in different molecular featurization methods, bringing us a step closer to automated feature generation. Our approach takes an arbitrary number of individual feature vectors and automatically generates a single, compact (low dimensional) set of molecular features that can be used to enhance the prediction performance of regression models. At the same time our methodology retains the possibility of interpreting the generated features to discover relationships between molecular structures and properties. We demonstrate this on the QM7b dataset for the prediction of several properties such as atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. In addition, we show how our method helps improve the prediction of experimental binding affinities for a set of human BACE-1 inhibitors. To encourage more widespread use of IVA we have developed the PyIVA Python package, an open source code which is available for download on Github.
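The overall workflow the abstract describes (several featurizations of the same molecules fused into one compact feature set for regression) can be illustrated with a stand-in: the sketch below uses PCA as a simple placeholder for the fusion/reduction step, since IVA itself is not shown here, and all data and dimensions are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_molecules = 100

# Two hypothetical featurizations of the same molecules:
# binary fingerprint bits and continuous chemical descriptors.
fingerprints = rng.integers(0, 2, size=(n_molecules, 64)).astype(float)
descriptors = rng.normal(size=(n_molecules, 10))

# Fuse: standardize each block, concatenate, then reduce to one
# compact feature set (PCA here; the paper's method is IVA).
blocks = [StandardScaler().fit_transform(f) for f in (fingerprints, descriptors)]
fused = np.hstack(blocks)
compact = PCA(n_components=8, random_state=0).fit_transform(fused)
```

The compact matrix would then feed a downstream regression model; IVA differs in that it exploits statistical dependence across the feature blocks rather than variance alone.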

Comment on "Predicting reaction performance in C-N cross-coupling using machine learning"


Ahneman et al. (Reports, 13 April 2018) applied machine learning models to predict C–N cross-coupling reaction yields. The models use atomic, electronic, and vibrational descriptors as input features. However, the experimental design is insufficient to distinguish models trained on chemical features from those trained solely on random-valued features in retrospective and prospective test scenarios, thus failing classical controls in machine learning. A recent report by Ahneman et al. (1) describes a machine learning approach for modeling chemical reactions with data collected through ultrahigh-throughput experimentation. The Buchwald-Hartwig coupling (2) is used as a model reaction, with a Glorius interference approach (3) to study reaction poisoning by isoxazole additives.
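The "classical control" the comment invokes is easy to state in code: train the same model once on real descriptors and once on random-valued features of the same shape, and check whether their test performance is actually distinguishable. The sketch below uses synthetic data and a generic random forest, not the paper's reaction dataset or models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200

# Synthetic "chemical descriptors" that genuinely determine the target.
X_real = rng.normal(size=(n, 5))
y = X_real @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=n)

# Control: random-valued features with the same shape, unrelated to y.
X_rand = rng.normal(size=(n, 5))

model = RandomForestRegressor(n_estimators=50, random_state=0)
r2_real = cross_val_score(model, X_real, y, cv=5).mean()
r2_rand = cross_val_score(model, X_rand, y, cv=5).mean()
```

If the real-feature model does not clearly beat the random-feature control, the experiment cannot support claims that the chemical features are what the model learned from.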

How to Get Feature Importances from Any Sklearn Pipeline


Scikit-learn pipelines are a staple of my workflow; I use them in basically every data science project I work on. But easily getting the feature importances out of a pipeline is way more difficult than it needs to be. In this tutorial, I'll walk through how to access individual feature names and their coefficients from a Pipeline. Let's start with a super simple pipeline that applies a single featurization step followed by a classifier. Here we use the excellent datasets Python package to quickly access the IMDB sentiment data.

Correlation Coefficient in Flat Line Model


Say your model is Y = a + bX (for instance, X is time), but you know that b = 0. In short, you are trying to find the best fit for Y = a. In that case, a is of course the average computed over your observations.
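The claim that, with b = 0, the least-squares fit of Y = a is just the sample mean can be checked numerically; the data values below are made up for illustration.

```python
import statistics

# Observations; the model is Y = a + bX with b fixed to 0, i.e. a flat line Y = a.
y = [2.0, 3.0, 2.5, 3.5, 3.0]

# Minimizing sum((y_i - a)^2) over a gives the sample mean.
a = statistics.mean(y)

def sse(level):
    """Sum of squared errors of the flat line Y = level."""
    return sum((yi - level) ** 2 for yi in y)
```

Nudging the level in either direction only increases the squared error, confirming the mean is the least-squares choice.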