data amplification
Data Amplification: A Unified and Competitive Approach to Property Estimation
Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n\sqrt{\log n} samples. This provides off-the-shelf, distribution-independent, ``amplification'' of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n\log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
An information theoretic limit to data amplification
In recent years generative artificial intelligence has been used to create data to support science analysis. For example, Generative Adversarial Networks (GANs) have been trained using Monte Carlo simulated input and then used to generate data for the same problem. This has the advantage that a GAN creates data in a significantly reduced computing time. N training events for a GAN can result in GN generated events with the gain factor, G, being more than one. This appears to violate the principle that one cannot get information for free. This is not the only way to amplify data so this process will be referred to as data amplification which is studied using information theoretic concepts. It is shown that a gain of greater than one is possible whilst keeping the information content of the data unchanged. This leads to a mathematical bound which only depends on the number of generated and training events. This study determines conditions on both the underlying and reconstructed probability distributions to ensure this bound. In particular, the resolution of variables in amplified data is not improved by the process but the increase in sample size can still improve statistical significance. The bound is confirmed using computer simulation and analysis of GAN generated data from the literature.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)
Limits of Machine Learning for Automatic Vulnerability Detection
Recent results of machine learning for automatic vulnerability detection have been very promising indeed: Given only the source code of a function $f$, models trained by machine learning techniques can decide if $f$ contains a security flaw with up to 70% accuracy. But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed to amplify the testing set by injecting semantic preserving changes and found that the model's accuracy significantly drops. In other words, the model uses some unrelated features during classification. In order to increase the robustness of the model, researchers proposed to train on amplified training data, and indeed model accuracy increased to previous levels. In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose (i) a cross validation algorithm, where a semantic preserving transformation is applied during the amplification of either the training set or the testing set, and (ii) the amplification of the testing set with code snippets where the vulnerabilities are fixed. Using 11 transformations, 3 ML techniques, and 2 datasets, we find that the improved robustness only applies to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features for predicting the vulnerabilities in the testing data. Additionally, we find that the trained models are unable to generalize to the modified setting which requires to distinguish vulnerable functions from their patches.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Germany (0.04)
- (3 more...)
Data Amplification: A Unified and Competitive Approach to Property Estimation
Hao, Yi, Orlitsky, Alon, Suresh, Ananda Theertha, Wu, Yihong
Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n\sqrt{\log n} samples. This provides off-the-shelf, distribution-independent, amplification'' of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n\log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
Data Amplification: A Unified and Competitive Approach to Property Estimation
HAO, Yi, Orlitsky, Alon, Suresh, Ananda Theertha, Wu, Yihong
Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n\sqrt{\log n} samples. This provides off-the-shelf, distribution-independent, ``amplification'' of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n\log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > San Diego County > La Jolla (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
Data Amplification: A Unified and Competitive Approach to Property Estimation
HAO, Yi, Orlitsky, Alon, Suresh, Ananda Theertha, Wu, Yihong
Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive, property estimator that for a wide class of properties and for all underlying distributions uses just 2n samples to achieve the performance attained by the empirical estimator with n\sqrt{\log n} samples. This provides off-the-shelf, distribution-independent, ``amplification'' of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with n samples is even as good as that of the empirical estimator with n\log n samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > San Diego County > La Jolla (0.04)
- (5 more...)
Learning Many Related Tasks at the Same Time with Backpropagation
Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mustafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domainspecific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multi task backprop generalizes better in real domains.
Learning Many Related Tasks at the Same Time with Backpropagation
Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mustafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domainspecific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multi task backprop generalizes better in real domains.
Learning Many Related Tasks at the Same Time with Backpropagation
Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mustafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domainspecific informationcan be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multitask backprop generalizes better in real domains.