AITopics | data amplification

Data Amplification: A Unified and Competitive Approach to Property Estimation

Neural Information Processing Systems | Nov-20-2025, 22:46:42 GMT

Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive property estimator that, for a wide class of properties and for all underlying distributions, uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
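
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of an empirical (plug-in) estimator: it evaluates the property on the observed symbol frequencies. It is not the authors' amplified estimator. Shannon entropy and support size are chosen here as example properties (an assumption; the abstract does not name specific properties), and all function names are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Return the empirical distribution: symbol -> observed frequency."""
    counts = Counter(samples)
    n = len(samples)
    return {symbol: count / n for symbol, count in counts.items()}


def empirical_entropy(samples):
    """Plug-in Shannon entropy (in nats) of the empirical distribution.

    Entropy is an assumed example property, not one taken from the abstract.
    """
    p_hat = empirical_distribution(samples)
    return -sum(p * math.log(p) for p in p_hat.values())


def empirical_support_size(samples):
    """Plug-in support size: the number of distinct symbols observed."""
    return len(set(samples))
```

The plug-in approach ignores symbols that were never observed, which is one reason it can need more samples than an estimator designed to correct for that.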

data amplification, estimator, unified and competitive approach, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

An information theoretic limit to data amplification

Watts, S. J., Crow, L.

arXiv.org Machine Learning | Dec-23-2024

In recent years, generative artificial intelligence has been used to create data to support science analysis. For example, Generative Adversarial Networks (GANs) have been trained on Monte Carlo simulated input and then used to generate data for the same problem. This has the advantage that a GAN creates data in significantly reduced computing time. $N$ training events for a GAN can result in $GN$ generated events, with the gain factor $G$ being greater than one. This appears to violate the principle that one cannot get information for free. Because this is not the only way to amplify data, the process is referred to as data amplification and is studied using information-theoretic concepts. It is shown that a gain greater than one is possible whilst keeping the information content of the data unchanged. This leads to a mathematical bound that depends only on the number of generated and training events. The study determines conditions on both the underlying and reconstructed probability distributions that ensure this bound holds. In particular, the resolution of variables in amplified data is not improved by the process, but the increase in sample size can still improve statistical significance. The bound is confirmed using computer simulation and analysis of GAN-generated data from the literature.

artificial intelligence, entropy, machine learning, (19 more...)

arXiv.org Machine Learning

2412.18041

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Greater Manchester > Manchester (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Limits of Machine Learning for Automatic Vulnerability Detection

Risse, Niklas, Böhme, Marcel

arXiv.org Artificial Intelligence | Jun-28-2023

Recent results of machine learning for automatic vulnerability detection have been very promising: given only the source code of a function $f$, models trained by machine learning techniques can decide if $f$ contains a security flaw with up to 70% accuracy. But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed to amplify the testing set by injecting semantic-preserving changes and found that the model's accuracy drops significantly. In other words, the model uses some unrelated features during classification. To increase the robustness of the model, researchers proposed to train on amplified training data, and indeed model accuracy increased to previous levels. In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose (i) a cross-validation algorithm, where a semantic-preserving transformation is applied during the amplification of either the training set or the testing set, and (ii) the amplification of the testing set with code snippets where the vulnerabilities are fixed. Using 11 transformations, 3 ML techniques, and 2 datasets, we find that the improved robustness only applies to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features for predicting the vulnerabilities in the testing data. Additionally, we find that the trained models are unable to generalize to the modified setting, which requires distinguishing vulnerable functions from their patches.
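
As a rough illustration of what amplifying a dataset with a semantic-preserving change could look like, the sketch below renames an identifier in a code snippet and appends the transformed copy with its original label. Identifier renaming is an assumed example; the abstract does not list its 11 transformations, and the helper names `rename_identifier` and `amplify` are hypothetical.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of `old` to `new`.

    Simplification: a regex rename also touches comments and string
    literals, so a real tool would operate on the parsed syntax tree.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


def amplify(dataset, transform):
    """Return the original (code, label) pairs plus transformed copies.

    Labels are carried over unchanged because the transformation is
    assumed to preserve program semantics.
    """
    return dataset + [(transform(code), label) for code, label in dataset]
```

A model that relies on genuine vulnerability features should score the same on the original and the renamed snippet, which is what an amplified test set probes.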

artificial intelligence, machine learning, transformation, (14 more...)

arXiv.org Artificial Intelligence

2306.17193

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Germany (0.04)
(3 more...)

Genre: Research Report > New Finding (0.94)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.34)

Data Amplification: A Unified and Competitive Approach to Property Estimation

Hao, Yi, Orlitsky, Alon, Suresh, Ananda Theertha, Wu, Yihong

Neural Information Processing Systems | Feb-14-2020, 20:26:22 GMT

Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive property estimator that, for a wide class of properties and for all underlying distributions, uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.

artificial intelligence, estimator, machine learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.52)

Data Amplification: A Unified and Competitive Approach to Property Estimation

Hao, Yi, Orlitsky, Alon, Suresh, Ananda Theertha, Wu, Yihong

Neural Information Processing Systems | Dec-31-2018

Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive property estimator that, for a wide class of properties and for all underlying distributions, uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.

artificial intelligence, estimator, machine learning, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > San Diego County > La Jolla (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Learning Many Related Tasks at the Same Time with Backpropagation

Caruana, Rich

Neural Information Processing Systems | Dec-31-1995

Hinton [6] proposed that generalization in artificial neural nets should improve if nets learn to represent the domain's underlying regularities. Abu-Mostafa's hints work [1] shows that the outputs of a backprop net can be used as inputs through which domain-specific information can be given to the net. We extend these ideas by showing that a backprop net learning many related tasks at the same time can use these tasks as inductive bias for each other and thus learn better. We identify five mechanisms by which multitask backprop improves generalization and give empirical evidence that multitask backprop generalizes better in real domains.
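
The idea of one backprop net learning several tasks at once can be sketched as a shared hidden layer feeding one output unit per task, so that every task's error signal shapes the shared representation. The architecture details below (tanh hidden units, linear outputs, summed squared error, the function names) are assumptions for illustration; the abstract does not specify them.

```python
import math


def forward(x, W_hidden, W_out):
    """Shared tanh hidden layer feeding one linear output unit per task."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    y = [sum(w * hj for w, hj in zip(row, h)) for row in W_out]
    return h, y


def train_step(x, targets, W_hidden, W_out, lr=0.1):
    """One gradient step on 0.5 * sum of squared errors over all tasks.

    Updates the weight lists in place and returns the loss measured
    before the update.
    """
    h, y = forward(x, W_hidden, W_out)
    errors = [yk - tk for yk, tk in zip(y, targets)]
    # Each hidden unit receives the error of every task: this is where
    # the tasks act as inductive bias for one another.
    grad_h = [
        sum(errors[k] * W_out[k][j] for k in range(len(W_out)))
        for j in range(len(h))
    ]
    for k, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] -= lr * errors[k] * h[j]
    for j, row in enumerate(W_hidden):
        delta = grad_h[j] * (1.0 - h[j] ** 2)  # derivative of tanh
        for i in range(len(row)):
            row[i] -= lr * delta * x[i]
    return 0.5 * sum(e * e for e in errors)
```

Calling `train_step` repeatedly with a two-element `targets` list trains both tasks through the same hidden weights; a single-task net would correspond to a `W_out` with one row.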

artificial intelligence, backprop, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Backpropagation (0.41)
