Data Portraits: Recording Foundation Model Training Data
Foundation models are trained on increasingly immense and opaque datasets. Even as these models become central to building AI systems, it can be difficult to answer a straightforward question: has the model already encountered a given example during training? We therefore propose widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. We first outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space-efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at dataportraits.org and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
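To illustrate the sketching idea, here is a minimal Bloom-filter-style membership sketch over fixed-length character n-grams, in the spirit of the tool described above. The class name, hashing scheme, and parameters are illustrative assumptions, not the released dataportraits.org implementation.

```python
# Minimal sketch of a membership artifact over strided character n-grams,
# assuming a Bloom filter; parameters are illustrative, not the paper's.
import hashlib

class NgramSketch:
    def __init__(self, num_bits=1 << 24, num_hashes=4, n=50):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.n = n  # length of each character n-gram

    def _indexes(self, gram):
        # Derive num_hashes pseudo-independent bit positions from the gram.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add_document(self, text):
        # Record non-overlapping n-grams of the training document.
        for i in range(0, max(1, len(text) - self.n + 1), self.n):
            for idx in self._indexes(text[i:i + self.n]):
                self.bits[idx // 8] |= 1 << (idx % 8)

    def contains(self, gram):
        # True if every hash position is set; may rarely be a false positive.
        return all(self.bits[i // 8] >> (i % 8) & 1 for i in self._indexes(gram))
```

Querying `contains` on the n-grams of a test example then flags likely training-set overlap; false positives occur at a low rate controlled by the filter size, which is what keeps the overhead a small fraction of the dataset size.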
Generative Status Estimation and Information Decoupling for Image Rain Removal: Supplementary Material
The feature masking module contains two convolutional layers and computes the rain (or object) feature map. A pair of batch normalization and ReLU layers sits between the adjacent convolutional layers. Each convolutional layer uses 3×3 kernels. In total, SEIDNet contains 132 convolutional layers. We use the Adam solver to optimize the parameters of SEIDNet.
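For concreteness, here is a minimal PyTorch sketch of the feature-masking block as described: two 3×3 convolutions with a BatchNorm + ReLU pair between them. The channel count is an assumption for illustration, not a value from the paper.

```python
# A minimal sketch of the feature-masking block, assuming 64 channels.
import torch.nn as nn

class FeatureMasking(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),   # BN + ReLU between the two convs
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Produces the rain (or object) feature map from input features.
        return self.block(x)
```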
Approximation-Aware Bayesian Optimization
High-dimensional Bayesian optimization (BO) tasks such as molecular design often require >10,000 function evaluations before obtaining meaningful results. While methods like sparse variational Gaussian processes (SVGPs) reduce computational requirements in these settings, the underlying approximations result in suboptimal data acquisitions that slow the progress of optimization. In this paper we modify SVGPs to better align with the goals of BO: targeting informed data acquisition rather than global posterior fidelity. Using the framework of utility-calibrated variational inference, we unify GP approximation and data acquisition into a joint optimization problem, thereby ensuring optimal decisions under a limited computational budget. Our approach can be used with any decision-theoretic acquisition function and is readily compatible with trust region methods like TuRBO. We derive efficient joint objectives for the expected improvement and knowledge gradient acquisition functions for standard and batch BO. Our approach outperforms standard SVGPs on high-dimensional benchmark tasks in control and molecular design.
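As a point of reference for the acquisition functions named above, here is a minimal sketch of expected improvement evaluated from the mean and standard deviation of an approximate (e.g., SVGP) posterior at a candidate point. This is the standard decision-theoretic utility being calibrated, not the paper's joint objective.

```python
# Expected improvement (maximization) under a Gaussian posterior N(mu, sigma^2).
# mu/sigma would come from an approximate GP posterior such as an SVGP.
import math

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)  # degenerate posterior
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (mu - f_best) * cdf + sigma * pdf
```

The paper's point is that when `mu` and `sigma` come from a variational approximation, fitting that approximation for global posterior fidelity is not the same as fitting it so that this utility picks good points; the joint objective optimizes both together.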
Generative Status Estimation and Information Decoupling for Image Rain Removal
Image rain removal requires accurately separating rain-streak pixels from object-texture pixels. However, the similar appearances of rain and objects cause pixels to be misclassified, leaving rain streaks in the result or losing object details. In this paper, we propose SEIDNet, equipped with generative Status Estimation and Information Decoupling for rain removal. In status estimation, we embed the pixel-wise statuses into a status space, where each status indicates whether a pixel belongs to rain or object. The status space allows sampling multiple statuses for a pixel, thus capturing ambiguous rain or object pixels. In information decoupling, we condition on the pixel-wise statuses to decouple the appearance information of rain and object at each pixel. Based on the decoupled information, we construct a kernel space, where multiple kernels are sampled for each pixel to remove the rain and recover the object appearance. We evaluate SEIDNet on public datasets, achieving state-of-the-art performance in image rain removal. The experimental results also demonstrate the generalization of SEIDNet, which can be easily extended to achieve state-of-the-art performance on other image restoration tasks (e.g., snow, haze, and shadow removal).
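The following toy sketch illustrates the two ideas in the abstract under heavy assumptions: discrete per-pixel statuses are sampled from predicted logits (here via Gumbel-softmax), and the sampled status map is used to decouple rain and object features. All shapes, sampling choices, and function names are hypothetical, not the SEIDNet implementation.

```python
# Toy sketch: sample per-pixel rain/object statuses, then decouple features.
import torch
import torch.nn.functional as F

def sample_statuses(status_logits, num_samples=4, tau=1.0):
    """status_logits: (B, 2, H, W) -> num_samples one-hot status maps.

    Drawing several samples lets an ambiguous pixel be treated as rain in
    some hypotheses and as object in others, mirroring the status space.
    """
    logits = status_logits.permute(0, 2, 3, 1)  # (B, H, W, 2)
    samples = [F.gumbel_softmax(logits, tau=tau, hard=True)
               for _ in range(num_samples)]
    return torch.stack(samples, dim=1)          # (B, S, H, W, 2)

def decouple(features, statuses):
    """Split (B, C, H, W) features into rain/object parts per status sample."""
    rain_mask = statuses[..., 0].unsqueeze(2)   # (B, S, 1, H, W)
    feats = features.unsqueeze(1)               # (B, 1, C, H, W)
    return feats * rain_mask, feats * (1.0 - rain_mask)
```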
Watermarking Makes Language Models Radioactive
Tom Sander, Pierre Fernandez
We investigate the radioactivity of text generated by large language models (LLMs), i.e., whether it is possible to detect that such synthetic text was used to train a subsequent LLM. Current methods like membership inference or active IP protection either work only in settings where the suspected text is known or do not provide reliable statistical guarantees. We discover that, on the contrary, it is possible to reliably determine whether a language model was trained on synthetic data if that data was output by a watermarked LLM.
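For intuition, here is a sketch of the kind of keyed statistical test that underlies watermark detection: count "green-list" tokens under a secret key and compute a z-score against the no-watermark null hypothesis. The hashing scheme and the green fraction GAMMA are illustrative assumptions, not the authors' exact radioactivity test.

```python
# Keyed green-list scoring with a z-test; all constants are assumptions.
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary that is "green" per context

def is_green(prev_token_id, token_id, key="secret"):
    # Pseudorandom, keyed partition of the vocabulary given the previous token.
    h = hashlib.sha256(f"{key}:{prev_token_id}:{token_id}".encode()).digest()
    return (h[0] / 255.0) < GAMMA

def watermark_zscore(token_ids, key="secret"):
    """Larger z => stronger evidence the text carries the watermark."""
    hits = sum(is_green(p, t, key) for p, t in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Radioactivity detection applies this style of hypothesis test not to a single suspect text but to the outputs of the suspect model, asking whether the watermark signal statistically survives fine-tuning on the watermarked data.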
Learning discrete distributions with infinite support
We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. In a departure from the established paradigm, we make no structural assumptions whatsoever on the sampling distribution. In such a setting, distribution-free risk bounds are impossible, and the best one could hope for is a fully empirical data-dependent bound. We derive precisely such bounds, and demonstrate that these are, in a well-defined sense, the best possible. Our main discovery is that the half-norm of the empirical distribution provides tight upper and lower estimates on the empirical risk.
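For concreteness, here is a small worked sketch of the empirical quantity involved, assuming the standard definition of the half-norm ||p||_{1/2} = (sum_i sqrt(p_i))^2; the constants in the paper's bounds are not reproduced here.

```python
# Compute the half-norm of the empirical distribution from raw samples.
from collections import Counter
import math

def empirical_half_norm(samples):
    n = len(samples)
    counts = Counter(samples)  # empirical distribution over observed support
    return sum(math.sqrt(c / n) for c in counts.values()) ** 2

# Example: a sample over four distinct values; the half-norm grows with the
# effective (observed) support size, which is what makes it fully empirical.
print(empirical_half_norm([1, 1, 2, 3, 3, 3, 4]))
```

Because the half-norm is computed from the sample alone, it yields a data-dependent bound without any structural assumption on the sampling distribution, which is exactly the regime the abstract describes.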