AITopics | distillation

Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
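The $K^{-1/2}$ scaling of the retrieval-bandwidth slack can be checked numerically. The sketch below evaluates $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ as written in the abstract. The abstract does not define $v(\cdot)$, so `quantization_variance` is a hypothetical stand-in (the variance of a uniform $B$-bit quantizer on $[0, 1]$), and the function names and the choice $f_{\max} = 1$ are illustrative assumptions, not the paper's.

```python
import math
from typing import Sequence


def quantization_variance(bits: int) -> float:
    """Hypothetical v(B): variance of a uniform B-bit quantizer on [0, 1].

    The abstract leaves v unspecified; this choice is an assumption.
    """
    step = 2.0 ** (-bits)
    return step * step / 12.0


def rag_slack(f_max: float, bits_per_node: Sequence[int]) -> float:
    """Evaluate f_max * sqrt(K^-2 * sum_i v(B_i)) for K = len(bits_per_node)."""
    k = len(bits_per_node)
    total = sum(quantization_variance(b) for b in bits_per_node)
    return f_max * math.sqrt(total / (k * k))


# Per-node-uniform regime: every node gets the same budget B = 8 bits.
slack_4 = rag_slack(1.0, [8] * 4)
slack_16 = rag_slack(1.0, [8] * 16)

# Quadrupling K should halve the slack if it scales as K^(-1/2).
ratio = slack_4 / slack_16
print(ratio)
```

With uniform budgets the sum equals $K\,v(B)$, so the slack reduces to $f_{\max}\sqrt{v(B)/K}$ and the ratio printed above is close to 2 regardless of which $v$ is plugged in.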

large language model, machine learning, natural language, (22 more...)

arXiv.org Machine Learning

2605.09986

Country:

North America > United States (0.28)
North America > Mexico (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Elon Musk Seemingly Admits xAI Has Used OpenAI's Models to Train Its Own

WIRED | Apr-30-2026, 17:41:14 GMT

While answering questions under oath, Musk argued it's standard practice for AI labs to use their competitors' models. While testifying on Thursday in federal court, Elon Musk seemed to indicate that his AI lab may have used OpenAI's models to train xAI's own. He touched upon the topic while sitting on the witness stand answering cross-examination questions from an OpenAI attorney amid his ongoing legal battle against the ChatGPT-maker. Do you know what distillation is? It means to use one AI model to train another AI model.

large language model, machine learning, natural language, (17 more...)

WIRED

Country: North America > United States > California (0.30)

Industry:

Law > Litigation (1.00)
Government > Regional Government > North America Government > United States Government (0.96)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Fair Graph Distillation

Neural Information Processing Systems | Apr-30-2026, 10:51:57 GMT

As graph neural networks (GNNs) struggle with large-scale graphs due to high computational demands, graph data distillation promises to alleviate this issue by distilling a large real graph into a smaller distilled graph while maintaining comparable prediction performance for GNNs trained on both graphs. However, we observe that GNNs trained on distilled graphs may exhibit more severe group fairness issues than GNNs trained on real graphs, under both vanilla and fair GNN training. Motivated by these observations, we propose fair graph distillation (FGD), an advanced graph distillation approach to generate fair distilled graphs. The challenge lies in the deficiency of sensitive attributes for nodes in the distilled graph, making most debiasing methods (e.g., regularization and adversarial debiasing) intractable for distilled graphs. We develop a simple yet effective bias metric, named coherence, for distilled graphs. Based on the proposed coherence metric, we introduce a framework for fair graph distillation using a bi-level optimization algorithm. Extensive experiments demonstrate that the proposed algorithm can achieve better prediction performance-fairness trade-offs across various datasets and GNN architectures.

artificial intelligence, data mining, machine learning, (19 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models

Neural Information Processing Systems | Apr-30-2026, 06:40:33 GMT

Due to the ease of training, ability to scale, and high sample quality, diffusion models (DMs) have become the preferred option for generative modeling, with numerous pre-trained models available for a wide variety of datasets. Containing intricate information about data distributions, pre-trained DMs are valuable assets for downstream applications. In this work, we consider learning from pre-trained DMs and transferring their knowledge to other generative models in a data-free fashion. Specifically, we propose a general framework called Diff-Instruct to instruct the training of arbitrary generative models as long as the generated samples are differentiable with respect to the model parameters. Our proposed Diff-Instruct is built on a rigorous mathematical foundation where the instruction process directly corresponds to minimizing a novel divergence we call Integral Kullback-Leibler (IKL) divergence.

diff-instruct, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: Asia (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Knowledge Distillation Performs Partial Variance Reduction

Neural Information Processing Systems | Apr-30-2026, 05:37:23 GMT

Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the "teacher" model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
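The role of the distillation-loss weighting can be seen in a minimal sketch of a weighted KD objective. The code assumes the common formulation that mixes cross-entropy on the true label with the KL divergence from teacher to student, which is not necessarily the exact parametrization analyzed in the paper. The names `softmax`, `kd_loss`, `weight`, and `temperature` are illustrative.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> list:
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: Sequence[float],
            teacher_logits: Sequence[float],
            label: int,
            weight: float = 0.5,
            temperature: float = 1.0) -> float:
    """(1 - weight) * CE(label, student) + weight * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student on the true class.
    student = softmax(student_logits)
    cross_entropy = -math.log(student[label])

    # Soft-label term: KL divergence between temperature-softened outputs.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_teacher, soft_student) if t > 0.0)

    return (1.0 - weight) * cross_entropy + weight * kl


teacher = [2.0, 0.0, -1.0]
student = [0.5, 0.2, 0.1]
for w in (0.0, 0.5, 1.0):
    print(w, kd_loss(student, teacher, label=0, weight=w))
```

Setting `weight=0.0` recovers plain supervised training, while `weight=1.0` trains on the teacher signal alone. Intermediate values trade off the two terms, which is the knob the abstract says needs careful parametrization.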

artificial intelligence, distillation, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

ddbbcd937d63d5c6b935c07b1a8222ec-Supplemental-Conference.pdf

Neural Information Processing Systems | Apr-30-2026, 00:36:08 GMT

artificial intelligence, data quality, machine learning, (21 more...)

Neural Information Processing Systems

Genre:

Research Report (0.46)
Overview (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

ddbbcd937d63d5c6b935c07b1a8222ec-Paper-Conference.pdf

Neural Information Processing Systems | Apr-30-2026, 00:36:03 GMT

data quality, dimension, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Data Science > Data Quality > Data Transformation (0.68)

dc9544b26ad3579477e567588db18cfc-Paper-Conference.pdf

Neural Information Processing Systems | Apr-30-2026, 00:07:52 GMT

data mining, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(3 more...)

dbc8ce0fdfcd55172d73fb05dbae07fc-Supplemental-Conference.pdf

Neural Information Processing Systems | Apr-29-2026, 23:57:35 GMT

distillation, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Distance OP

Neural Information Processing Systems | Apr-29-2026, 23:57:32 GMT

Conventional KD methods propose various designs to allow the student model to imitate the teacher better. However, these handcrafted KD designs rely heavily on expert knowledge and may be sub-optimal for various teacher-student pairs.

distillation, evolutionary algorithm, machine learning, (19 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.71)
Information Technology > Artificial Intelligence > Cognitive Science (0.70)
(3 more...)

Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage
