AITopics | dense checkpoint


dense checkpoint


167bcf2af2cd08fcf75b932022db0311-Paper-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 11:16:27 GMT

dense checkpoint, moe jetpack, moe model, (15 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Communications > Networks (0.68)
(3 more...)


MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Neural Information Processing Systems, Dec-24-2025, 02:43:44 GMT

artificial intelligence, name change, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.38)



MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Neural Information Processing Systems, May-26-2025, 17:08:46 GMT

The sparsely activated mixture of experts (MoE) model presents an effective alternative to densely activated (dense) models, combining improved accuracy with computational efficiency. However, training MoE models from scratch requires extensive data and computational resources, a challenge that limits their widespread adoption. To address this, we introduce MoE Jetpack, a framework designed to fine-tune the abundant and easily accessible dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which initializes MoE models with dense checkpoints to accelerate convergence and enhance accuracy, minimizing the need for extensive pre-training; (2) the hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture to enhance fine-tuning performance and efficiency. Experimental results indicate that MoE Jetpack doubles the convergence speed and enhances accuracy by 2.8% on ImageNet-1K.

artificial intelligence, machine learning, moe jetpack, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.44)
Information Technology > Artificial Intelligence > Vision (0.40)


Llama 3 Meets MoE: Efficient Upcycling

Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal

arXiv.org Artificial Intelligence, Dec-13-2024

Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than 1% of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a 2% improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of 46.8% during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
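
The "8-Expert Top-2" configuration mentioned above can be illustrated with a small routing sketch. This is not the paper's or NeMo's implementation: the function name `top_k_route`, the example logits, and the choice to take a softmax over all experts and then renormalize the two selected weights are assumptions made for illustration (other conventions exist, such as taking the softmax over the selected logits only).

```python
import math

def top_k_route(logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Softmax over all expert logits (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1 (an assumed convention).
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top_k_route(logits, k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.4f}")
```

Only the two selected experts process the token, and their outputs would be combined using these weights, which is why capacity can grow with the number of experts without a proportional rise in per-token compute.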

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.09952

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.05)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > Texas (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)


Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby

arXiv.org Artificial Intelligence, Feb-17-2023

Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality and computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
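
The abstract describes initializing a sparsely activated MoE model from a dense checkpoint. A minimal sketch of one way this could look follows, assuming each expert begins as a copy of the dense MLP weights and the router, which has no dense counterpart, is freshly initialized. The function name `upcycle_mlp`, the dictionary layout of the toy checkpoint, and the 0.02 initialization scale are hypothetical and not taken from the authors' code.

```python
import copy
import random

def upcycle_mlp(dense_mlp, num_experts, d_model, seed=0):
    """Build a hypothetical MoE layer state from one dense MLP's weights."""
    rng = random.Random(seed)
    # Each expert starts as an independent copy of the dense MLP weights.
    experts = [copy.deepcopy(dense_mlp) for _ in range(num_experts)]
    # The router is new, so it is initialized from scratch
    # (small Gaussian values; the scale is an assumption).
    router = [
        [rng.gauss(0.0, 0.02) for _ in range(num_experts)]
        for _ in range(d_model)
    ]
    return {"experts": experts, "router": router}

# Toy dense "checkpoint": a 2 -> 3 -> 2 MLP stored as nested lists.
dense_mlp = {
    "w_in": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "w_out": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}
moe = upcycle_mlp(dense_mlp, num_experts=4, d_model=2)
print(len(moe["experts"]), "experts,",
      len(moe["router"]), "x", len(moe["router"][0]), "router")
```

Because every expert starts from the same weights, the experts only diverge through further training, which is where the additional training budget mentioned in the abstract would be spent.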

checkpoint, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2212.05055

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
