FlashMoE: Fast Distributed MoE in a Single Kernel
Aimuyo, Osayamen Jonathan, Oh, Byungsoo, Singh, Rachee
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, thereby improving payload efficiency by eliminating bloated or redundant network payloads in sparsely activated layers. When evaluated on an 8-H100 GPU node with MoE models comprising up to 128 experts and 16K token sequences, FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines, despite using FP32 where the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
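The dispatch, compute, and combine phases that FlashMoE fuses into one persistent kernel can be illustrated with a minimal sketch. This is not FlashMoE's implementation; the shapes, top-k routing, and dense NumPy experts here are simplifying assumptions used only to show the phase structure.

```python
import numpy as np

def moe_forward(tokens, gate_w, experts, top_k=2):
    """tokens: (T, d); gate_w: (d, E); experts: list of E (d, d) matrices."""
    logits = tokens @ gate_w                       # router scores, (T, E)
    idx = np.argsort(logits, axis=1)[:, -top_k:]   # top-k experts per token
    sel = np.take_along_axis(logits, idx, axis=1)  # softmax over selected only
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(tokens)
    for e, W_e in enumerate(experts):              # "dispatch" tokens per expert
        for k in range(top_k):
            mask = idx[:, k] == e
            if mask.any():                         # expert compute + "combine"
                out[mask] += w[mask, k:k + 1] * (tokens[mask] @ W_e)
    return out

rng = np.random.default_rng(0)
T, d, E = 8, 4, 4
y = moe_forward(rng.normal(size=(T, d)),
                rng.normal(size=(d, E)),
                [rng.normal(size=(d, d)) for _ in range(E)])
print(y.shape)  # (8, 4)
```

In a real distributed MoE, the per-expert masking step becomes inter-GPU communication; FlashMoE's contribution is performing that step with device-initiated (R)DMA inside the same kernel rather than host-driven collectives.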
A PAML with Gaussian processes
This section details how PAML can be combined with Gaussian processes, as in our experiments. Alternatively, one can use other probabilistic methods, e.g., Bayesian Neural Networks [1]. This combination also enables mini-batch training for further gains in computational efficiency. During evaluation, we compute errors with respect to the normalized outputs, since the observed environments' state representations include dimensions of differing scale. To generate trajectories, we use control signals that alternate back and forth from one end of the range to the other; this policy yielded better coverage of the state space than a random walk.
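The data-collection policy described above (controls sweeping back and forth across their range) can be sketched as a triangle wave. The range, horizon, and sweep period below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def alternating_controls(u_min, u_max, steps, period=20):
    """Triangle-wave control sequence sweeping between u_min and u_max."""
    t = np.arange(steps)
    phase = (t % period) / period                    # position within one sweep
    tri = np.where(phase < 0.5, 2 * phase, 2 - 2 * phase)
    return u_min + (u_max - u_min) * tri

u = alternating_controls(-1.0, 1.0, steps=40)
print(u.min(), u.max())  # -1.0 1.0
```

Unlike a random walk, every sweep is guaranteed to visit both extremes of the control range, which is why such a policy tends to cover the reachable state space more evenly.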
Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network
Hsiao, Vincent, Roberts, Mark, Hiatt, Laura M., Konidaris, George, Nau, Dana
A major challenge for reinforcement learning is automatically generating curricula to reduce training time or improve performance in some target task. We introduce SEBNs (Skill-Environment Bayesian Networks) which model a probabilistic relationship between a set of skills, a set of goals that relate to the reward structure, and a set of environment features to predict policy performance on (possibly unseen) tasks. We develop an algorithm that uses the inferred estimates of agent success from SEBN to weigh the possible next tasks by expected improvement. We evaluate the benefit of the resulting curriculum on three environments: a discrete gridworld, continuous control, and simulated robotics. The results show that curricula constructed using SEBN frequently outperform other baselines.
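The curriculum-selection step can be sketched as follows: candidate next tasks are weighted by the expected improvement in predicted success. The SEBN inference itself is replaced here by given dictionaries of success estimates; the task names and probabilities are illustrative assumptions.

```python
def expected_improvement_weights(p_success_now, p_success_after):
    """Both args map task -> predicted success probability.
    Returns normalized sampling weights favoring the largest gain."""
    gains = {t: max(p_success_after[t] - p_success_now[t], 0.0)
             for t in p_success_now}
    total = sum(gains.values())
    if total == 0.0:                      # no task promises improvement:
        n = len(gains)                    # fall back to uniform weights
        return {t: 1.0 / n for t in gains}
    return {t: g / total for t, g in gains.items()}

w = expected_improvement_weights(
    {"grid_easy": 0.9, "grid_hard": 0.2, "robot": 0.1},
    {"grid_easy": 0.92, "grid_hard": 0.5, "robot": 0.15})
print(max(w, key=w.get))  # grid_hard
```

The already-mastered task receives little weight while the task with the largest predicted gain dominates, which is the behavior a curriculum generator wants.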
AIDE: AI-Driven Exploration in the Space of Code
Jiang, Zhengyao, Schmidt, Dominik, Srikanth, Dhruv, Xu, Dixing, Kaplan, Ian, Jacenko, Deniss, Wu, Yuxiang
Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet, behind these advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem, and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI's MLE-Bench and METR's RE-Bench. The implementation of AIDE is publicly available at https://github.com/WecoAI/aideml.
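The tree-search framing can be illustrated with a toy sketch. In AIDE, each node is an LLM-proposed code solution and `evaluate` runs the actual experiment; here the "solution" is a single number and `refine` is a Gaussian perturbation, both stand-in assumptions for illustration only.

```python
import random

def tree_search(evaluate, refine, root, steps=50, seed=0):
    """Greedy best-first expansion: always refine the best node found so far."""
    rng = random.Random(seed)
    nodes = [(evaluate(root), root)]
    for _ in range(steps):
        best_score, best = max(nodes)        # most promising solution so far
        child = refine(best, rng)            # propose a refinement of it
        nodes.append((evaluate(child), child))
    return max(nodes)

score, sol = tree_search(
    evaluate=lambda x: -(x - 3.0) ** 2,      # toy objective, peak at x = 3
    refine=lambda x, rng: x + rng.gauss(0, 0.5),
    root=0.0)
print(round(sol, 1))
```

Because refinement always starts from the current best node, compute is concentrated on promising branches rather than spread uniformly over the tree, which is the trade AIDE exploits.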
Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor
Liskavets, Barys, Roy, Shuvendu, Ushakov, Maxim, Klibanov, Mark, Etemad, Ali, Luke, Shane
The rise of Large Language Models (LLMs) has led to significant interest in prompt compression, a technique aimed at reducing the length of input prompts while preserving critical information. However, the prominent approaches in prompt compression often require explicit questions or handcrafted templates for compression, limiting their generalizability. We propose Task-agnostic Prompt Compression (TPC), a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs, and fine-tuned via reinforcement learning with a reward function designed to capture the most relevant information. The task descriptor is then utilized to compute the relevance of each sentence in the prompt to generate the compressed prompt. We introduce three model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on the LongBench and ZeroSCROLLS benchmarks, and our smallest model performs comparably to existing solutions while being considerably smaller.
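The final compression step (scoring each sentence against the task descriptor and keeping the most relevant ones) can be sketched as below. TPC uses learned context-aware sentence embeddings; the bag-of-words embedding, example sentences, and keep ratio here are simplifying assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def compress(sentences, task_descriptor, keep_ratio=0.5):
    scores = [cosine(embed(s), embed(task_descriptor)) for s in sentences]
    k = max(1, int(len(sentences) * keep_ratio))
    keep = sorted(sorted(range(len(sentences)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return [sentences[i] for i in keep]    # preserve original sentence order

out = compress(
    ["The meeting is on Tuesday.",
     "Budgets were discussed at length.",
     "Lunch was sandwiches.",
     "The budget was approved on Tuesday."],
    task_descriptor="summarize the budget decisions")
print(out)
```

The point of the task descriptor is visible even in this toy version: relevance is computed against a generated description rather than a user-supplied question, which is what makes the pipeline task-agnostic.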
Gradient Episodic Memory for Continual Learning
David Lopez-Paz, Marc'Aurelio Ranzato
One major obstacle towards AI is the poor ability of models to solve new problems more quickly and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM), that alleviates forgetting while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.
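GEM's anti-forgetting mechanism can be sketched in its single-constraint form: before each update, the new gradient is checked against a reference gradient computed on the episodic memory, and any interfering component is projected away. Full GEM solves a quadratic program with one constraint per past task; the one-constraint simplification below is the A-GEM-style variant, used here only for illustration.

```python
import numpy as np

def project_gradient(g, g_ref):
    """Return g projected so the update cannot increase loss on the
    memory examples, i.e. dot(result, g_ref) >= 0."""
    dot = float(g @ g_ref)
    if dot >= 0.0:                       # no interference with past tasks
        return g
    return g - (dot / float(g_ref @ g_ref)) * g_ref

g = np.array([1.0, -1.0])                # current-task gradient
g_ref = np.array([0.0, 1.0])             # episodic-memory gradient (conflicts)
g_proj = project_gradient(g, g_ref)
print(g_proj)  # [1. 0.]
```

The projected gradient keeps the component that helps the current task while zeroing the component that would undo progress on the remembered tasks, which is exactly the forgetting/transfer trade-off the metrics in the paper measure.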