AITopics | Education

Education

A Drifting-Games Analysis for Online Learning and Applications to Boosting

Neural Information Processing Systems, Sep-30-2025, 08:42:58 GMT

We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates. Finally, we translate our new Hedge algorithm into a new adaptive boosting algorithm that is computationally faster as shown in experiments, since it ignores a large number of examples on each round.
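For readers unfamiliar with the Hedge setting named in the abstract, the following is a minimal sketch of the classical exponential-weights Hedge algorithm. It is not the parameter-free algorithm the paper proposes: the learning rate `eta` is a tuning parameter of this classical variant, and the function name and argument layout are illustrative assumptions.

```python
import math


def hedge(loss_rounds, eta):
    """Classical Hedge (exponential weights) over N actions.

    loss_rounds: list of per-round loss vectors, each entry in [0, 1].
    eta: learning rate (a tuning parameter of this classical variant).
    Returns (learner's cumulative expected loss, per-action cumulative losses).
    """
    n = len(loss_rounds[0])
    cum_loss = [0.0] * n      # cumulative loss of each action so far
    learner_loss = 0.0
    for losses in loss_rounds:
        # Play the distribution proportional to exp(-eta * cumulative loss).
        shift = min(cum_loss)  # subtracting the minimum avoids underflow
        weights = [math.exp(-eta * (c - shift)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        # The learner suffers the expected loss under its distribution.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum_loss = [c + l for c, l in zip(cum_loss, losses)]
    return learner_loss, cum_loss
```

The regret against the best single action is `learner_loss - min(cum_loss)`. With losses in [0, 1] and `eta = sqrt(8 ln N / T)`, the classical analysis bounds it by `sqrt(T ln N / 2)`.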

algorithm, drifting-game analysis, online learning and application, (4 more...)

Neural Information Processing Systems

Industry: Education > Educational Setting > Online (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.98)

Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning

Neural Information Processing Systems, Sep-30-2025, 08:26:25 GMT

We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show that when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.

asynchronous, delay-tolerant algorithm, name change, (4 more...)

Neural Information Processing Systems

Industry: Education > Educational Setting > Online (0.44)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

Feng, Tao, Zhang, Haozhen, Lei, Zijie, Han, Pengrui, Patwary, Mostofa, Shoeybi, Mohammad, Catanzaro, Bryan, You, Jiaxuan

arXiv.org Artificial Intelligence, Sep-30-2025

The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B--671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three elaborated levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.1054

Country:

North America > United States > Colorado (0.04)
North America > United States > New York (0.04)
Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
(7 more...)

Genre:

Personal > Honors (0.48)
Research Report > Promising Solution (0.45)

Industry:

Leisure & Entertainment > Sports > Football (1.00)
Education (1.00)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Learning in an Echo Chamber: Online Learning with Replay Adversary

Dmitriev, Daniil, Franck, Harald Eskelund, Heinzler, Carolin, Sanyal, Amartya

arXiv.org Machine Learning, Sep-30-2025

As machine learning systems increasingly train on self-annotated data, they risk reinforcing errors and becoming echo chambers of their own beliefs. We model this phenomenon by introducing a learning-theoretic framework: Online Learning in the Replay Setting. In round $t$, the learner outputs a hypothesis $\hat{h}_t$; the adversary then reveals either the true label $f^\ast(x_t)$ or a replayed label $\hat{h}_i(x_t)$ from an earlier round $i < t$. A mistake is counted only when the true label is shown, yet classical algorithms such as the SOA or the halving algorithm are easily misled by the replayed errors. We introduce the Extended Threshold dimension, $\mathrm{ExThD}(\mathcal{H})$, and prove matching upper and lower bounds that make $\mathrm{ExThD}(\mathcal{H})$ the exact measure of learnability in this model. A closure-based learner makes at most $\mathrm{ExThD}(\mathcal{H})$ mistakes against any adaptive adversary, and no algorithm can perform better. For stochastic adversaries, we prove a similar bound for every intersection-closed class. The replay setting is provably harder than the classical mistake bound setting: some classes have constant Littlestone dimension but arbitrarily large $\mathrm{ExThD}(\mathcal{H})$. Proper learning exhibits an even sharper separation: a class is properly learnable under replay if and only if it is (almost) intersection-closed. Otherwise, every proper learner suffers $Ω(T)$ errors, whereas our improper algorithm still achieves the $\mathrm{ExThD}(\mathcal{H})$ bound. These results give the first tight analysis of learning against replay adversaries, based on new results for closure-type algorithms.
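The round structure described above can be sketched as a small simulation. Everything concrete here is an illustrative assumption rather than the paper's construction: the hypothesis class (intervals on the integers, which is intersection-closed), the toy learner that predicts 1 only on the smallest interval containing the positive labels it has seen, and an adversary that replays at random instead of adaptively.

```python
import random


class IntervalClosureLearner:
    """Toy learner for intervals on the integers (hypothetical example).

    It predicts 1 only inside the smallest interval containing every
    positively labelled point seen so far, and ignores negative labels.
    """

    def __init__(self):
        self.lo = None
        self.hi = None

    def hypothesis(self):
        lo, hi = self.lo, self.hi  # freeze the current state
        return lambda x: int(lo is not None and lo <= x <= hi)

    def update(self, x, label):
        if label == 1:
            self.lo = x if self.lo is None else min(self.lo, x)
            self.hi = x if self.hi is None else max(self.hi, x)


def run_replay_game(learner, target, xs, replay_prob, rng):
    """Play the replay protocol and return the number of counted mistakes."""
    history = []   # hypotheses output in earlier rounds
    mistakes = 0
    for x in xs:
        h = learner.hypothesis()
        if history and rng.random() < replay_prob:
            # Replayed label: an earlier hypothesis evaluated at x.
            label = rng.choice(history)(x)
        else:
            # True label: only these rounds can count as mistakes.
            label = target(x)
            if h(x) != label:
                mistakes += 1
        learner.update(x, label)
        history.append(h)
    return mistakes
```

In this toy setup every hypothesis stays inside the target interval, so a replayed positive label is always correct and replayed negative labels are ignored. That is the intuition behind using closure-style learners here, though the sketch makes no claim about the paper's bounds.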

adversary, algorithm, learner, (12 more...)

arXiv.org Machine Learning

2509.25135

Country:

North America > United States > Pennsylvania (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Education > Educational Setting > Online (0.71)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.69)

Singleton-Optimized Conformal Prediction

Wang, Tao, Sun, Yan, Dobriban, Edgar

arXiv.org Machine Learning, Sep-30-2025

Conformal prediction can be used to construct prediction sets that cover the true outcome with a desired probability, but can sometimes lead to large prediction sets that are costly in practice. The most useful outcome is a singleton prediction (an unambiguous decision), yet existing efficiency-oriented methods primarily optimize average set size. Motivated by this, we propose a new nonconformity score that aims to minimize the probability of producing non-singleton sets. Starting from a non-convex constrained optimization problem, we provide a geometric reformulation and an associated algorithm for computing the nonconformity score and the resulting split conformal prediction sets in O(K) time for K-class problems. Using this score in split conformal prediction leads to our proposed Singleton-Optimized Conformal Prediction (SOCOP) method. We evaluate our method in experiments on image classification and LLM multiple-choice question-answering, comparing with standard nonconformity scores such as the (negative) label probability estimates and their cumulative distribution function, both of which are motivated by optimizing length. The results show that SOCOP increases singleton frequency (sometimes by over 20%) compared to the above scores, with minimal impact on average set size.
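As background for the baseline the abstract compares against, here is a minimal sketch of split conformal prediction using the label-probability score `s(x, y) = 1 - p(y | x)`. It does not implement the SOCOP score, and the function names and data layout are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split conformal threshold for the score s(x, y) = 1 - p(y | x).

    cal_probs: per-example lists of class probabilities (calibration set).
    cal_labels: true class index for each calibration example.
    alpha: target miscoverage level, e.g. 0.1 for 90% coverage.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    # If the rank exceeds n, every label must be included to keep coverage.
    return scores[k - 1] if k <= n else float("inf")


def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= threshold]


def singleton_frequency(test_probs, threshold):
    """Fraction of test points whose prediction set has exactly one label."""
    sets = [prediction_set(p, threshold) for p in test_probs]
    return sum(len(s) == 1 for s in sets) / len(sets)
```

`singleton_frequency` is the quantity the abstract reports improving, while average set size can be computed from the same prediction sets.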

nonconformity score, prediction, socop, (13 more...)

arXiv.org Machine Learning

2509.24095

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > New Jersey (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.87)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don't Know

Li, Albus Yizhuo

arXiv.org Machine Learning, Sep-30-2025

The Mixture-of-Experts (MoE) architecture has enabled the creation of massive yet efficient Large Language Models (LLMs). However, the standard deterministic routing mechanism presents a significant limitation: its inherent brittleness is a key contributor to model miscalibration and overconfidence, resulting in systems that often do not know what they don't know. This thesis confronts this challenge by proposing a structured Bayesian MoE routing framework. Instead of forcing a single, deterministic expert selection, our approach models a probability distribution over the routing decision itself. We systematically investigate three families of methods that introduce this principled uncertainty at different stages of the routing pipeline: in the weight-space, the logit-space, and the final selection-space. Through a series of controlled experiments on a 3-billion-parameter MoE model, we demonstrate that this framework significantly improves routing stability, in-distribution calibration, and out-of-distribution (OoD) detection. The results show that by targeting this core architectural component, we can create a more reliable internal uncertainty signal. This work provides a practical and computationally tractable pathway towards building more robust and self-aware LLMs, taking a crucial step towards making them know what they don't know.

calibration, experiment, router, (16 more...)

arXiv.org Machine Learning

2509.2383

Country:

North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.94)
(3 more...)

Learning single index model with gradient descent: spectral initialization and precise asymptotics

Chen, Yuchen, Shen, Yandi

arXiv.org Machine Learning, Sep-30-2025

Non-convex optimization plays a central role in many statistics and machine learning problems. Despite the landscape irregularities of general non-convex functions, recent work has shown that for many learning problems with random data and a large enough sample size, there exists a region around the true signal with a benign landscape. Motivated by this observation, a widely used strategy is a two-stage algorithm, where we first apply a spectral initialization to land in this region, and then run gradient descent for further refinement. While this two-stage algorithm has been extensively analyzed for many non-convex problems, the precise distributional properties of both its transient and long-time behavior remain to be understood. In this work, we study this two-stage algorithm in the context of single index models under the proportional asymptotics regime. We derive a set of dynamical mean field equations, which describe the precise behavior of the trajectory of spectrally initialized gradient descent in the large system limit. We further show that when the spectral initialization successfully lands in a region of benign landscape, the above equation system is asymptotically time-translation invariant and exponentially convergent, and thus admits a set of long-time fixed points that represent the mean field characterization of the limiting point of the gradient descent dynamics. As a proof of concept, we demonstrate our general theory in the example of regularized Wirtinger flow for phase retrieval.

assumption 2, diag, (15 more...)

arXiv.org Machine Learning

2509.23527

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.40)

Industry: Education > Focused Education > Special Education (0.44)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Statistical Inference for Gradient Boosting Regression

Fang, Haimo, Tan, Kevin, Hooker, Giles

arXiv.org Machine Learning, Sep-30-2025

Gradient boosting is widely popular due to its flexibility and predictive accuracy. However, statistical inference and uncertainty quantification for gradient boosting remain challenging and under-explored. We propose a unified framework for statistical inference in gradient boosting regression. Our framework integrates dropout or parallel training with a recently proposed regularization procedure that allows for a central limit theorem (CLT) for boosting. With these enhancements, we surprisingly find that increasing the dropout rate and the number of trees grown in parallel at each iteration substantially enhances signal recovery and overall performance. Our resulting algorithms enjoy similar CLTs, which we use to construct built-in confidence intervals, prediction intervals, and rigorous hypothesis tests for assessing variable importance. Numerical experiments demonstrate that our algorithms perform well, interpolate between regularized boosting and random forests, and confirm the validity of their built-in statistical inference procedures.

algorithm 1, algorithm 2, zhou and hooker, (13 more...)

arXiv.org Machine Learning

2509.23127

Country: North America > United States > Pennsylvania (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Metadata-Guided Adaptable Frequency Scaling across Heterogeneous Applications and Devices

Yan, Jinqi, He, Fang, Sang, Qianlong, Tong, Bifeng, Sun, Peng, Gong, Yili, Hu, Chuang, Cheng, Dazhao

arXiv.org Machine Learning, Sep-30-2025

Dynamic Voltage and Frequency Scaling (DVFS) is essential for enhancing energy efficiency in mobile platforms. However, traditional heuristic-based governors are increasingly inadequate for managing the complexity of heterogeneous System-on-Chip designs and diverse application workloads. Although reinforcement learning approaches offer improved performance, their poor generalization capability and reliance on extensive retraining for each hardware and application combination lead to significant deployment costs. In this work, we observe that device and application metadata inherently encapsulate valuable knowledge for DVFS, presenting an opportunity to overcome these limitations. We formulate DVFS for heterogeneous devices and applications as a multi-task reinforcement learning problem. We introduce MetaDVFS, a metadata-guided framework that systematically leverages metadata to discover and transfer shared knowledge across DVFS tasks. Evaluations on five Google Pixel devices running six applications show that MetaDVFS achieves up to 17% improvement in Performance-Power Ratio and up to 26% improvement in Quality of Experience. Compared to state-of-the-art methods, MetaDVFS delivers 70.8% faster adaptation (3.5 ± 1.1 vs. 11.8 ± 5.2 minutes) and 5.8-27.6% [...]. These results establish MetaDVFS as an effective and scalable solution for DVFS deployment in heterogeneous mobile environments.

Dynamic Voltage and Frequency Scaling (DVFS) is an essential technique for effectively improving energy efficiency in battery-powered mobile platforms. DVFS adjusts the operating voltage and frequency of a device in response to current workload demands [1]. Experimental evaluations report energy savings exceeding 26% on mobile MPSoCs with DVFS enabled, compared to statically managed systems [2]. Traditional DVFS policies typically rely on heuristic-based governors, such as ondemand and schedutil, which make frequency decisions based primarily on simple utilization metrics.

Jinqi Yan, Qianlong Sang, Yili Gong, Chuang Hu, and Dazhao Cheng are with the School of Computer Science, Wuhan University.

application, device-application combination, metadata, (15 more...)

arXiv.org Machine Learning

2509.22707

Country:

Asia > China > Hubei Province > Wuhan (0.25)
Europe (0.04)
Asia > China > Hong Kong (0.04)
(3 more...)

Genre: Research Report (0.70)

Industry:

Semiconductors & Electronics (1.00)
Information Technology (1.00)
Education (1.00)
Energy (0.88)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Singh, Aasheesh, Vaddina, Vishal, Birru, Dagnachew

arXiv.org Artificial Intelligence, Sep-30-2025

We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
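For orientation, the odds-ratio objective the abstract refers to is usually written as below. This is the standard ORPO form as commonly stated, not a formula taken from this listing. Treating the teacher trace as the preferred response $y_w$ and the student trace as the dispreferred response $y_l$ is an assumption based on the abstract's wording, and $\lambda$ is a weighting hyperparameter.

```latex
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right), \qquad
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \lambda \, \mathcal{L}_{\mathrm{OR}}
```

Here $\sigma$ is the logistic sigmoid and $\mathcal{L}_{\mathrm{SFT}}$ is the usual negative log-likelihood on the preferred response, so the second term only adds a penalty when the model assigns comparatively high odds to the dispreferred response.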

distillation, large language model, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.251

Country:

North America > Canada (0.16)
North America > United States (0.15)

Genre: Research Report (0.40)

Industry: Education (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)
