Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

arXiv.org Artificial Intelligence

We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment and without the computational overhead of the dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning compared to the soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings. Large Language Models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks (Vaswani et al., 2017; Trinh et al., 2024; Chervonyi et al., 2025; Guo et al., 2025; Christianos et al., 2023), but their size and complexity make them impractical for deployment in resource-constrained environments. Distillation (Hinton et al., 2015; Czarnecki et al., 2019), a technique where a smaller student model learns from a larger teacher model, has been widely used to transfer knowledge while reducing computational costs. 
Conventional distillation methods (Sanh et al., 2020; Gu et al., 2024; Ko et al., 2024) typically focus on minimizing the divergence between the student and teacher models, often using metrics such as Kullback-Leibler (KL) divergence. However, these methods do not fully leverage additional reward signals that can provide valuable guidance, particularly in tasks requiring complex reasoning.
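
The contrast drawn above, between a hard divergence constraint and a weighted (Lagrangian) penalty, can be written out schematically. This is only a sketch under assumed notation, none of which appears in the abstract: $\pi_\theta$ is the student, $\pi_T$ the teacher, $r$ the task reward, $D$ a divergence such as KL (its direction is not specified here), $\delta$ the threshold, and $\lambda$ a dual variable. The paper's modified reward function is not reproduced.

```latex
\begin{aligned}
\text{Constrained form:}\quad
  & \max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
    \big[ r(x, y) \big] \\
  & \text{s.t.}\quad \mathbb{E}_{x \sim \mathcal{D}}
    \big[ D\big(\pi_\theta(\cdot \mid x),\, \pi_T(\cdot \mid x)\big) \big] \le \delta \\[4pt]
\text{Lagrangian relaxation:}\quad
  & \max_{\theta}\, \min_{\lambda \ge 0}\;
    \mathbb{E}\big[ r(x, y) \big]
    - \lambda \Big( \mathbb{E}\big[ D(\pi_\theta, \pi_T) \big] - \delta \Big)
\end{aligned}
```

In the relaxed form, a fixed $\lambda$ corresponds to the ad-hoc reward weighting mentioned above, whereas the constrained form states the divergence budget $\delta$ directly.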


Parameter Identification of a Differentiable Human Arm Musculoskeletal Model without Deep Muscle EMG Reconstruction

arXiv.org Artificial Intelligence

Accurate parameter identification of a subject-specific human musculoskeletal model is crucial to the development of safe and reliable physically collaborative robotic systems, for instance, assistive exoskeletons. Electromyography (EMG)-based parameter identification methods have demonstrated promising performance for personalized musculoskeletal modeling, but their applicability is limited by the difficulty of measuring deep muscle EMGs invasively. Although several strategies have been proposed to reconstruct deep muscle EMGs or activations for parameter identification, their reliability and robustness are limited by assumptions about the deep muscle behavior. In this work, we propose an approach to simultaneously identify the bone and superficial muscle parameters of a human arm musculoskeletal model without reconstructing the deep muscle EMGs. This is achieved by using only the least-squares solution of the deep muscle forces to calculate a loss gradient with respect to the model parameters, which are then identified in a differentiable optimization framework. Extensive comparative simulations show that the proposed method achieves estimation accuracy comparable to that of a similar method that has all the muscle EMGs available.


Enhancing Cluster Scheduling in HPC: A Continuous Transfer Learning for Real-Time Optimization

arXiv.org Artificial Intelligence

This is the accepted version of the paper published in 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Abstract: This study presents a machine-learning-assisted approach to optimize task scheduling in cluster systems, focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability, whereas the proposed continuous transfer learning model evolves dynamically during operations, minimizing retraining needs. Evaluated on Google Cluster Data, the model achieves over 99% accuracy, reducing computational overhead and improving scheduling latency for constrained tasks. This scalable solution enables real-time optimization, advancing machine learning integration in cluster management and paving the way for future adaptive scheduling strategies. In the rapidly evolving landscape of cloud computing and distributed high-performance environments, the efficient management of architectural and software resources has become paramount for ensuring suitable performance and minimizing latency. As industry organizations increasingly rely on cluster-based architectures to orchestrate a broad range of applications, the importance of effective task scheduling has come to the forefront. Over the last few years, traditional schedulers such as Kubernetes and others have laid the groundwork for managing containerized workloads; however, they have been found to struggle to adapt to the dynamic nature of real-time workloads and node-affinity constraints [35]. These limitations result in inefficient resource utilization and longer scheduling delays, which ultimately affect overall system performance, especially in high-performance systems [9][18]. 
In mission-critical environments, these issues can escalate, disrupting vital systems such as power networks, healthcare, defense systems, and others.


Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering

arXiv.org Artificial Intelligence

We introduce a novel framework for clustering a collection of tall matrices based on their column spaces, a problem we term Subspace Clustering of Subspaces (SCoS). Unlike traditional subspace clustering methods that assume vectorized data, our formulation directly models each data sample as a matrix and clusters them according to their underlying subspaces. We establish conceptual links to Subspace Clustering and Generalized Canonical Correlation Analysis (GCCA), and clarify key differences that arise in this more general setting. Our approach is based on a Block Term Decomposition (BTD) of a third-order tensor constructed from the input matrices, enabling joint estimation of cluster memberships and partially shared subspaces. We provide the first identifiability results for this formulation and propose scalable optimization algorithms tailored to large datasets. Experiments on real-world hyperspectral imaging datasets demonstrate that our method achieves superior clustering accuracy and robustness, especially under high noise and interference, compared to existing subspace clustering techniques. These results highlight the potential of the proposed framework in challenging high-dimensional applications where structure exists beyond individual data vectors.


Probabilistic Consistency in Machine Learning and Its Connection to Uncertainty Quantification

arXiv.org Artificial Intelligence

Machine learning (ML) is often viewed as a powerful data analysis tool that is easy to learn because of its black-box nature. Yet this very nature also makes it difficult to quantify confidence in predictions extracted from ML models, and more fundamentally, to understand how such models are mathematical abstractions of training data. The goal of this paper is to unravel these issues and their connections to uncertainty quantification (UQ) by pursuing a line of reasoning motivated by diagnostics. In such settings, prevalence (i.e., the fraction of elements in a given class) is often of inherent interest. Here we analyze the many interpretations of prevalence to derive a level-set theory of classification, which shows that certain types of self-consistent ML models are equivalent to class-conditional probability distributions. We begin by studying the properties of binary Bayes optimal classifiers, recognizing that their boundary sets can be reinterpreted as level-sets of pairwise density ratios. By parameterizing Bayes classifiers in terms of the prevalence, we then show that they satisfy important monotonicity and class-switching properties that can be used to deduce the density ratios without direct access to the boundary sets. Moreover, this information is sufficient for tasks such as constructing the multiclass Bayes-optimal classifier and estimating inherent uncertainty in the class assignments. In the multiclass case, we use these results to deduce normalization and self-consistency conditions, the latter being equivalent to the law of total probability for classifiers. We also show that these are necessary conditions for arbitrary ML models to have valid probabilistic interpretations. Throughout we demonstrate how this analysis informs the broader task of UQ for ML via an uncertainty propagation framework.
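
The level-set view of the binary Bayes classifier described above can be illustrated with a small sketch. It relies on the standard fact that, for prevalence q of class 1, the Bayes-optimal rule assigns class 1 when q·p1(x) > (1 − q)·p0(x), i.e. when the density ratio p1(x)/p0(x) exceeds (1 − q)/q. The two Gaussian class-conditional densities and all function names are hypothetical choices for illustration, not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density_ratio(x):
    """p1(x) / p0(x) for two assumed class-conditional densities.

    Hypothetical choice: class 1 ~ N(+1, 1), class 0 ~ N(-1, 1),
    for which the ratio simplifies to exp(2x).
    """
    return gaussian_pdf(x, 1.0, 1.0) / gaussian_pdf(x, -1.0, 1.0)


def bayes_classify(x, prevalence):
    """Binary Bayes rule: class 1 iff the density ratio exceeds (1 - q) / q."""
    threshold = (1.0 - prevalence) / prevalence
    return 1 if density_ratio(x) > threshold else 0


def switching_prevalence(x):
    """Prevalence q* at which the label of x switches from 0 to 1.

    Solving r(x) = (1 - q*) / q* gives q* = 1 / (1 + r(x)); conversely,
    observing q* recovers the density ratio as r(x) = (1 - q*) / q*.
    """
    return 1.0 / (1.0 + density_ratio(x))


if __name__ == "__main__":
    x = 0.5
    q_star = switching_prevalence(x)
    print("switching prevalence:", q_star)
    print("label just below q*:", bayes_classify(x, q_star - 0.01))
    print("label just above q*:", bayes_classify(x, q_star + 0.01))
```

Because the label of a fixed point changes monotonically from 0 to 1 as the prevalence increases, the switching point alone determines the density ratio at that point, which is the sense in which the ratio can be deduced without direct access to the boundary set.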


Joint Multi-Target Detection-Tracking in Cognitive Massive MIMO Radar via POMCP

arXiv.org Artificial Intelligence

This correspondence presents a power-aware cognitive radar framework for joint detection and tracking of multiple targets in a massive multiple-input multiple-output (MIMO) radar environment. Building on a previous single-target algorithm based on Partially Observable Monte Carlo Planning (POMCP), we extend it to the multi-target case by assigning each target an independent POMCP tree, enabling scalable and efficient planning. Departing from uniform power allocation, which is often suboptimal with varying signal-to-noise ratios (SNRs), our approach predicts each target's future angular position and expected received power based on its expected range. These predictions guide adaptive waveform design via a constrained optimization problem that allocates transmit energy to enhance the detectability of weaker or distant targets, while ensuring sufficient power for high-SNR targets. Simulations involving multiple targets with different SNRs confirm the effectiveness of our method. The proposed framework for the cognitive radar improves detection probability for low-SNR targets and achieves more accurate tracking compared to approaches using uniform or orthogonal waveforms. These results demonstrate the potential of the POMCP-based framework for adaptive, efficient multi-target radar systems.


Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

arXiv.org Artificial Intelligence

Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.


TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

arXiv.org Artificial Intelligence

C. VLA Post-training Framework in Simulation

After improving the computation of relative advantages and deriving the corresponding optimization objective, we integrated these components into a complete online reinforcement learning framework for VLA post-training in simulation. First, our overall framework trains a VLA model for a single task using reinforcement learning across multiple environments initialized with identical states. In this setup, the VLA executes the same task in parallel environments, sampling actions step by step until either one environment completes the task or all environments reach the maximum number of steps. During sampling, we incorporate the multistage reward function designed by the LLM described earlier, where each environment's observations provide the necessary object positions and robot state information required for reward computation. Because the trajectories terminate simultaneously, they all share the same length, which facilitates consistent grouping for subsequent processing. After collecting multiple trajectories, they are organized into a trajectory-level group, where relative advantages are computed within the group according to Eq. (3), yielding trajectory-level relative advantages. Similarly, since all trajectories terminate at the same timestep (ensuring that every step can be grouped), we extract step-level data across trajectories (e.g., rewards and log probabilities of actions), and group together steps at the same timestep to form step-level groups. Step-level relative advantages are then computed using Eq.
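
The two groupings described in this excerpt, one over whole trajectories and one over steps that share a timestep, can be sketched as follows. The excerpt does not reproduce Eq. (3), so the normalization used here (subtract the group mean, divide by the group standard deviation) is an assumption borrowed from the common group-relative formulation, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(values, eps=1e-8):
    """Normalize values within one group (assumed form, not the paper's Eq. (3))."""
    mu = mean(values)
    sigma = pstdev(values)
    # eps keeps the division defined when every value in the group is equal.
    return [(v - mu) / (sigma + eps) for v in values]


def trajectory_level_advantages(step_rewards):
    """One advantage per trajectory, computed from total trajectory reward.

    step_rewards[i][t] is the reward of trajectory i at timestep t.
    """
    returns = [sum(trajectory) for trajectory in step_rewards]
    return group_relative_advantages(returns)


def step_level_advantages(step_rewards):
    """One advantage per (trajectory, timestep), grouped by timestep.

    Requires equal-length trajectories, which holds here because all
    environments stop at the same timestep.
    """
    num_trajectories = len(step_rewards)
    num_steps = len(step_rewards[0])
    per_step = [
        group_relative_advantages([step_rewards[i][t] for i in range(num_trajectories)])
        for t in range(num_steps)
    ]
    # Transpose back so the result is indexed as [trajectory][timestep].
    return [[per_step[t][i] for t in range(num_steps)] for i in range(num_trajectories)]


if __name__ == "__main__":
    # Three parallel rollouts of the same task, two timesteps each (made-up rewards).
    rewards = [[1.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
    print(trajectory_level_advantages(rewards))
    print(step_level_advantages(rewards))
```

How the trajectory-level and step-level advantages are combined into the final training signal is not shown in this excerpt and is left out of the sketch.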


CoTune: Co-evolutionary Configuration Tuning

arXiv.org Artificial Intelligence

To automatically tune configurations for the best possible system performance (e.g., runtime or throughput), much work has focused on designing intelligent heuristics in a tuner. However, existing tuner designs have mostly ignored the presence of complex performance requirements (e.g., the latency shall ideally be 2 seconds), and simply assume that better performance is always preferred. This not only wastes valuable information in a requirement but might also consume extensive resources to tune for a goal with little gain. Yet, prior studies have shown that simply incorporating the requirement as a tuning objective is problematic, since the requirement might be too strict, harming convergence, or its highly diverse satisfactions might lead to premature convergence. In this paper, we propose CoTune, a tool that takes the information of a given target performance requirement into account through co-evolution. CoTune is unique in that it creates an auxiliary performance requirement to be co-evolved with the configurations, which assists the target performance requirement when it becomes ineffective or even misleading, hence allowing the tuning to be guided by the requirement while being robust to its harm. Experiment results on 162 cases (nine systems and 18 requirements) reveal that CoTune considerably outperforms existing tuners, ranking as the best for 90% of cases (against 0%-35% for other tuners) with up to 2.9x overall improvements, while doing so with much better efficiency.


Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory

arXiv.org Artificial Intelligence

Despite remarkable advances, large language models often fail at compositional reasoning tasks, a phenomenon exemplified by the "curse of two-hop reasoning". This paper introduces the Identity Bridge, a simple yet powerful mechanism that resolves this compositionality gap by supervising the model on a zero-hop identity task. We demonstrate empirically that this addition enables models to successfully perform out-of-distribution two-hop reasoning, a task at which they otherwise fail completely. To explain this phenomenon, we provide a theoretical analysis using a simplified Emb-MLP model, proving that identity supervision reshapes the model's latent geometry. We show this alignment is induced by an implicit nuclear-norm regularization during optimization, which favors low-rank solutions that share structure across tasks. For complex tasks, we use small initialization or weight decay to strengthen the regularization effect, which improves the latent space alignment and slows down the generalization decay. Finally, we extend our investigation to large-scale models, observing that they still achieve two-hop reasoning through the latent memory, which provides crucial inspiration for enhancing their implicit reasoning abilities.