 shortest path problem


Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Neural Information Processing Systems

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and $O(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.
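The feedback model described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it only shows, under assumed data layouts, what the learner gets to see in one episode: the visited state-action pairs and a single cumulative loss. The names `run_episode`, `policy`, `transitions`, and `losses` are hypothetical and chosen for illustration.

```python
import random


def run_episode(policy, transitions, losses, horizon, start=0, rng=random):
    """Roll out one episode and return (trajectory, aggregate_loss).

    Assumed (hypothetical) layout:
      policy(h, state) -> action
      transitions[h][state][action] -> (next_states, probabilities)
      losses[h][state][action] -> loss in [0, 1]
    """
    state, trajectory, total = start, [], 0.0
    for h in range(horizon):
        action = policy(h, state)
        trajectory.append((h, state, action))
        # The per-step loss is accumulated but never revealed on its own.
        total += losses[h][state][action]
        next_states, probs = transitions[h][state][action]
        state = rng.choices(next_states, weights=probs)[0]
    # Aggregate bandit feedback: only the trajectory and the total are observed.
    return trajectory, total


# Toy instance: 2 steps, 2 states, 2 actions; action a leads to state a.
transitions = [[[([a], [1.0]) for a in range(2)] for _ in range(2)] for _ in range(2)]
losses = [
    [[0.2, 0.5], [0.0, 0.0]],  # step 0
    [[0.1, 0.9], [0.3, 0.4]],  # step 1
]
trajectory, aggregate = run_episode(lambda h, s: 1, transitions, losses, horizon=2)
```

In this toy instance the learner cannot tell from `aggregate` alone how the loss splits between the two steps, which is the difficulty that loss estimators for this setting have to address.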


Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem

arXiv.org Artificial Intelligence

In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algorithms. We further observe reliable performance of our function approximation algorithm compared to other algorithms in the function approximation setting.


APULSE: A Scalable Hybrid Algorithm for the RCSPP on Large-Scale Dense Graphs

arXiv.org Artificial Intelligence

The resource-constrained shortest path problem (RCSPP) is a fundamental NP-hard optimization challenge with broad applications, from network routing to autonomous navigation. This problem involves finding a path that minimizes a primary cost subject to a budget on a secondary resource. While various RCSPP solvers exist, they often face critical scalability limitations when applied to the large, dense graphs characteristic of complex, real-world scenarios, making them impractical for time-critical planning. This challenge is particularly acute in domains like mission planning for unmanned ground vehicles (UGVs), which demand solutions on large-scale terrain graphs. This paper introduces APULSE, a hybrid label-setting algorithm designed to efficiently solve the RCSPP on such challenging graphs. APULSE integrates a best-first search guided by an A* heuristic with aggressive, Pulse-style pruning mechanisms and a time-bucketing strategy for effective state-space reduction. The results demonstrate that APULSE consistently finds near-optimal solutions while being orders of magnitude faster and more robust, particularly on large problem instances where competing methods fail. This superior scalability establishes APULSE as an effective solution for RCSPP in complex, large-scale environments, enabling capabilities such as interactive decision support and dynamic replanning. The Resource-Constrained Shortest Path Problem (RCSPP) is a fundamental NP-hard optimization challenge with broad applications, from network routing and logistics to autonomous navigation [1].
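To make the problem statement concrete, here is a minimal sketch of a plain label-setting search for the RCSPP with dominance pruning. It is not APULSE: it has no A* heuristic, no Pulse-style pruning, and no time bucketing, and it is exact but only practical on small graphs. The function name `rcspp` and the adjacency-list layout are assumptions made for illustration; costs and resource consumptions are assumed non-negative.

```python
import heapq


def rcspp(graph, source, target, budget):
    """Return (cost, resource, path) for a cheapest path within budget, or None.

    graph: dict mapping node -> list of (neighbor, cost, resource),
    with non-negative cost and resource values (assumed layout).
    """
    # Each label is (cost, resource, node, path); the heap pops lowest cost first.
    heap = [(0, 0, source, (source,))]
    # expanded[node] holds (cost, resource) pairs of labels already expanded there.
    expanded = {}
    while heap:
        cost, res, node, path = heapq.heappop(heap)
        if node == target:
            return cost, res, list(path)
        # Skip a label dominated by one already expanded at this node.
        if any(c <= cost and r <= res for c, r in expanded.get(node, [])):
            continue
        expanded.setdefault(node, []).append((cost, res))
        for nxt, c, r in graph.get(node, []):
            if res + r <= budget:  # prune extensions that exceed the budget
                heapq.heappush(heap, (cost + c, res + r, nxt, path + (nxt,)))
    return None


graph = {
    "s": [("a", 1, 5), ("b", 2, 1), ("t", 10, 0)],
    "a": [("t", 1, 5)],
    "b": [("t", 3, 1)],
}
loose = rcspp(graph, "s", "t", budget=10)
tight = rcspp(graph, "s", "t", budget=5)
```

With a budget of 10 the cheapest route goes through `a`; tightening the budget to 5 makes that route infeasible and the search falls back to the more expensive route through `b`.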


ARGUS: A Framework for Risk-Aware Path Planning in Tactical UGV Operations

arXiv.org Artificial Intelligence

This thesis presents the development of ARGUS, a framework for mission planning for Unmanned Ground Vehicles (UGVs) in tactical environments. The system is designed to translate battlefield complexity and the commander's intent into executable action plans. To this end, ARGUS employs a processing pipeline that takes as input geospatial terrain data, military intelligence on existing threats and their probable locations, and mission priorities defined by the commander. Through a set of integrated modules, the framework processes this information to generate optimized trajectories that balance mission objectives against the risks posed by threats and terrain characteristics. A fundamental capability of ARGUS is its dynamic nature, which allows it to adapt plans in real time in response to unforeseen events, reflecting the fluid nature of the modern battlefield. The system's interoperability was validated in a practical exercise with the Portuguese Army, where it was successfully demonstrated that the routes generated by the model can be integrated and utilized by UGV control systems. The result is a decision support tool that not only produces an optimal trajectory but also provides the necessary insights for its execution, thereby contributing to greater effectiveness and safety in the employment of autonomous ground systems.


Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

arXiv.org Machine Learning

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve $O(\log T)$ regret in stochastic settings and ${O}(\sqrt{T})$ regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknown-transition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.


Contextual Linear Optimization with Bandit Feedback

Neural Information Processing Systems

We show a fast-rate regret bound for induced empirical risk minimization (IERM) that allows for misspecified model classes and flexible choices of the optimization estimate, and we develop computationally tractable surrogate losses.



Challenges in Applying Variational Quantum Algorithms to Dynamic Satellite Network Routing

arXiv.org Artificial Intelligence

The advent of large-scale Low Earth Orbit (LEO) satellite constellations, spearheaded by initiatives such as SpaceX's Starlink, Amazon's Project Kuiper, and OneWeb, is poised to revolutionize global connectivity (Saeed et al., 2020). By deploying thousands of interconnected satellites, these networks promise to deliver high-speed, low-latency internet access to every corner of the globe, including remote and underserved regions (Reddy et al., 2023). However, the very characteristics that enable this new paradigm, namely the massive scale and high orbital velocity of the satellites, introduce unprecedented challenges in network management (Hu, 2023). The network topology is in a constant state of flux, with inter-satellite links (ISLs) being established and terminated on a timescale of seconds, creating a highly dynamic and complex operational environment (Bhattacharjee et al., 2024). At the heart of managing these constellations lies the network routing problem: determining the optimal path for data packets to travel from a source to a destination (Zhang et al., 2025; Chen et al., 2021). In this dynamic context, the routing problem is far more complex than in terrestrial networks. It must account for time-varying latencies, intermittent link availability, and vast state spaces.