
Discount Rate


Non-stationary and Varying-discounting Markov Decision Processes for Reinforcement Learning

Chen, Zhizuo, Allen, Theodore T.

arXiv.org Machine Learning

Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and transitions. Infinite-horizon, stationary MDPs emerge as special cases of NVMDPs for identifying an optimal policy, and finite-horizon MDPs are also subsumed within the NVMDP formulations. Moreover, NVMDPs provide a flexible mechanism to shape optimal policies, without altering the state space, action space, or the reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation, optimality conditions, and policy improvement under finite state and action spaces. Building on these results, we adapt dynamic programming and generalized Q-learning algorithms to NVMDPs, along with formal convergence proofs. For problems requiring function approximation, we extend the Policy Gradient Theorem and the policy improvement bound in Trust Region Policy Optimization (TRPO), offering proofs in both scalar and matrix forms. Empirical evaluations in a non-stationary gridworld environment demonstrate that NVMDP-based algorithms successfully recover optimal trajectories under multiple reward and discounting schemes, whereas original Q-learning fails. These results collectively show that NVMDPs provide a theoretically sound and practically effective framework for reinforcement learning, requiring only minor algorithmic modifications while enabling robust handling of non-stationarity and explicit optimal policy shaping.
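The varying-discount idea can be sketched as a tiny tabular Q-learning update in which the discount factor depends on the time step. This is an illustrative sketch, not the paper's exact NVMDP algorithm; the `gamma_of_t` schedule and the time-indexed Q-table are assumptions made for the example.

```python
import numpy as np

def varying_discount_q_update(Q, t, s, a, r, s_next, gamma_of_t, alpha=0.1):
    """One tabular Q-learning step with a time-dependent discount rate.

    Q is indexed by (time, state, action) because values in a
    non-stationary MDP may depend on the time step itself.
    """
    target = r + gamma_of_t(t) * Q[t + 1, s_next].max()
    Q[t, s, a] += alpha * (target - Q[t, s, a])
    return Q

# Toy usage: 3 time steps, 2 states, 2 actions, discount decaying with t.
Q = np.zeros((4, 2, 2))
gamma = lambda t: 0.9 / (1 + t)   # hypothetical varying-discount schedule
Q = varying_discount_q_update(Q, t=0, s=0, a=1, r=1.0, s_next=1, gamma_of_t=gamma)
```

The only change relative to standard Q-learning is that both the discount factor and the Q-table carry a time index, which is what lets the same machinery absorb non-stationarity.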



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance. The paper presents an algorithm that achieves optimal regret for sellers in posted-price auctions with strategic buyers. The intuition behind the definition of regret is not clear enough: what does a small regret mean for the seller? There should be more elaboration on this intuition. The paper is well-written, with proofs and theorems clearly stated.
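For readers with the same question, one common formalization of seller regret in posted-price auctions (a standard textbook-style definition, not necessarily the exact one used in the reviewed paper) compares realized revenue against the best fixed posted price in hindsight:

```latex
\mathrm{Regret}(T)
  = \underbrace{\max_{p \ge 0} \sum_{t=1}^{T} p \,\mathbb{1}\{p \le v_t\}}_{\text{revenue of the best fixed price in hindsight}}
  \;-\;
    \underbrace{\sum_{t=1}^{T} p_t \,\mathbb{1}\{\text{buyer accepts } p_t\}}_{\text{seller's realized revenue}}
```

where $v_t$ is the buyer's valuation in round $t$ and $p_t$ the posted price. Under this reading, small regret means the seller earns nearly as much as if the single best price had been known in advance.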


The Paradox of Doom: Acknowledging Extinction Risk Reduces the Incentive to Prevent It

Growiec, Jakub, Prettner, Klaus

arXiv.org Artificial Intelligence

We investigate the salience of extinction risk as a source of impatience. Our framework distinguishes between human extinction risk and individual mortality risk while allowing for various degrees of intergenerational altruism. Additionally, we consider the evolutionarily motivated "selfish gene" perspective. We find that the risk of human extinction is an indispensable component of the discount rate, whereas individual mortality risk can be hedged against - partially or fully, depending on the setup - through human reproduction. Overall, we show that in the face of extinction risk, people become more impatient rather than more farsighted. Thus, the greater the threat of extinction, the less incentive there is to invest in avoiding it. Our framework can help explain why humanity consistently underinvests in mitigation of catastrophic risks, ranging from climate change mitigation, via pandemic prevention, to addressing the emerging risks of transformative artificial intelligence.
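The mechanism can be seen in a standard expected-utility formulation with a constant extinction hazard (a textbook-style illustration, not the paper's exact model). If humanity survives to time $t$ with probability $e^{-\lambda t}$, then

```latex
\mathbb{E}[U]
  \;=\; \int_0^\infty e^{-\lambda t}\, e^{-\rho t}\, u(c_t)\, dt
  \;=\; \int_0^\infty e^{-(\rho + \lambda)\, t}\, u(c_t)\, dt ,
```

so the extinction hazard $\lambda$ adds directly to the pure rate of time preference $\rho$: a higher perceived risk of extinction raises the effective discount rate and makes long-run payoffs, including the payoff of preventing extinction itself, worth less today.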



Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing

Wu, Yanran, Hua, Inez, Ding, Yi

arXiv.org Artificial Intelligence

Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessments often overlook where and when water stress is more severe. To fill this gap, we present SCARF, the first general framework that evaluates the water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at https://github.com/jojacola/SCARF.
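The core idea of a stress-weighted metric can be sketched in a few lines. The weighting scheme below is an illustrative assumption, not SCARF's exact AWI formula; the paper's repository has the real implementation.

```python
def adjusted_water_impact(consumption, stress):
    """Hypothetical AWI-style metric: water use weighted by local stress.

    consumption[i] is the volume consumed in period i at some location,
    and stress[i] is a dimensionless water-stress weight for that
    place and time. Summing the products penalizes consumption that
    happens where and when water is scarce.
    """
    return sum(c * w for c, w in zip(consumption, stress))

# Same total volume, different stress profiles -> different impact.
low  = adjusted_water_impact([10, 10], [0.2, 0.2])   # water-rich region
high = adjusted_water_impact([10, 10], [1.5, 1.5])   # water-stressed region
```

The point of the example is that two workloads with identical raw consumption can differ sharply in weighted impact, which is what makes location and time choices an optimization lever.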


Function-Coherent Gambles

Wheeler, Gregory

arXiv.org Artificial Intelligence

The desirable gambles framework provides a foundational approach to imprecise probability theory but relies heavily on linear utility assumptions. This paper introduces {\em function-coherent gambles}, a generalization that accommodates non-linear utility while preserving essential rationality properties. We establish core axioms for function-coherence and prove a representation theorem that characterizes acceptable gambles through continuous linear functionals. The framework is then applied to analyze various forms of discounting in intertemporal choice, including hyperbolic, quasi-hyperbolic, scale-dependent, and state-dependent discounting. We demonstrate how these alternatives to constant-rate exponential discounting can be integrated within the function-coherent framework. This unified treatment provides theoretical foundations for modeling sophisticated patterns of time preference within the desirability paradigm, bridging a gap between normative theory and observed behavior in intertemporal decision-making under genuine uncertainty.
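The discounting alternatives mentioned in the abstract correspond to simple discount-weight functions. These are the standard textbook forms; the parameter values are illustrative, and the scale- and state-dependent variants are omitted since they need extra structure.

```python
import math

def exponential(t, r=0.1):
    """Constant-rate exponential discounting: d(t) = exp(-r t)."""
    return math.exp(-r * t)

def hyperbolic(t, k=0.1):
    """Hyperbolic discounting: d(t) = 1 / (1 + k t)."""
    return 1.0 / (1.0 + k * t)

def quasi_hyperbolic(t, beta=0.7, delta=0.95):
    """Quasi-hyperbolic (beta-delta) discounting: d(0) = 1, d(t) = beta * delta**t."""
    return 1.0 if t == 0 else beta * delta ** t
```

At long horizons the hyperbolic weight decays much more slowly than the exponential one (e.g. at t = 50 with the defaults, 1/6 versus e^{-5}), which is the pattern of time preference the exponential model cannot capture.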


RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Hisaki, Yukinari, Ono, Isao

arXiv.org Artificial Intelligence

In this paper, we propose an off-policy deep reinforcement learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning (Abounadi et al., 2001), (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of Reset Cost enabling the application of the average reward criterion to tasks with termination.

These methods utilize the discounted reward criterion, which is applicable to a variety of MDP-formulated tasks (Puterman, 1994). In particular, for continuing tasks where there is no natural breakpoint in episodes, such as robot locomotion (Todorov et al., 2012) or Access Control Queuing Tasks (Sutton & Barto, 2018), where the interaction between an agent and an environment can continue indefinitely, the discount rate plays a role in keeping the infinite-horizon return bounded. However, discounting introduces an undesirable effect in continuing tasks by prioritizing rewards closer to the current time over those in the future. An approach to mitigate this effect is to bring the discount rate closer to 1, but it is commonly known that a large discount rate can lead to instability and slower convergence (Fujimoto et al., 2018; Dewanto & Gallagher, 2021).
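The RVI Q-learning building block can be sketched in tabular form. This is a sketch of the critic update that RVI-SAC builds on, not the full deep RL algorithm; the choice of reference pair and step size is illustrative.

```python
import numpy as np

def rvi_q_update(Q, s, a, r, s_next, ref=(0, 0), alpha=0.1):
    """One tabular RVI Q-learning step (average-reward criterion).

    Instead of discounting, the value of a fixed reference state-action
    pair, f(Q) = Q[ref], is subtracted from the target. This term acts
    as a running estimate of the average reward and keeps the Q-table
    bounded without any discount factor.
    """
    target = r - Q[ref] + Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage: 2 states, 2 actions.
Q = np.zeros((2, 2))
Q = rvi_q_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Comparing this to the standard discounted target r + γ·max Q(s', ·) makes the trade-off in the abstract concrete: the reference subtraction replaces the discount factor entirely, so there is no γ to push toward 1.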


Time preference, wealth and utility inequality: A microeconomic interaction and dynamic macroeconomic model connection approach

Kato, Takeshi

arXiv.org Artificial Intelligence

Based on interactions between individuals and others and references to social norms, this study reveals the impact of heterogeneity in time preference on wealth distribution and inequality. We present a novel approach that connects the interactions between microeconomic agents that generate heterogeneity to the dynamic equations for capital and consumption in macroeconomic models. Using this approach, we estimate the impact of changes in the discount rate due to microeconomic interactions on capital, consumption, and utility, and on the degree of inequality. The results show that intercomparisons with others regarding consumption significantly affect capital, i.e. wealth inequality. Furthermore, the impact on utility is never small, and social norms can reduce this impact. Our supporting evidence shows that the quantitative results of the inequality calculations correspond to survey data from cohort and cross-cultural studies. This study's micro-macro connection approach can be deployed to connect microeconomic interactions, such as exchange, interest and debt, redistribution, mutual aid and time preference, to dynamic macroeconomic models.
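A standard Ramsey-type formulation illustrates the kind of micro-macro connection the abstract describes (illustrative only; the paper's exact dynamic equations may differ). Each agent $i$'s interaction-shaped discount rate $\rho_i$ enters the capital accumulation and consumption Euler equations:

```latex
\dot{k}_i = f(k_i) - c_i - \delta k_i ,
\qquad
\frac{\dot{c}_i}{c_i} = \frac{f'(k_i) - \rho_i - \delta}{\theta} ,
```

so a higher $\rho_i$ (greater impatience) lowers the steady-state capital stock implied by $f'(k_i^*) = \rho_i + \delta$, and heterogeneity in $\rho_i$ across interacting agents translates directly into dispersion in long-run wealth.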


Consensus group decision making under model uncertainty with a view towards environmental policy making

Koundouri, Phoebe, Papayiannis, Georgios I., Petracou, Electra V., Yannacopoulos, Athanasios N.

arXiv.org Artificial Intelligence

Group decision making is an important field with interesting applications in various disciplines, among them environmental economics. Group decision making often requires that all or the majority of agents in the group agree to a single proposal or opinion, i.e. consensus. This is particularly true in cases where there is no coercion involved in the implementation of the decision made, so that the implementation of the decision depends on the good will, or rather the acceptance of the common decision by all members of the group. To make the discussion more concrete we consider the following generic situation: Assume that a group of agents, G, has to reach a common decision concerning policies regarding a future contingency X. Policies may refer, for instance, to the cost of abatement measures for protection against X, which clearly requires the acceptance of a commonly acceptable estimate for the value of X by every member of the group, as well as the acceptance of a commonly acceptable discount factor. Typically, different members of the group will have different valuations for X, and will therefore report different costs for the adverse effects of X. Moreover, different members of the group will have different discount rates for calculating the present value of the future adverse effect X.
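The effect of heterogeneous discount rates is easy to make concrete with standard present-value arithmetic (the group members and rates below are hypothetical, chosen only to illustrate the spread):

```python
def present_value(cost, rate, horizon):
    """Discounted present value of a future cost arriving `horizon` years from now."""
    return cost / (1.0 + rate) ** horizon

# Hypothetical group: identical estimated damage X, different discount rates.
members = {"A": 0.01, "B": 0.03, "C": 0.07}
valuations = {m: present_value(1000.0, r, horizon=50) for m, r in members.items()}
```

Even with full agreement on the size of the future damage, the three members value it today at roughly 608, 228, and 34 monetary units respectively, which is precisely why a commonly acceptable discount factor is as contested as the estimate of X itself.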