optimal value
Fast Rates for Inverse Reinforcement Learning
Schlaginhaufen, Andreas, Kamgarpour, Maryam
We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
Bias in Evaluation Processes: An Optimization-Based Model
Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.
161882dd2d19c716819081aee2c08b98-Supplemental.pdf
We restate the theoretical statements and the algorithms here for completeness and convenience. Given a normalized monotone submodular function f:2 V! The objective aims to find the partition such that the minimum-valued block in the partition is maximized. For a ground set V and its elements (v1,v2,...,v n) coming in an arbitrary streaming order, the output solution of Alg. 1 has The output solution of Alg. 2 has We construct a set cover function as the tight example for Corollary 1. We illustrate the set cover function graphically in Figure 1. The circles are the areas to cover for the set cover function and the green inner circles and the red triangles are elements in the ground set (the outer yellow circles are not elements). The inner circles (green) largely overlap with the outer circles (yellow).
0cddb777d3441326544e21b67f41bdc8-Supplemental-Conference.pdf
In this section, we prove the Theorem 2.1, which states a problem P and its' orthogonal transformed problem Q(P) = {{Qxi}Ni=1,f}have identical optimal solutions if Qis orthogonal matrix: QQT = QTQ = I. As we mentioned in Section 2.2, reward R is a function of a1:T (solution sequences), ||xi xj||i,j {1,...N} (relative distances) and f (nodes features). And Let R (P)is optimal value of problem P: i.e. Then, the remaining proof is to show Q(P)has an identical solution set with P. Let optimal solution set ฮ (P) = {ฯi(P)}Mi=1, where ฯi(P)indicates optimal solution of P and M is the number of heterogeneous optimal solution. Conversely, For any ฯi(P) ฮ (P), they have sample optimal value with Q(P): R(ฯi(P);P) = R (P) = R (Q(P)) Thus, ฯi(P) ฮ (Q(P)).
Large-Scale Price Optimization via Network Flow
This paper deals with price optimization, which is to find the best pricing strategy that maximizes revenue or profit, on the basis of demand forecasting models. Though recent advances in regression technologies have made it possible to reveal price-demand relationship of a large number of products, most existing price optimization methods, such as mixed integer programming formulation, cannot handle tens or hundreds of products because of their high computational costs. To cope with this problem, this paper proposes a novel approach based on network flow algorithms. We reveal a connection between supermodularity of the revenue and cross elasticity of demand. On the basis of this connection, we propose an efficient algorithm that employs network flow algorithms. The proposed algorithm can handle hundreds or thousands of products, and returns an exact optimal solution under an assumption regarding cross elasticity of demand. Even if the assumption does not hold, the proposed algorithm can efficiently find approximate solutions as good as other state-of-the-art methods, as empirical results show.
Learning Generalized Linear Programming Value Functions
We develop a theoretically-grounded learning method for the Generalized Linear Programming Value Function (GVF), which models the optimal value of a linear programming (LP) problem as its objective and constraint bounds vary. This function plays a fundamental role in algorithmic techniques for large-scale optimization, particularly in decomposition for two-stage mixed-integer linear programs (MILPs). This paper establishes a structural characterization of the GVF that enables it to be modeled as a particular neural network architecture, which we then use to learn the GVF in a way that benefits from three notable properties. First, our method produces a true under-approximation of the value function with respect to the constraint bounds. Second, the model is input-convex in the constraint bounds, which not only matches the structure of the GVF but also enables the trained model to be efficiently optimized over using LP. Finally, our learning method is unsupervised, meaning that training data generation does not require computing LP optimal values, which can be prohibitively expensive at large scales. We numerically show that our method can approximate the GVF well, even when compared to supervised methods that collect training data by solving an LP for each data point. Furthermore, as an application of our framework, we develop a fast heuristic method for large-scale two-stage MILPs with continuous second-stage variables, via a compact reformulation that can be solved faster than the full model linear relaxation at large scales and orders of magnitude faster than the original model.