Online Convex Optimization with Continuous Switching Constraint
In many sequential decision-making applications, changing the decision incurs an additional cost, such as the wear-and-tear cost associated with changing server status. To control the switching cost, we introduce the problem of online convex optimization with continuous switching constraint, where the goal is to achieve a small regret given a budget on the overall switching cost. We first investigate the hardness of the problem, and provide a lower bound of order Ω(√T) when the switching cost budget S = Ω(√T), and Ω(min{T/S, T}) when S = O(√T), where T is the time horizon. The essential idea is to carefully design an adaptive adversary, who can adjust the loss function according to the cumulative switching cost of the player incurred so far, based on the orthogonal technique. We then develop a simple gradient-based algorithm which enjoys the minimax optimal regret bound.
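One simple way to respect a switching budget is blocking: hold the decision fixed within a block of rounds and update only between blocks, so the total movement is bounded by the number of blocks times the domain diameter. The sketch below is a minimal illustration of such a blocked online gradient descent heuristic, not necessarily the paper's exact algorithm; the `grad_fn` interface, the step size, and the projection radius are assumptions made for illustration.

```python
import numpy as np

def blocked_ogd(grad_fn, T, S, dim, eta=0.1, radius=1.0):
    """Blocked online gradient descent sketch: the decision is held fixed
    inside each block, so the continuous switching cost (total movement)
    is at most (#blocks) * (domain diameter) = O(S * radius).
    grad_fn(t, x) is an assumed interface returning a (sub)gradient of
    the round-t loss at x."""
    num_blocks = max(1, int(S))          # ~S updates => switching cost O(S)
    block_len = max(1, T // num_blocks)  # rounds spent at each decision
    x = np.zeros(dim)
    g_sum = np.zeros(dim)
    decisions = []
    for t in range(T):
        decisions.append(x.copy())
        g_sum += grad_fn(t, x)
        if (t + 1) % block_len == 0:
            x = x - eta * g_sum / block_len  # one averaged update per block
            nrm = np.linalg.norm(x)
            if nrm > radius:                 # project back onto the ball
                x *= radius / nrm
            g_sum[:] = 0.0
    return decisions
```

With ~S blocks over a domain of bounded diameter, the cumulative movement is O(S), so the budget constraint is met by construction at the price of updating less often.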
In the deterministic setting, where the data is given without any probabilistic assumptions, significant advances in DP linear regression have been made [77, 57, 68, 16, 7, 83, 31, 67, 82, 71]. In the randomized setting, where each example {x_i, y_i} is drawn i.i.d., we explain the closely related works in Section 2.3, with analysis when the covariance matrix has a spectral gap. The resulting utility guarantees are the same as those from [23], which are discussed in Section 2.3. When privacy is not required, we know from Theorem 2.2 that under Assumptions A.1-A.3, we can achieve an error rate of O(κ√(V/n)).
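As a concrete reference point for the randomized setting, the sketch below implements one standard approach to DP linear regression, sufficient-statistics perturbation with the Gaussian mechanism; it is not necessarily the estimator analyzed here, and the clipping bounds, budget split, and noise calibration are illustrative assumptions rather than a verified privacy accounting.

```python
import numpy as np

def dp_linreg_ssp(X, y, eps, delta, x_bound=1.0, y_bound=1.0, rng=None):
    """Sketch of DP linear regression via sufficient-statistics
    perturbation: release noisy versions of X^T X and X^T y, then solve
    the noisy normal equations. x_bound / y_bound are assumed per-example
    clipping levels (illustrative, not from the text)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    # Clip each row so a single example has bounded influence.
    scale = np.maximum(np.linalg.norm(X, axis=1, keepdims=True) / x_bound, 1.0)
    Xc = X / scale
    yc = np.clip(y, -y_bound, y_bound)
    # L2 sensitivities of the two statistics under a one-example change.
    sens_xx = x_bound ** 2
    sens_xy = x_bound * y_bound
    # Gaussian-mechanism noise level, splitting the budget in half.
    sigma = np.sqrt(2 * np.log(1.25 / delta)) / (eps / 2)
    E = rng.normal(0.0, sigma * sens_xx, size=(d, d))
    noisy_xx = Xc.T @ Xc + (E + E.T) / np.sqrt(2)  # symmetrized noise
    noisy_xy = Xc.T @ yc + rng.normal(0.0, sigma * sens_xy, size=d)
    # Light regularization keeps the noisy Gram matrix invertible.
    return np.linalg.solve(noisy_xx + 1e-3 * np.eye(d), noisy_xy)
```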
Appendices
A Bernoulli-CRS Properties
Let us define K ∈ ℝ^{n×n} to be a random diagonal sampling matrix with K_{j,j} ∼ Bernoulli(p_j) for 1 ≤ j ≤ n. When the probabilities are chosen so that Σ_{j=1}^{n} p_j = k, the expected number of sampled column-row pairs is k; therefore, Bernoulli-CRS will perform on average the same amount of computation as fixed-rank CRS with k samples. This formulation immediately hints at the possibility of sampling over the input-channel dimension, similarly to sampling column-row pairs in matrices. Let ℓ be a β-Lipschitz loss function, and let the network be trained with SGD using a properly decreasing learning rate. Let us denote the weight, bias, and activation gradients with respect to the loss function ℓ at layer l by ∇W_l, ∇b_l, and ∇a_l, respectively.
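The sampling matrix just defined can be implemented directly: each column-row pair j is kept with probability p_j and rescaled by 1/p_j, which makes the product estimate unbiased, E[A K diag(1/p) B] = AB. The sketch below is a minimal numpy illustration under these definitions; the function name and the uniform choice p_j = 0.25 in the usage example are assumptions for illustration.

```python
import numpy as np

def bernoulli_crs_matmul(A, B, p, rng=None):
    """Bernoulli column-row sampling (Bernoulli-CRS) sketch for
    approximate matrix multiplication: column j of A and row j of B are
    kept with probability p[j] and rescaled by 1/p[j], so the estimate
    is unbiased."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    keep = rng.random(n) < p            # K_{j,j} ~ Bernoulli(p_j)
    idx = np.nonzero(keep)[0]
    # Only the sampled column-row pairs contribute to the product.
    return (A[:, idx] / p[idx]) @ B[idx, :]

# With p_j = k/n for all j, the expected number of sampled pairs is k,
# matching the average cost of fixed-rank CRS with k samples.
A = np.random.randn(64, 256)
B = np.random.randn(256, 32)
approx = bernoulli_crs_matmul(A, B, p=np.full(256, 0.25))
```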