d1e7b08bdb7783ed4fb10abe92c22ffd-AuthorFeedback.pdf
After the k trajectories, one best trajectory is extracted by running without the exploration bonus, and that trajectory is "distilled" into the policy by performing a gradient update to increase its probability. The above work on solving combinatorial optimization problems using RL is based on the premise that there is room for improvement over traditional solvers. Please also note that the specific algorithm suggested is very similar to our "full bandit" baseline.
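The distillation step described above (a gradient update that increases the probability of the single best trajectory) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the reviewed algorithm: the policy is taken to be an independent softmax over actions at each time step, and the names `softmax`, `trajectory_log_prob`, `distill_step`, and the learning rate `lr` are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one row of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def trajectory_log_prob(logits, trajectory):
    # Log-probability of an action sequence under a per-step softmax policy
    # (assumed policy form; the source does not specify one).
    return sum(math.log(softmax(logits[t])[a]) for t, a in enumerate(trajectory))

def distill_step(logits, trajectory, lr=0.5):
    # One gradient-ascent step on log p(trajectory).
    # For a softmax, d log p(a) / d logit_i = 1[i == a] - p_i.
    new_logits = []
    for t, row in enumerate(logits):
        row = list(row)
        if t < len(trajectory):
            probs = softmax(row)
            a = trajectory[t]
            row = [x + lr * ((1.0 if i == a else 0.0) - p)
                   for i, (x, p) in enumerate(zip(row, probs))]
        new_logits.append(row)
    return new_logits

# Toy setup: 4 time steps, 3 actions, uniform initial policy.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
best_trajectory = [2, 0, 1, 2]  # hypothetical "best" trajectory

before = trajectory_log_prob(logits, best_trajectory)
logits = distill_step(logits, best_trajectory)
after = trajectory_log_prob(logits, best_trajectory)
print(before, after)
```

In this toy setting the log-probability of the chosen trajectory rises after the update, which is the only property the sketch is meant to show; how the best trajectory is selected is left out.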
Learning to Mutate with Hypergradient Guided Population
To address the above challenges, we propose a novel hyperparameter mutation (HPM) scheduling algorithm in this study, which adopts a population based training framework to explicitly learn a trade-off (i.e., a mutation schedule) between using the hypergradient-guided local search and the mutation-driven global search.