d1e7b08bdb7783ed4fb10abe92c22ffd-AuthorFeedback.pdf

Neural Information Processing Systems 

After the k trajectories, one best trajectory is extracted by running without the exploration bonus, and that trajectory is "distilled" into the policy by performing a gradient update to increase its probability. The above work on solving combinatorial optimization problems using RL is based on the premise that there is room for improvement over traditional solvers. Please also note that the specific algorithm suggested is very similar to our "full bandit" baseline.
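The distillation step mentioned above, a gradient update that raises the probability of the best trajectory, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the feedback: the policy is a hypothetical table of per-step softmax logits, the names `distill_step` and `log_prob` are invented, and the step size is arbitrary. How the best trajectory is selected and how the exploration bonus works are not shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, trajectory):
    """Log-probability of an action sequence under a per-step softmax policy."""
    return sum(math.log(softmax(logits[t])[a]) for t, a in enumerate(trajectory))

def distill_step(logits, trajectory, lr=0.5):
    """One gradient-ascent step on log pi(trajectory).

    For a softmax policy, d log pi(a) / d logit_j = 1[j == a] - pi(j).
    The table of logits is a hypothetical stand-in for the real policy.
    """
    new_logits = []
    for t, row in enumerate(logits):
        if t < len(trajectory):
            probs = softmax(row)
            a = trajectory[t]
            row = [x + lr * ((1.0 if j == a else 0.0) - probs[j])
                   for j, x in enumerate(row)]
        new_logits.append(list(row))
    return new_logits

# Toy setup (assumed): 4 decision steps, 3 actions each, uniform initial policy.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
best_trajectory = [2, 0, 1, 1]  # assumed to be the best trajectory found

before = log_prob(logits, best_trajectory)
updated = distill_step(logits, best_trajectory)
after = log_prob(updated, best_trajectory)
print(before, after)
```

In this toy setting the log-probability of the chosen trajectory is higher after the update than before it, which is the only property the sketch is meant to show.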
