Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes

Neural Information Processing Systems 

An alternative is to randomize patients' treatments at each stage based on the previous decisions and observed outcomes; for instance, one popular strategy is known as the sequential multiple assignment randomized trail (SMART) [