Can We Optimize Deep RL Policy Weights as Trajectory Modeling?