A Proofs of Theorems

Neural Information Processing Systems 

Under the assumption that the MDP is deterministic and all states are strongly connected, there exists at least one shortest state trajectory from s to g. We use the agent's trajectories to construct and update the adjacency matrix. Concretely, the adjacency matrix is initialized to an empty matrix at the beginning of training. Each time when the agent explores a new state that it has never visited before, the adjacency matrix is augmented by a new row and a new column with zero elements, representing the k-step adjacent relation between the new state and explored states. When the temporal distance between two states in one trajectory is not larger than k, then the corresponding element in the adjacency matrix will be labeled to 1, indicating the adjacency.