Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning