Episodic Memory Representation for Long-form Video Understanding