Fine-grained Spatiotemporal Grounding on Egocentric Videos