Multi-modal topic inferencing from videos