MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos