Enhanced Visual Scene Understanding through Human-Robot Dialog