Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates