Out of the BLEU: how should we assess quality of the Code Generation models?