How well do Large Language Models perform in Arithmetic tasks?