Getting the most out of your tokenizer for pre-training and domain adaptation