Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU