Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
–Neural Information Processing Systems
Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt the tok-enization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient decent.
Neural Information Processing Systems
Feb-16-2026, 00:20:05 GMT