

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Neural Information Processing Systems

As with natural language models, researchers have proposed foundation models in genomics that learn generalizable features from unlabeled genome data and can then be fine-tuned for downstream tasks such as identifying regulatory elements. Because attention scales quadratically with sequence length, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution (i.e., individual DNA "characters"), the level at which subtle genetic variations such as single nucleotide polymorphisms (SNPs) can completely alter protein function. Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity.
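The two limitations named in the abstract, coarse k-mer tokens and a small context window, can be made concrete with a short sketch. Everything here is illustrative: the sequence is made up, the helper names are hypothetical, non-overlapping 6-mers are one assumed tokenization scheme, and the genome length of roughly 3.2 billion nucleotides is an approximation. It is not code from HyenaDNA or from any of the earlier models.

```python
def char_tokens(seq):
    """Single-nucleotide tokenization: one token per DNA character."""
    return list(seq)


def kmer_tokens(seq, k=6):
    """Non-overlapping fixed k-mer tokenization (assumed scheme; a trailing partial k-mer is kept)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def changed_fraction(a, b):
    """Fraction of token positions that differ between two equal-length token lists."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)


def context_percent(num_tokens, nucleotides_per_token=1, genome_len=3.2e9):
    """Percentage of the genome covered by one context window (genome_len is approximate)."""
    return 100.0 * num_tokens * nucleotides_per_token / genome_len


# Made-up 24-nucleotide reference and a variant with one SNP at index 7 (T -> A).
reference = "ACGTACGTACGTACGTACGTACGT"
variant = reference[:7] + "A" + reference[8:]

# Character level: exactly one of 24 tokens changes.
char_change = changed_fraction(char_tokens(reference), char_tokens(variant))

# 6-mer level: one of 4 tokens becomes a different vocabulary entry,
# so the model no longer sees "the same unit with one letter changed".
kmer_change = changed_fraction(kmer_tokens(reference), kmer_tokens(variant))

# A 4096-token window, read as single nucleotides or as 6-mers.
pct_chars = context_percent(4096)
pct_kmers = context_percent(4096, nucleotides_per_token=6)

print(char_change, kmer_change, pct_chars, pct_kmers)
```

Under these assumptions, both window sizes stay below the 0.001% figure quoted in the abstract, and the SNP alters a whole 6-mer token rather than a single character.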




M5: A Whole Genome Bacterial Encoder at Single Nucleotide Resolution

Egilsson, Agust

arXiv.org Artificial Intelligence

This report describes a linear attention mechanism that extends the context length of an encoder-only transformer, called M5, to millions of nucleotides, yielding a single-nucleotide-resolution foundation model pretrained on bacterial whole genomes. The linear attention mechanism closely approximates full quadratic attention and has a simple, lightweight implementation when the key-query embedding dimensionality is low. The M5-small model is trained and tested entirely on one A100 GPU with 40 GB of memory, on sequences of up to 196K nucleotides during training and 2M nucleotides during testing. We test the M5-small model and record notable performance improvements as whole-genome bacterial sequence lengths increase, and we demonstrate that the approximation of full multi-head attention remains stable as sequence length increases.
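The abstract does not spell out the M5 mechanism, so the sketch below shows only the general idea behind linear attention, using a generic kernelized formulation with an assumed feature map `phi(x) = elu(x) + 1`. It is not the M5 approximation, and all function names are hypothetical. The point it illustrates is that, once the scores factor as `phi(Q) phi(K)^T`, associativity lets the product be computed as `phi(Q) (phi(K)^T V)`, which avoids the n x n score matrix and costs time linear in sequence length n when the key-query dimension d is small.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumption for this sketch, not M5's choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def featurize(m):
    return [[phi(x) for x in row] for row in m]


def quadratic_form(q, k, v):
    """Materializes the full n x n score matrix: O(n^2) time and memory in n."""
    fq, fk = featurize(q), featurize(k)
    scores = matmul(fq, transpose(fk))            # n x n
    out = matmul(scores, v)                       # n x d_v
    return [[o / sum(s) for o in orow] for orow, s in zip(out, scores)]


def linear_form(q, k, v):
    """Same result via associativity: only d x d_v and d-sized summaries of K and V."""
    fq, fk = featurize(q), featurize(k)
    kv = matmul(transpose(fk), v)                 # d x d_v, independent of n
    ksum = [sum(col) for col in zip(*fk)]         # d-vector normalizer
    num = matmul(fq, kv)                          # n x d_v
    den = [sum(a * b for a, b in zip(row, ksum)) for row in fq]
    return [[x / d for x in nrow] for nrow, d in zip(num, den)]


# Tiny made-up inputs: n = 3 positions, d = 2 key-query dims, d_v = 2 value dims.
q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
k = [[0.7, 0.2], [-0.3, 0.4], [0.0, -0.6]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full = quadratic_form(q, k, v)
fast = linear_form(q, k, v)
max_diff = max(abs(a - b) for ra, rb in zip(full, fast) for a, b in zip(ra, rb))
print(max_diff)
```

The two functions agree up to floating-point rounding because they compute the same kernelized attention. How closely such a kernel matches softmax attention is a separate question, and that is the part the M5 report addresses with its own approximation.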