AITopics

Industry: Health & Medicine > Therapeutic Area (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.38)

Neural Information Processing Systems, Feb-17-2026, 07:36:39 GMT

a9619dd0f0d54a5cf7734add1dc38cd1-Paper-Conference.pdf

artificial intelligence, machine learning, sequence, (19 more...)

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre:

Research Report > Experimental Study (1.00)
Workflow (0.69)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

arXiv.org Artificial Intelligence, Oct-30-2025

Beyond Leakage and Complexity: Towards Realistic and Efficient Information Cascade Prediction

Peng, Jie, Wang, Rui, Wang, Qiang, Wei, Zhewei, Tong, Bin, Wang, Guan

Information cascade popularity prediction is a key problem in analyzing content diffusion in social networks. However, existing work suffers from three critical limitations: (1) temporal leakage in current evaluation, where random cascade-based splits allow models to access future information and yield unrealistic results; (2) feature-poor datasets that lack downstream conversion signals (e.g., likes, comments, or purchases), which limits practical applications; and (3) computational inefficiency of complex graph-based methods that require days of training for marginal gains. We systematically address these challenges from three perspectives: task setup, dataset construction, and model design. First, we propose a time-ordered splitting strategy that chronologically partitions data into consecutive windows, ensuring models are evaluated on genuine forecasting tasks without future information leakage. Second, we introduce Taoke, a large-scale e-commerce cascade dataset featuring rich promoter/product attributes and ground-truth purchase conversions, capturing the complete diffusion lifecycle from promotion to monetization. Third, we develop CasTemp, a lightweight framework that efficiently models cascade dynamics through temporal walks, Jaccard-based neighbor selection for inter-cascade dependencies, and GRU-based encoding with time-aware attention. Under leak-free evaluation, CasTemp achieves state-of-the-art performance across four datasets with orders-of-magnitude speedup. Notably, it excels at predicting second-stage popularity conversions, a practical task critical for real-world applications.
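The time-ordered splitting idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the `(cascade_id, timestamp)` record layout, the function name `time_ordered_split`, and the default fractions are all assumptions made for illustration. The only property it demonstrates is that every training cascade starts no later than every validation cascade, which in turn starts no later than every test cascade.

```python
from typing import List, Tuple

# Hypothetical record layout: (cascade_id, publication_timestamp).
Cascade = Tuple[str, float]


def time_ordered_split(
    cascades: List[Cascade],
    train_frac: float = 0.7,
    val_frac: float = 0.15,
) -> Tuple[List[Cascade], List[Cascade], List[Cascade]]:
    """Partition cascades into consecutive chronological windows.

    Unlike a random split, no cascade in an earlier window starts after
    a cascade in a later window, so the test window is strictly "future"
    relative to training.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test window")

    # Sort by timestamp so that slicing yields consecutive time windows.
    ordered = sorted(cascades, key=lambda c: c[1])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)

    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test
```

A random split would instead shuffle `cascades` before slicing, which is how cascades published after a test item can end up in the training set.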

artificial intelligence, data mining, machine learning, (20 more...)

2510.25348

Country: Asia > China (0.15)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.69)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems, Oct-10-2025, 12:42:51 GMT

a9619dd0f0d54a5cf7734add1dc38cd1-Paper-Conference.pdf

dataset, promoter, sequence, (16 more...)

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre:

Research Report > Experimental Study (1.00)
Workflow (0.69)
Research Report > New Finding (0.68)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

arXiv.org Artificial Intelligence, Sep-24-2025

Reverse-Complement Consistency for DNA Language Models

Ma, Mingqian

A fundamental property of DNA is that the reverse complement (RC) of a sequence often carries identical biological meaning. However, state-of-the-art DNA language models frequently fail to capture this symmetry, producing inconsistent predictions for a sequence and its RC counterpart, which undermines their reliability. In this work, we introduce Reverse-Complement Consistency Regularization (RCCR), a simple and model-agnostic fine-tuning objective that directly penalizes the divergence between a model's prediction on a sequence and the aligned prediction on its reverse complement. We evaluate RCCR across three diverse backbones (Nucleotide Transformer, HyenaDNA, DNABERT-2) on a wide range of genomic tasks, including sequence classification, scalar regression, and profile prediction. Our experiments show that RCCR substantially improves RC robustness by dramatically reducing prediction flips and errors, all while maintaining or improving task accuracy compared to baselines such as RC data augmentation and test-time averaging. By integrating a key biological prior directly into the learning process, RCCR produces a single, intrinsically robust, and computationally efficient model fine-tuning recipe for diverse biology tasks.

DNA language models (DNA LMs) (Zhou et al., 2024; Dalla-Torre et al., 2025; Nguyen et al., 2023; Ma et al., 2025) have become general-purpose backbones for genomic prediction and sequence design: after pretraining on raw genomes, a single backbone can be fine-tuned for diverse downstream tasks. Many of these tasks possess an explicit symmetry: labels are reverse-complement (RC) invariant at the sequence level (e.g., promoter classification), or RC equivariant at the profile level, where outputs must be aligned by a task-specific operator Π (e.g., bin-wise outputs should be flipped along the sequence length axis, and strand channels swapped when present). Yet standard fine-tuning pipelines neither encode RC symmetry nor evaluate it systematically, leaving models sensitive to input orientation.
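To make the sequence-level RC invariance described above concrete, here is a small sketch of a reverse-complement operation and a consistency penalty between a model's prediction on a sequence and on its reverse complement. The abstract does not state which divergence RCCR uses, so the symmetric KL divergence here is an assumption, as are the names `reverse_complement`, `symmetric_kl`, and `rc_consistency_penalty`. The two toy "models" are hypothetical stand-ins for a fine-tuned classifier, and the profile-level alignment operator Π is not covered.

```python
import math
from typing import Callable, List

# Complement table for the four canonical bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetric_kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Symmetrised KL divergence between two discrete distributions.

    Assumed choice of divergence for illustration only.
    """
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def rc_consistency_penalty(model: Callable[[str], List[float]], seq: str) -> float:
    """Penalty that is zero when the model is RC-invariant on `seq`."""
    return symmetric_kl(model(seq), model(reverse_complement(seq)))


def gc_model(seq: str) -> List[float]:
    """Toy RC-invariant classifier: GC content is unchanged by RC."""
    gc = sum(base in "GCgc" for base in seq) / len(seq)
    return [gc, 1.0 - gc]


def a_model(seq: str) -> List[float]:
    """Toy orientation-sensitive classifier: A content changes under RC."""
    a = sum(base in "Aa" for base in seq) / len(seq)
    return [a, 1.0 - a]
```

In a fine-tuning loop, a term of this kind would typically be added to the task loss with a weighting coefficient; that weighting is likewise not specified in the text above.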

machine learning, natural language, prediction, (19 more...)

2509.18529

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Jiang, Shiyu, Liu, Xuyin, Wang, Zitong Jerry

Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences

arXiv.org Artificial Intelligence, Aug-27-2025

Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations in synthetic expression cassettes with little evolutionary precedent. Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.

large language model, machine learning, natural language, (17 more...)

2506.10271

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Neural Information Processing Systems, May-27-2025, 12:02:30 GMT

Designing Cell-Type-Specific Promoter Sequences Using Conservative Model-Based Optimization

conservative model-based optimization, designing cell-type-specific promoter sequence, promoter, (4 more...)

Industry: Health & Medicine > Therapeutic Area (0.43)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Ghosh, Nimisha, Santoni, Daniele, Saha, Indrajit, Felici, Giovanni

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis

arXiv.org Artificial Intelligence, Dec-10-2024

In recent times, Transformer-based language models have made a considerable impact in the field of natural language processing. As relevant parallels can be drawn between biological sequences and natural languages, the models used in NLP can be readily extended and adapted for various applications in bioinformatics. In this regard, this paper introduces the major recent developments of Transformer-based models in the context of nucleotide sequences. We have reviewed and analysed a large number of application-based papers on this subject, highlighting their main characterizing features and the different approaches that may be adopted to customize such powerful computational machines. We have also provided a structured description of how Transformers function, which may enable even first-time users to grasp the essence of such complex architectures. We believe this review will help the scientific community understand the various applications of Transformer-based language models to nucleotide sequences. This work should also motivate readers to build on these methodologies to tackle other problems in the field of bioinformatics.

large language model, machine learning, natural language, (19 more...)

2412.07201

Country:

Europe > Italy > Tuscany > Florence (0.04)
Europe > Italy > Lazio > Rome (0.04)
Asia > India > West Bengal > Kolkata (0.04)
Asia > India > Odisha (0.04)

Genre:

Overview (0.86)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Jul-6-2024

Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery

Peng, Zhiyuan, Tang, Yuanbo, Li, Yang

DNA sequences encode vital genetic and biological information, yet these variable-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have been proposed for their effectiveness, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
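As background for the fixed-length representations mentioned above, the following sketch shows a plain k-mer frequency vector, one of the simplest ways to map a variable-length DNA sequence to a fixed-length numeric vector. This is not Dy-mer and involves no sparse recovery; the function name `kmer_vector` and the default `k=2` are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_vector(seq: str, k: int = 2) -> List[float]:
    """Return normalised k-mer frequencies over the alphabet ACGT.

    The output always has 4**k entries, regardless of len(seq), with
    entries ordered lexicographically (AA, AC, AG, AT, CA, ...).
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]

    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Normalise only over k-mers in the vocabulary, so windows containing
    # other symbols (e.g. N) are ignored rather than skewing the total.
    total = sum(counts[v] for v in vocab)
    return [counts[v] / total if total else 0.0 for v in vocab]
```

For example, `kmer_vector("ACGTAC")` and `kmer_vector("ACGTACGTTTGA")` both return 16 values, so sequences of different lengths become directly comparable.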

dna sequence, k-mers, representation, (15 more...)

2407.12051

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
(3 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Zhan, Huixin, Wu, Ying Nian, Zhang, Zijun

Efficient and Scalable Fine-Tune of Language Models for Genome Understanding

arXiv.org Artificial Intelligence, Feb-12-2024

Although DNA foundation models have advanced the understanding of genomes, they still face significant challenges in the limited scale and diversity of genomic data. This limitation starkly contrasts with the success of natural language foundation models, which thrive on substantially larger scales. Furthermore, genome understanding involves numerous downstream genome annotation tasks with inherent data heterogeneity, thereby necessitating more efficient and robust fine-tuning methods tailored for genomics. Lingo further accommodates numerous, heterogeneous downstream fine-tuning tasks by an adaptive rank sampling method that prunes and stochastically reintroduces pruned singular vectors within small computational budgets. Adaptive rank sampling outperformed existing fine-tuning methods on all 14 benchmarked genome understanding tasks, while requiring fewer than 2% of trainable parameters as genomic-specific adapters. Impressively, applying these adapters on natural language foundation models matched or even exceeded the performance of DNA foundation models. Lingo presents a new paradigm of efficient and scalable genome understanding via genomic-specific adapters on language models.

DNA foundation models, such as DNABERT [1], DNABERT-2 [2], and Nucleotide Transformer (NT) [3], have made significant progress in decoding the linguistic intricacies of the genome. An important paradigm for utilizing such DNA foundation models is "pre-training + fine-tuning", i.e., pre-training on unlabeled genomic sequences, followed by adaptation to a particular genome understanding task. A critical aspect of genome annotation and downstream tasks is their considerable number and diversity. For example, state-of-the-art deep learning models in epigenetics alone can encompass nearly 22,000 individual tasks [4].

adaptive rank, dna foundation model, foundation model, (16 more...)

2402.08075

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)