Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, Han Liu

2023-06-26

Tasks: Covid Variant Prediction · Promoter Detection · Transcription Factor Binding Site Prediction (Mouse) · Genome Understanding · Splice Site Prediction · Transcription Factor Binding Site Prediction (Human) · Epigenetic Marks Prediction · Core Promoter Detection

Paper · PDF · Code (official) · Code

Abstract

Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segments in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates 36 distinct datasets across 9 tasks, with input lengths ranging from 70 to 10000. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with 21× fewer parameters and approximately 92× less GPU time in pre-training.
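The BPE procedure the abstract describes — iteratively merging the most frequent adjacent pair of tokens until a target vocabulary size is reached — can be sketched in a few lines. This is an illustrative toy implementation over raw nucleotide strings, not the tokenizer DNABERT-2 actually ships (the paper's model uses a trained subword tokenizer over a large genome corpus); the function name and corpus are hypothetical.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE over DNA strings: tokens start as single nucleotides,
    and each iteration merges the most frequent adjacent token pair."""
    seqs = [list(s) for s in corpus]  # e.g. "ATGC" -> ["A","T","G","C"]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent co-occurring pair
        merges.append(best)
        merged_tok = best[0] + best[1]
        # Replace every non-overlapping occurrence of the pair, left to right.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged_tok)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

# Example: "AT" is the most frequent pair, so it is merged first;
# the resulting tokens are non-overlapping, unlike k-mer tokenization.
merges, tokenized = bpe_train(["ATATAT", "ATGC"], num_merges=2)
```

Note the contrast with k-mer tokenization: overlapping 3-mers of "ATATAT" would emit four tokens that each repeat two characters of their neighbor, whereas the BPE output covers the sequence with disjoint, variable-length tokens.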

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Promoter Detection | GUE | MCC | 84.21 | DNABERT-2-117M
Core Promoter Detection | GUE | MCC | 70.52 | DNABERT-2-117M
Splice Site Prediction | GUE | MCC | 84.99 | DNABERT-2-117M
Covid Variant Prediction | GUE | Avg F1 | 71.02 | DNABERT-2-117M
Epigenetic Marks Prediction | GUE | MCC | 55.98 | DNABERT-2-117M
Transcription Factor Binding Site Prediction | GUE | MCC | 70.1 | DNABERT-2-117M
Transcription Factor Binding Site Prediction | GUE | MCC | 67.99 | DNABERT-2-117M
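Most GUE tasks above are scored with the Matthews correlation coefficient (MCC), which summarizes a binary confusion matrix in a single value between -1 and 1 and is robust to class imbalance. A minimal sketch of the standard formula (the counts in the example are made up for illustration, not taken from the paper):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (the conventional fallback)."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# A perfect classifier scores 1.0; a reasonable one lands in between.
print(mcc(50, 50, 0, 0))    # 1.0
print(mcc(90, 80, 10, 20))  # ~0.70, roughly the range of the TFBS rows above
```

In practice one would use `sklearn.metrics.matthews_corrcoef` on label arrays; the table reports MCC scaled to 0-100.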

Related Papers

- BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects (2025-06-26)
- GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models (2025-05-16)
- Fast and Low-Cost Genomic Foundation Models via Outlier Removal (2025-05-01)
- AdvanceSplice: Integrating N-gram one-hot encoding and ensemble modeling for enhanced accuracy (2024-02-17)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding (2024-02-12)
- Identifying DNA Sequence Motifs Using Deep Learning (2023-11-20)
- Sequential Labelling and DNABERT For Splice Site Prediction in Homo Sapiens DNA (2022-12-15)
- SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study (2022-04-14)