TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Human Genome Book: Words, Sentences and Paragraphs

Human Genome Book: Words, Sentences and Paragraphs

Wang Liang

2025-01-23Protein Structure PredictionTransfer LearningSentence segmentation
PaperPDFCode(official)

Abstract

Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation editing and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge.Multilingual transfer ability, which assesses how well models fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to "DNA language" has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA words and mapped DNA words to their English equivalents.Subsequently, we fine-tuned this model using English datasets for paragraphing and sentence segmentation to develop models capable of segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 human genome by segmenting, tokenizing, and organizing it into a "book" comprised of genomic "words," "sentences," and "paragraphs." Additionally, based on the DNA-to-English vocabulary mapping, we created an "English version" of the genomic book. This study offers a novel perspective for understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction2025-07-18Disentangling coincident cell events using deep transfer learning and compressive sensing2025-07-17Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows2025-07-16Robust-Multi-Task Gradient Boosting2025-07-15Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift2025-07-12Conformation-Aware Structure Prediction of Antigen-Recognizing Immune Proteins2025-07-11The Bayesian Approach to Continual Learning: An Overview2025-07-11Contrastive and Transfer Learning for Effective Audio Fingerprinting through a Real-World Evaluation Protocol2025-07-08