Predicting the Sequence Specificities of DNA-Binding Proteins by DNA Fine-Tuned Language Model With Decaying Learning Rates. 2023

Ying He, Qinhu Zhang, Siguo Wang, Zhanheng Chen, Zhen Cui, Zhen-Hao Guo, and De-Shuang Huang

DNA-binding proteins (DBPs) play vital roles in the regulation of biological systems. Although many deep learning methods already exist for predicting the sequence specificities of DBPs, they face two challenges. First, classic deep learning methods for DBP prediction usually fail to capture the dependencies between genomic sequences because the one-hot codes they commonly use are mutually orthogonal. Second, these methods usually perform poorly when samples are scarce. To address these two challenges, we developed a novel language model for mining DBPs, named DNA Fine-tuned Language Model (DFLM), which is trained on human genomic data and ChIP-seq datasets with decaying learning rates. It captures the dependencies between genome sequences from the context of human genomic data and then fine-tunes the features for DBP tasks using different ChIP-seq datasets. We first compared DFLM with existing widely used methods on 69 datasets, where it achieved excellent performance. We then conducted comparative experiments on complex DBPs and small datasets, and the results show that DFLM still achieved a significant improvement. Finally, through visualization analysis of one-hot encoding and DFLM, we found that one-hot encoding completely cuts off the dependencies within DNA sequences, while DFLM, using a language model, represents these dependencies well. Source code is available at: https://github.com/Deep-Bioinfo/DFLM.
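
The orthogonality argument in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the DFLM repository: the function names, the k-mer tokenization, and the layer-wise geometric decay (with a hypothetical default factor of 2.0) are all assumptions, since the abstract does not specify the tokenization or the exact learning-rate schedule.

```python
# Illustrative sketch only; none of these names come from the DFLM source code.

NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Return the 4-dimensional one-hot vector for a single nucleotide."""
    return [1 if base == n else 0 for n in NUCLEOTIDES]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mers (an assumed tokenization)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def decayed_learning_rates(base_lr, n_layers, decay=2.0):
    """Assumed layer-wise schedule: the last layer trains at base_lr and each
    earlier layer's rate is divided by `decay` once more."""
    return [base_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


if __name__ == "__main__":
    # Distinct nucleotides always have a zero dot product, so one-hot codes
    # carry no notion of similarity or context between positions.
    for a in NUCLEOTIDES:
        print(a, [dot(one_hot(a), one_hot(b)) for b in NUCLEOTIDES])

    # A language model instead sees tokens in context, e.g. overlapping k-mers.
    print(kmer_tokens("ACGTAC", k=3))

    # Per-layer learning rates under the assumed geometric decay.
    print(decayed_learning_rates(0.01, n_layers=3))
```

Because every pair of distinct one-hot vectors has a zero dot product, any relationship between neighbouring bases has to be learned entirely by the downstream network, whereas a pretrained language model can supply contextual representations before fine-tuning.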

UI | MeSH Term | Description | Entries
D004247 | DNA | A deoxyribonucleotide polymer that is the primary genetic material of all cells. Eukaryotic and prokaryotic organisms normally contain DNA in a double-stranded state, yet several important biological processes transiently involve single-stranded regions. DNA, which consists of a polysugar-phosphate backbone possessing projections of purines (adenine and guanine) and pyrimidines (thymine and cytosine), forms a double helix that is held together by hydrogen bonds between these purines and pyrimidines (adenine to thymine and guanine to cytosine). | DNA, Double-Stranded; Deoxyribonucleic Acid; ds-DNA; DNA, Double Stranded; Double-Stranded DNA; ds DNA
D004268 | DNA-Binding Proteins | Proteins which bind to DNA. The family includes proteins which bind to both double- and single-stranded DNA and also includes specific DNA binding proteins in serum which can be used as markers for malignant diseases. | DNA Helix Destabilizing Proteins; DNA-Binding Protein; Single-Stranded DNA Binding Proteins; DNA Binding Protein; DNA Single-Stranded Binding Protein; SS DNA BP; Single-Stranded DNA-Binding Protein; Binding Protein, DNA; DNA Binding Proteins; DNA Single Stranded Binding Protein; DNA-Binding Protein, Single-Stranded; Protein, DNA-Binding; Single Stranded DNA Binding Protein; Single Stranded DNA Binding Proteins
D006801 | Humans | Members of the species Homo sapiens. | Homo sapiens; Man (Taxonomy); Human; Man, Modern; Modern Man
D000465 | Algorithms | A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task. | Algorithm
D016678 | Genome | The genetic complement of an organism, including all of its GENES, as represented in its DNA, or in some cases, its RNA. | Genomes
D023281 | Genomics | The systematic study of the complete DNA sequences (GENOME) of organisms. Included is construction of complete genetic, physical, and transcript maps, and the analysis of this structural genomic information on a global scale such as in GENOME WIDE ASSOCIATION STUDIES. | Functional Genomics; Structural Genomics; Comparative Genomics; Genomics, Comparative; Genomics, Functional; Genomics, Structural
