An ORFome assembly approach to metagenomics sequences analysis. 2008

Yuzhen Ye, and Haixu Tang
School of Informatics, Indiana University, Bloomington, Indiana 47408, USA. yye@indiana.edu

Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. Current analyses of metagenomic data rely largely on computational tools originally designed for microbial genomics projects. Assembling metagenomic sequences is challenging, mainly because of the short reads and the high species complexity of the community. Alternatively, individual (short) reads are searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The computational framework consists of three steps. First, each read from a metagenomics project is annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, ORFome assembly significantly increased the sensitivity of homology searching and may improve the diversity analysis of metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly does not work because of low sequence coverage.
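
As an illustration of the first step only (annotating a read with putative ORFs), the following is a minimal Python sketch and not the authors' implementation. It assumes the standard genetic code and treats any stop-free stretch in one of the six reading frames as a candidate ORF, since short reads often lack the start or stop codon of the gene they come from. The function names and the `min_len` threshold are hypothetical choices made for this example.

```python
# Standard genetic code, indexed by codons in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def putative_orfs(read, min_len=20):
    """List (strand, frame, peptide) for stop-free stretches of at least
    min_len residues in the six reading frames of a read.

    Assumption for this sketch: partial ORFs are accepted, so no start
    codon is required (hypothetical simplification)."""
    read = read.upper()
    results = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Stop codons ('*') split the frame into candidate peptides.
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    results.append((strand, frame, segment))
    return results
```

Peptides produced this way would then feed the assembly and database-search steps described in the abstract; a real pipeline would likely use a more careful ORF predictor and length cutoff.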

UI | MeSH Term | Description | Entries
D008969 | Molecular Sequence Data | Descriptions of specific amino acid, carbohydrate, or nucleotide sequences which have appeared in the published literature and/or are deposited in and maintained by databanks such as GENBANK, European Molecular Biology Laboratory (EMBL), National Biomedical Research Foundation (NBRF), or other sequence repositories. | Sequence Data, Molecular; Molecular Sequencing Data; Data, Molecular Sequence; Data, Molecular Sequencing; Sequencing Data, Molecular
D002874 | Chromosome Mapping | Any method used for determining the location of and relative distances between genes on a chromosome. | Gene Mapping; Linkage Mapping; Genome Mapping; Chromosome Mappings; Gene Mappings; Genome Mappings; Linkage Mappings; Mapping, Chromosome; Mapping, Gene; Mapping, Genome; Mapping, Linkage; Mappings, Chromosome; Mappings, Gene; Mappings, Genome; Mappings, Linkage
D004247 | DNA | A deoxyribonucleotide polymer that is the primary genetic material of all cells. Eukaryotic and prokaryotic organisms normally contain DNA in a double-stranded state, yet several important biological processes transiently involve single-stranded regions. DNA, which consists of a polysugar-phosphate backbone possessing projections of purines (adenine and guanine) and pyrimidines (thymine and cytosine), forms a double helix that is held together by hydrogen bonds between these purines and pyrimidines (adenine to thymine and guanine to cytosine). | DNA, Double-Stranded; Deoxyribonucleic Acid; ds-DNA; DNA, Double Stranded; Double-Stranded DNA; ds DNA
D000465 | Algorithms | A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task. | Algorithm
D001483 | Base Sequence | The sequence of PURINES and PYRIMIDINES in nucleic acids and polynucleotides. It is also called nucleotide sequence. | DNA Sequence; Nucleotide Sequence; RNA Sequence; DNA Sequences; Base Sequences; Nucleotide Sequences; RNA Sequences; Sequence, Base; Sequence, DNA; Sequence, Nucleotide; Sequence, RNA; Sequences, Base; Sequences, DNA; Sequences, Nucleotide; Sequences, RNA
D012689 | Sequence Homology, Nucleic Acid | The sequential correspondence of nucleotides in one nucleic acid molecule with those of another nucleic acid molecule. Sequence homology is an indication of the genetic relatedness of different organisms and gene function. | Base Sequence Homology; Homologous Sequences, Nucleic Acid; Homologs, Nucleic Acid Sequence; Homology, Base Sequence; Homology, Nucleic Acid Sequence; Nucleic Acid Sequence Homologs; Nucleic Acid Sequence Homology; Sequence Homology, Base; Base Sequence Homologies; Homologies, Base Sequence; Sequence Homologies, Base
D016366 | Open Reading Frames | A sequence of successive nucleotide triplets that are read as CODONS specifying AMINO ACIDS and begin with an INITIATOR CODON and end with a stop codon (CODON, TERMINATOR). | ORFs; Protein Coding Region; Small Open Reading Frame; Small Open Reading Frames; sORF; Unassigned Reading Frame; Unassigned Reading Frames; Unidentified Reading Frame; Coding Region, Protein; Frame, Unidentified Reading; ORF; Open Reading Frame; Protein Coding Regions; Reading Frame, Open; Reading Frame, Unassigned; Reading Frame, Unidentified; Region, Protein Coding; Unidentified Reading Frames
D016678 | Genome | The genetic complement of an organism, including all of its GENES, as represented in its DNA, or in some cases, its RNA. | Genomes
D017422 | Sequence Analysis, DNA | A multistage process that includes cloning, physical mapping, subcloning, determination of the DNA SEQUENCE, and information analysis. | DNA Sequence Analysis; Sequence Determination, DNA; Analysis, DNA Sequence; DNA Sequence Determination; DNA Sequence Determinations; DNA Sequencing; Determination, DNA Sequence; Determinations, DNA Sequence; Sequence Determinations, DNA; Analyses, DNA Sequence; DNA Sequence Analyses; Sequence Analyses, DNA; Sequencing, DNA
D023281 | Genomics | The systematic study of the complete DNA sequences (GENOME) of organisms. Included is construction of complete genetic, physical, and transcript maps, and the analysis of this structural genomic information on a global scale such as in GENOME WIDE ASSOCIATION STUDIES. | Functional Genomics; Structural Genomics; Comparative Genomics; Genomics, Comparative; Genomics, Functional; Genomics, Structural
