Genome annotation assessment in Drosophila melanogaster. 2000

M G Reese, G Hartzell, N L Harris, U Ohler, J F Abril, and S E Lewis
Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology, University of California, Berkeley 94720-3200, USA. mgreese@lbl.gov

Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
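
The abstract reports accuracy at two levels (coding nucleotides and intron/exon structure) without giving formulas. The sketch below is one way such measures could be computed. It assumes the definitions conventional in gene-finder evaluation: sensitivity = TP / (TP + FN), and "specificity" = TP / (TP + FP). The function names, the interval representation, and the coordinates in the usage note are hypothetical and are not taken from the GASP data.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: a coding exon as 1-based, inclusive (start, end).
Interval = Tuple[int, int]


def covered_positions(exons: Iterable[Interval]) -> Set[int]:
    """Return the set of nucleotide positions covered by any exon."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def base_level_accuracy(
    predicted: Iterable[Interval], standard: Iterable[Interval]
) -> Tuple[float, float]:
    """Nucleotide-level (sensitivity, specificity) against a standard set.

    Assumed definitions (not stated in the abstract):
      sensitivity = TP / (TP + FN): fraction of standard coding bases predicted
      specificity = TP / (TP + FP): fraction of predicted coding bases in the standard
    """
    pred = covered_positions(predicted)
    true = covered_positions(standard)
    tp = len(pred & true)
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


def exact_exon_sensitivity(
    predicted: Iterable[Interval], standard: Iterable[Interval]
) -> float:
    """Fraction of standard exons predicted with both boundaries exactly right."""
    std = set(standard)
    return len(std & set(predicted)) / len(std) if std else 0.0
```

With the made-up inputs `standard = [(101, 200), (301, 400)]` and `predicted = [(101, 200), (311, 420)]`, the base-level sensitivity is 0.95 while the exact-exon sensitivity is only 0.5. This mirrors the pattern in the abstract: a predictor can recover most coding nucleotides and still miss the correct exon boundaries.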

UI | MeSH Term | Description | Entries
D011401 | Promoter Regions, Genetic | DNA sequences which are recognized (directly or indirectly) and bound by a DNA-dependent RNA polymerase during the initiation of transcription. Highly conserved sequences within the promoter include the Pribnow box in bacteria and the TATA BOX in eukaryotes. | rRNA Promoter; Early Promoters, Genetic; Late Promoters, Genetic; Middle Promoters, Genetic; Promoter Regions; Promoter, Genetic; Promotor Regions; Promotor, Genetic; Pseudopromoter, Genetic; Early Promoter, Genetic; Genetic Late Promoter; Genetic Middle Promoters; Genetic Promoter; Genetic Promoter Region; Genetic Promoter Regions; Genetic Promoters; Genetic Promotor; Genetic Promotors; Genetic Pseudopromoter; Genetic Pseudopromoters; Late Promoter, Genetic; Middle Promoter, Genetic; Promoter Region; Promoter Region, Genetic; Promoter, Genetic Early; Promoter, rRNA; Promoters, Genetic; Promoters, Genetic Middle; Promoters, rRNA; Promotor Region; Promotors, Genetic; Pseudopromoters, Genetic; Region, Genetic Promoter; Region, Promoter; Region, Promotor; Regions, Genetic Promoter; Regions, Promoter; Regions, Promotor; rRNA Promoters
D004331 | Drosophila melanogaster | A species of fruit fly frequently used in genetics because of the large size of its chromosomes. | D. melanogaster; Drosophila melanogasters; melanogaster, Drosophila
D000426 | Alcohol Dehydrogenase | A zinc-containing enzyme which oxidizes primary and secondary alcohols or hemiacetals in the presence of NAD. In alcoholic fermentation, it catalyzes the final step of reducing an aldehyde to an alcohol in the presence of NADH and hydrogen. | Alcohol Dehydrogenase (NAD+); Alcohol Dehydrogenase I; Alcohol Dehydrogenase II; Alcohol-NAD+ Oxidoreductase; Yeast Alcohol Dehydrogenase; Alcohol Dehydrogenase, Yeast; Alcohol NAD+ Oxidoreductase; Dehydrogenase, Alcohol; Dehydrogenase, Yeast Alcohol; Oxidoreductase, Alcohol-NAD+
D000818 | Animals | Unicellular or multicellular, heterotrophic organisms, that have sensation and the power of voluntary movement. Under the older five kingdom paradigm, Animalia was one of the kingdoms. Under the modern three domain model, Animalia represents one of the many groups in the domain EUKARYOTA. | Animal; Metazoa; Animalia
D016208 | Databases, Factual | Extensive collections, reputedly complete, of facts and data garnered from material of a specialized subject area and made available for analysis and application. The collection can be automated by various contemporary methods for retrieval. The concept should be differentiated from DATABASES, BIBLIOGRAPHIC which is restricted to collections of bibliographic references. | Databanks, Factual; Data Banks, Factual; Data Bases, Factual; Data Bank, Factual; Data Base, Factual; Databank, Factual; Database, Factual; Factual Data Bank; Factual Data Banks; Factual Data Base; Factual Data Bases; Factual Databank; Factual Databanks; Factual Database; Factual Databases
D016678 | Genome | The genetic complement of an organism, including all of its GENES, as represented in its DNA, or in some cases, its RNA. | Genomes
D017344 | Genes, Insect | The functional hereditary units of INSECTS. | Insect Genes; Gene, Insect; Insect Gene
D017386 | Sequence Homology, Amino Acid | The degree of similarity between sequences of amino acids. This information is useful for analyzing the genetic relatedness of proteins and species. | Homologous Sequences, Amino Acid; Amino Acid Sequence Homology; Homologs, Amino Acid Sequence; Homologs, Protein Sequence; Homology, Protein Sequence; Protein Sequence Homologs; Protein Sequence Homology; Sequence Homology, Protein; Homolog, Protein Sequence; Homologies, Protein Sequence; Protein Sequence Homolog; Protein Sequence Homologies; Sequence Homolog, Protein; Sequence Homologies, Protein; Sequence Homologs, Protein
D018076 | DNA, Complementary | Single-stranded complementary DNA synthesized from an RNA template by the action of RNA-dependent DNA polymerase. cDNA (i.e., complementary DNA, not circular DNA, not C-DNA) is used in a variety of molecular cloning experiments as well as serving as a specific hybridization probe. | Complementary DNA; cDNA; cDNA Probes; Probes, cDNA
D019295 | Computational Biology | A field of biology concerned with the development of techniques for the collection and manipulation of biological data, and the use of such data to make biological discoveries or predictions. This field encompasses all computational methods and theories for solving biological problems including manipulation of models and datasets. | Bioinformatics; Molecular Biology, Computational; Bio-Informatics; Biology, Computational; Computational Molecular Biology; Bio Informatics; Bio-Informatic; Bioinformatic; Biologies, Computational Molecular; Biology, Computational Molecular; Computational Molecular Biologies; Molecular Biologies, Computational
