RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. 2024

Daniel H Haft, Azat Badretdin, George Coulouris, Michael DiCuccio, A Scott Durkin, Eric Jovenitti, Wenjun Li, Megdelawit Mersha, Kathleen R O'Neill, Joel Virothaisakun, and Françoise Thibaud-Nissen
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past three years, we have expanded the diversity of the RefSeq collection by including the best-quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.

UI MeSH Term Description Entries
D011506 Proteins Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein. Gene Products, Protein,Gene Proteins,Protein,Protein Gene Products,Proteins, Gene
D001105 Archaea One of the three domains of life (the others being BACTERIA and Eukarya), formerly called Archaebacteria under the taxon Bacteria, but now considered separate and distinct. They are characterized by: (1) the presence of characteristic tRNAs and ribosomal RNAs; (2) the absence of peptidoglycan cell walls; (3) the presence of ether-linked lipids built from branched-chain subunits; and (4) their occurrence in unusual habitats. While archaea resemble bacteria in morphology and genomic organization, they resemble eukarya in their method of genomic replication. The domain contains at least four kingdoms: CRENARCHAEOTA; EURYARCHAEOTA; NANOARCHAEOTA; and KORARCHAEOTA. Archaebacteria,Archaeobacteria,Archaeon,Archebacteria
D001419 Bacteria One of the three domains of life (the others being Eukarya and ARCHAEA), also called Eubacteria. They are unicellular prokaryotic microorganisms which generally possess rigid cell walls, multiply by cell division, and exhibit three principal forms: round or coccal, rodlike or bacillary, and spiral or spirochetal. Bacteria can be classified by their response to OXYGEN: aerobic, anaerobic, or facultatively anaerobic; by the mode by which they obtain their energy: chemotrophy (via chemical reaction) or PHOTOTROPHY (via light reaction); for chemotrophs by their source of chemical energy: CHEMOLITHOTROPHY (from inorganic compounds) or chemoorganotrophy (from organic compounds); and by their source for CARBON; NITROGEN; etc.; HETEROTROPHY (from organic sources) or AUTOTROPHY (from CARBON DIOXIDE). They can also be classified by whether or not they stain (based on the structure of their CELL WALLS) with CRYSTAL VIOLET dye: gram-negative or gram-positive. Eubacteria
D016680 Genome, Bacterial The genetic complement of a BACTERIA as represented in its DNA. Bacterial Genome,Bacterial Genomes,Genomes, Bacterial
D054892 Metagenome A collective genome representative of the many organisms, primarily microorganisms, existing in a community. Metagenomes
D058977 Molecular Sequence Annotation The addition of descriptive information about the function or structure of a molecular sequence to its MOLECULAR SEQUENCE DATA record. Gene Annotation,Protein Annotation,Annotation, Gene,Annotation, Molecular Sequence,Annotation, Protein,Annotations, Gene,Annotations, Molecular Sequence,Annotations, Protein,Gene Annotations,Molecular Sequence Annotations,Protein Annotations,Sequence Annotation, Molecular,Sequence Annotations, Molecular
D020407 Internet A loose confederation of computer communication networks around the world. The networks that make up the Internet are connected through several backbone networks. The Internet grew out of the US Government ARPAnet project and was designed to facilitate information exchange. World Wide Web,Cyber Space,Cyberspace,Web, World Wide,Wide Web, World
D020745 Genome, Archaeal The genetic complement of an archaeal organism (ARCHAEA) as represented in its DNA. Archaeal Genome,Archaeal Genomes,Genomes, Archaeal
D030561 Databases, Nucleic Acid Databases containing information about NUCLEIC ACIDS such as BASE SEQUENCE; SNPS; NUCLEIC ACID CONFORMATION; and other properties. Information about the DNA fragments kept in a GENE LIBRARY or GENOMIC LIBRARY is often maintained in DNA databases. DDBJ,DNA Data Bank of Japan,DNA Data Banks,DNA Databases,Databases, DNA,Databases, DNA Sequence,Databases, Nucleic Acid Sequence,Databases, RNA,Databases, RNA Sequence,EMBL Nucleotide Sequence Database,GenBank,Nucleic Acid Databases,RNA Databases,DNA Databanks,DNA Sequence Databases,European Molecular Biology Laboratory Nucleotide Sequence Database,Nucleic Acid Sequence Databases,RNA Sequence Databases,Bank, DNA Data,Banks, DNA Data,DNA Data Bank,DNA Databank,DNA Database,DNA Sequence Database,Data Bank, DNA,Data Banks, DNA,Databank, DNA,Databanks, DNA,Database, DNA,Database, DNA Sequence,Database, Nucleic Acid,Database, RNA,Database, RNA Sequence,Nucleic Acid Database,RNA Database,RNA Sequence Database,Sequence Database, DNA,Sequence Database, RNA,Sequence Databases, DNA,Sequence Databases, RNA
