Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities. 1998

J Gracy and P Argos
European Molecular Biology Laboratory, Heidelberg, Germany.

BACKGROUND Decomposing each protein into modular domains is a basic prerequisite for accurately classifying structural units in biological molecules. Boundaries between domains are indicated by two similar amino acid sequence segments located within the same protein (repeats) or within homologous proteins at notably different distances from their respective N- or C-termini.

RESULTS We have developed an automated method that combines such positional constraints, derived from various detected pairwise sequence similarities, to delineate the modular organization of proteins. The procedure has been applied to a non-redundant data set of 26 990 proteins whose sequences were taken from the PIR and SWISS-PROT databanks and shared <60% pairwise sequence identity. The resultant clustering, delineation and multiple alignment of 24 380 sequence fragments yielded a new database of 4364 domain families. Comparison of the domain collection with that of PRODOM indicates a clear improvement in the number and size of domain families, domain boundaries and multiple sequence alignments. The accuracy and sensitivity of the method are illustrated by results obtained for ankyrin-like repeats and EGF-like modules.

AVAILABILITY The resulting database, called DOMO, is available through the database search routine SRS at the Infobiogen (http://www.infobiogen.fr/srs5/), EBI (http://srs.ebi.ac.uk:5000/) and EMBL (http://www.embl-heidelberg.de/srs5/) World Wide Web sites.

CONTACT gracy@infobiogen.fr
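The positional constraint described in the abstract can be illustrated with a minimal sketch. This is not the DOMO procedure itself, only a toy example of how one pairwise similarity might suggest boundaries when the matched segments lie at notably different distances from the termini. The `Hit` record, the `boundary_hints` function, the `min_offset` threshold of 30 residues, and the gap-free projection of the subject termini onto the query are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One pairwise local similarity (hypothetical record, 1-based inclusive)."""
    q_start: int  # first matched residue in the query protein
    q_end: int    # last matched residue in the query protein
    q_len: int    # total length of the query protein
    s_start: int  # first matched residue in the subject protein
    s_end: int    # last matched residue in the subject protein
    s_len: int    # total length of the subject protein


def boundary_hints(hit: Hit, min_offset: int = 30) -> List[int]:
    """Return query positions where a new unit is inferred to start.

    min_offset is an assumed threshold for a "notably different"
    distance from a terminus; it is not taken from the paper.
    """
    hints = []

    # Residues preceding the matched segment in each protein.
    n_query = hit.q_start - 1
    n_subject = hit.s_start - 1
    if n_query - n_subject >= min_offset:
        # Project the subject N-terminus onto the query, ignoring gaps.
        hints.append(hit.q_start - n_subject)

    # Residues following the matched segment in each protein.
    c_query = hit.q_len - hit.q_end
    c_subject = hit.s_len - hit.s_end
    if c_query - c_subject >= min_offset:
        # First query residue past the projected subject C-terminus.
        hints.append(hit.q_end + c_subject + 1)

    return hints


# A 100-residue subject matching the middle of a 400-residue query.
example = Hit(q_start=210, q_end=300, q_len=400, s_start=10, s_end=100, s_len=100)
print(boundary_hints(example))
```

In practice many such hints from different pairwise similarities would have to be combined and reconciled, which is the part the method in the abstract automates; the sketch only covers a single hit.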

UI | MeSH Term | Description | Entries
D008969 | Molecular Sequence Data | Descriptions of specific amino acid, carbohydrate, or nucleotide sequences which have appeared in the published literature and/or are deposited in and maintained by databanks such as GENBANK, European Molecular Biology Laboratory (EMBL), National Biomedical Research Foundation (NBRF), or other sequence repositories. | Sequence Data, Molecular; Molecular Sequencing Data; Data, Molecular Sequence; Data, Molecular Sequencing; Sequencing Data, Molecular
D011506 | Proteins | Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein. | Gene Products, Protein; Gene Proteins; Protein; Protein Gene Products; Proteins, Gene
D011786 | Quality Control | A system for verifying and maintaining a desired level of quality in a product or process by careful planning, use of proper equipment, continued inspection, and corrective action as required. (Random House Unabridged Dictionary, 2d ed) | Control, Quality; Controls, Quality; Quality Controls
D012091 | Repetitive Sequences, Nucleic Acid | Sequences of DNA or RNA that occur in multiple copies. There are several types: INTERSPERSED REPETITIVE SEQUENCES are copies of transposable elements (DNA TRANSPOSABLE ELEMENTS or RETROELEMENTS) dispersed throughout the genome. TERMINAL REPEAT SEQUENCES flank both ends of another sequence, for example, the long terminal repeats (LTRs) on RETROVIRUSES. Variations may be direct repeats, those occurring in the same direction, or inverted repeats, those opposite to each other in direction. TANDEM REPEAT SEQUENCES are copies which lie adjacent to each other, direct or inverted (INVERTED REPEAT SEQUENCES). | DNA Repetitious Region; Direct Repeat; Genes, Selfish; Nucleic Acid Repetitive Sequences; Repetitive Region; Selfish DNA; Selfish Genes; DNA, Selfish; Repetitious Region, DNA; Repetitive Sequence; DNA Repetitious Regions; DNAs, Selfish; Direct Repeats; Gene, Selfish; Repeat, Direct; Repeats, Direct; Repetitious Regions, DNA; Repetitive Regions; Repetitive Sequences; Selfish DNAs; Selfish Gene
D002135 | Calcium-Binding Proteins | Proteins to which calcium ions are bound. They can act as transport proteins, regulator proteins, or activator proteins. They typically contain EF HAND MOTIFS. | Calcium Binding Protein; Calcium-Binding Protein; Calcium Binding Proteins; Binding Protein, Calcium; Binding Proteins, Calcium; Protein, Calcium Binding; Protein, Calcium-Binding
D004815 | Epidermal Growth Factor | A 6-kDa polypeptide growth factor initially discovered in mouse submaxillary glands. Human epidermal growth factor was originally isolated from urine based on its ability to inhibit gastric secretion and called urogastrone. Epidermal growth factor exerts a wide variety of biological effects including the promotion of proliferation and differentiation of mesenchymal and EPITHELIAL CELLS. It is synthesized as a transmembrane protein which can be cleaved to release a soluble active form. | EGF; Epidermal Growth Factor-Urogastrone; Urogastrone; Human Urinary Gastric Inhibitor; beta-Urogastrone; Growth Factor, Epidermal; Growth Factor-Urogastrone, Epidermal; beta Urogastrone
D000465 | Algorithms | A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task. | Algorithm
D000595 | Amino Acid Sequence | The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION. | Protein Structure, Primary; Amino Acid Sequences; Sequence, Amino Acid; Sequences, Amino Acid; Primary Protein Structure; Primary Protein Structures; Protein Structures, Primary; Structure, Primary Protein; Structures, Primary Protein
D016000 | Cluster Analysis | A set of statistical methods used to group variables or observations into strongly inter-related subgroups. In epidemiology, it may be used to analyze a closely grouped series of events or cases of disease or other health-related phenomenon with well-defined distribution patterns in relation to time or place or both. | Clustering; Analyses, Cluster; Analysis, Cluster; Cluster Analyses; Clusterings
D016208 | Databases, Factual | Extensive collections, reputedly complete, of facts and data garnered from material of a specialized subject area and made available for analysis and application. The collection can be automated by various contemporary methods for retrieval. The concept should be differentiated from DATABASES, BIBLIOGRAPHIC which is restricted to collections of bibliographic references. | Databanks, Factual; Data Banks, Factual; Data Bases, Factual; Data Bank, Factual; Data Base, Factual; Databank, Factual; Database, Factual; Factual Data Bank; Factual Data Banks; Factual Data Base; Factual Data Bases; Factual Databank; Factual Databanks; Factual Database; Factual Databases
