Some useful statistical properties of position-weight matrices. 1994

J M Claverie
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.

Position-weight matrices (or profiles) are simple mathematical objects traditionally used to capture information about local sequence patterns (or motifs) characteristic of a given structure or function. Although weight matrices can lead to fast database-scanning algorithms, their use has been limited by the lack of a reliable method for assessing the statistical significance of the matching scores. In this article I first review three different computation schemes for designing weight matrices from a block alignment of any (small or large) number of sequences. I then show that, for patterns spanning 10 positions or more, the best scores expected from matching random sequences are distributed according to the extreme value (Gumbel) distribution. The threshold of statistical significance assessed from this distribution perfectly delineates the range of scores characterizing "true positive" sequences (biologically significant matches). This result allows weight matrices to be used to scan an entire protein database for patterns in a highly sensitive way. MODEST (MOtif DEsign and Search Tools), a suite of programs in Unix/C, implements these statistical improvements and is available upon e-mail request (jmc@ncbi.nlm.nih.gov).
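The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the MODEST implementation, and several choices are assumptions made here for illustration only: a log-odds matrix built with a flat pseudocount and a uniform amino-acid background (which may not correspond to any of the three schemes the article reviews), a method-of-moments fit of the Gumbel parameters, and a made-up four-sequence block. All function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EULER_GAMMA = 0.5772156649015329


def build_pwm(block, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned segments.

    Assumption: uniform background and a flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(len(block[0])):
        column = [seq[pos] for seq in block]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pwm.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def best_score(pwm, sequence):
    """Highest matrix score over all ungapped windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location and scale."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * variance) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_threshold(mu, beta, p_value):
    """Score x with P(best score >= x) = p_value under Gumbel(mu, beta)."""
    return mu - beta * math.log(-math.log(1.0 - p_value))


# Hypothetical block alignment spanning 10 positions.
block = ["GDSGGPLVCK", "GDSGGPMVCQ", "GDSGGPLICN", "GDAGGPLVCE"]
pwm = build_pwm(block)

# Best scores from random sequences serve as the null distribution.
rng = random.Random(0)
random_seqs = [
    "".join(rng.choice(AMINO_ACIDS) for _ in range(300)) for _ in range(500)
]
null_scores = [best_score(pwm, seq) for seq in random_seqs]

mu, beta = fit_gumbel(null_scores)
threshold = gumbel_threshold(mu, beta, p_value=0.01)
print(f"mu={mu:.2f} beta={beta:.2f} threshold(p=0.01)={threshold:.2f}")
```

In this sketch, a database sequence whose best window score exceeds `threshold` would be reported as a candidate match at the chosen significance level. A real application would draw random sequences with a realistic residue composition and length distribution rather than the uniform model assumed here.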

UI | MeSH Term | Description | Entries
D008969 | Molecular Sequence Data | Descriptions of specific amino acid, carbohydrate, or nucleotide sequences which have appeared in the published literature and/or are deposited in and maintained by databanks such as GENBANK, European Molecular Biology Laboratory (EMBL), National Biomedical Research Foundation (NBRF), or other sequence repositories. | Sequence Data, Molecular; Molecular Sequencing Data; Data, Molecular Sequence; Data, Molecular Sequencing; Sequencing Data, Molecular
D011506 | Proteins | Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein. | Gene Products, Protein; Gene Proteins; Protein; Protein Gene Products; Proteins, Gene
D006026 | Glycoside Hydrolases | Any member of the class of enzymes that catalyze the cleavage of the glycosidic linkage of glycosides and the addition of water to the resulting molecules. | Endoglycosidase; Exoglycosidase; Glycohydrolase; Glycosidase; Glycosidases; Glycoside Hydrolase; Endoglycosidases; Exoglycosidases; Glycohydrolases; Hydrolase, Glycoside; Hydrolases, Glycoside
D000595 | Amino Acid Sequence | The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION. | Protein Structure, Primary; Amino Acid Sequences; Sequence, Amino Acid; Sequences, Amino Acid; Primary Protein Structure; Primary Protein Structures; Protein Structures, Primary; Structure, Primary Protein; Structures, Primary Protein
D001699 | Biometry | The use of statistical and mathematical methods to analyze biological observations and phenomena. | Biometric Analysis; Biometrics; Analyses, Biometric; Analysis, Biometric; Biometric Analyses
D016208 | Databases, Factual | Extensive collections, reputedly complete, of facts and data garnered from material of a specialized subject area and made available for analysis and application. The collection can be automated by various contemporary methods for retrieval. The concept should be differentiated from DATABASES, BIBLIOGRAPHIC which is restricted to collections of bibliographic references. | Databanks, Factual; Data Banks, Factual; Data Bases, Factual; Data Bank, Factual; Data Base, Factual; Databank, Factual; Database, Factual; Factual Data Bank; Factual Data Banks; Factual Data Base; Factual Data Bases; Factual Databank; Factual Databanks; Factual Database; Factual Databases
D016415 | Sequence Alignment | The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms. | Sequence Homology Determination; Determination, Sequence Homology; Alignment, Sequence; Alignments, Sequence; Determinations, Sequence Homology; Sequence Alignments; Sequence Homology Determinations
D017386 | Sequence Homology, Amino Acid | The degree of similarity between sequences of amino acids. This information is useful for analyzing the genetic relatedness of proteins and species. | Homologous Sequences, Amino Acid; Amino Acid Sequence Homology; Homologs, Amino Acid Sequence; Homologs, Protein Sequence; Homology, Protein Sequence; Protein Sequence Homologs; Protein Sequence Homology; Sequence Homology, Protein; Homolog, Protein Sequence; Homologies, Protein Sequence; Protein Sequence Homolog; Protein Sequence Homologies; Sequence Homolog, Protein; Sequence Homologies, Protein; Sequence Homologs, Protein
