Biomaterial Database

Comparison of methods for searching protein sequence databases. 1995

W R Pearson

Department of Biochemistry, University of Virginia, Charlottesville 22908, USA.

We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performed better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. ln()-scaling performed better than normalization based on other simple functions of library sequence length. ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.
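
The two ideas in the abstract that lend themselves to a small illustration are the gap penalty scheme (-12 for the first residue in a gap, -2 for each additional residue) and the scaling of similarity scores by the logarithm of library sequence length. The Python sketch below is illustrative only and is not taken from the paper. It assumes a toy match/mismatch scoring function (`toy_score`, hypothetical) in place of a real matrix such as BLOSUM55, and it assumes one plausible form of ln()-scaling: regress raw scores on ln(length) and report standardized residuals. The abstract does not give the exact formula, so `ln_scaled_zscores` should be read as an assumption about how such scaling could work.

```python
import math

def toy_score(x, y):
    """Hypothetical stand-in for a real matrix such as BLOSUM55."""
    return 5 if x == y else -4

def sw_affine_score(a, b, score=toy_score, gap_first=-12, gap_next=-2):
    """Best local alignment score (Smith-Waterman with affine gap penalties).

    A gap of k residues costs gap_first + (k - 1) * gap_next.
    """
    neg = float("-inf")
    m = len(b)
    prev_h = [0] * (m + 1)      # best scores for the previous row
    prev_f = [neg] * (m + 1)    # previous row, alignments ending in a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (m + 1)
        cur_f = [neg] * (m + 1)
        e = neg                 # current row, alignments ending in a gap in a
        for j in range(1, m + 1):
            e = max(cur_h[j - 1] + gap_first, e + gap_next)
            cur_f[j] = max(prev_h[j] + gap_first, prev_f[j] + gap_next)
            diag = prev_h[j - 1] + score(a[i - 1], b[j - 1])
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best

def ln_scaled_zscores(scores, lengths):
    """Assumed form of ln()-scaling: fit score = a + b * ln(length) by
    least squares, then return each residual divided by the residual
    standard deviation. Needs varied lengths and non-identical residuals."""
    x = [math.log(n) for n in lengths]
    k = len(scores)
    mean_x = sum(x) / k
    mean_y = sum(scores) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, scores)]
    sd = math.sqrt(sum(r * r for r in resid) / k)
    return [r / sd for r in resid]

if __name__ == "__main__":
    # One-residue gap: 9 matches (45) minus 12 for the gap.
    print(sw_affine_score("ACDEFGHIKL", "ACDEFHIKL"))
    # Two-residue gap: 45 - 12 - 2.
    print(sw_affine_score("ACDEFGGHIKL", "ACDEFHIKL"))
    # The middle library sequence scores far above the length trend.
    print(ln_scaled_zscores([30, 33, 90, 39, 42], [100, 200, 400, 800, 1600]))
```

Under this assumed scheme, a long library sequence with a moderately high raw score is ranked below a short sequence with the same raw score, because the expected score rises with ln(length). In practice the regression would be fitted over the scores from an entire library search rather than five values.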

UI	MeSH Term	Description	Entries
D011336	Probability	The study of chance processes or the relative frequency characterizing a chance process.	Probabilities
D011506	Proteins	Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein.	Gene Products, Protein,Gene Proteins,Protein,Protein Gene Products,Proteins, Gene
D012044	Regression Analysis	Procedures for finding the mathematical function which best describes the relationship between a dependent variable and one or more independent variables. In linear regression (see LINEAR MODELS) the relationship is constrained to be a straight line and LEAST-SQUARES ANALYSIS is used to determine the best fit. In logistic regression (see LOGISTIC MODELS) the dependent variable is qualitative rather than continuously variable and LIKELIHOOD FUNCTIONS are used to find the best relationship. In multiple regression, the dependent variable is considered to depend on more than a single independent variable.	Regression Diagnostics,Statistical Regression,Analysis, Regression,Analyses, Regression,Diagnostics, Regression,Regression Analyses,Regression, Statistical,Regressions, Statistical,Statistical Regressions
D005069	Evaluation Studies as Topic	Works about studies that determine the effectiveness or value of processes, personnel, and equipment, or the material on conducting such studies.	Critique,Evaluation Indexes,Evaluation Methodology,Evaluation Report,Evaluation Research,Methodology, Evaluation,Pre-Post Tests,Qualitative Evaluation,Quantitative Evaluation,Theoretical Effectiveness,Use-Effectiveness,Critiques,Effectiveness, Theoretical,Evaluation Methodologies,Evaluation Reports,Evaluation, Qualitative,Evaluation, Quantitative,Evaluations, Qualitative,Evaluations, Quantitative,Indexes, Evaluation,Methodologies, Evaluation,Pre Post Tests,Pre-Post Test,Qualitative Evaluations,Quantitative Evaluations,Report, Evaluation,Reports, Evaluation,Research, Evaluation,Test, Pre-Post,Tests, Pre-Post,Use Effectiveness
D000465	Algorithms	A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task.	Algorithm
D000595	Amino Acid Sequence	The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION.	Protein Structure, Primary,Amino Acid Sequences,Sequence, Amino Acid,Sequences, Amino Acid,Primary Protein Structure,Primary Protein Structures,Protein Structures, Primary,Structure, Primary Protein,Structures, Primary Protein
D000596	Amino Acids	Organic compounds that generally contain an amino (-NH2) and a carboxyl (-COOH) group. Twenty alpha-amino acids are the subunits which are polymerized to form proteins.	Amino Acid,Acid, Amino,Acids, Amino
D016208	Databases, Factual	Extensive collections, reputedly complete, of facts and data garnered from material of a specialized subject area and made available for analysis and application. The collection can be automated by various contemporary methods for retrieval. The concept should be differentiated from DATABASES, BIBLIOGRAPHIC which is restricted to collections of bibliographic references.	Databanks, Factual,Data Banks, Factual,Data Bases, Factual,Data Bank, Factual,Data Base, Factual,Databank, Factual,Database, Factual,Factual Data Bank,Factual Data Banks,Factual Data Base,Factual Data Bases,Factual Databank,Factual Databanks,Factual Database,Factual Databases
D016415	Sequence Alignment	The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms.	Sequence Homology Determination,Determination, Sequence Homology,Alignment, Sequence,Alignments, Sequence,Determinations, Sequence Homology,Sequence Alignments,Sequence Homology Determinations
D017386	Sequence Homology, Amino Acid	The degree of similarity between sequences of amino acids. This information is useful for analyzing the genetic relatedness of proteins and species.	Homologous Sequences, Amino Acid,Amino Acid Sequence Homology,Homologs, Amino Acid Sequence,Homologs, Protein Sequence,Homology, Protein Sequence,Protein Sequence Homologs,Protein Sequence Homology,Sequence Homology, Protein,Homolog, Protein Sequence,Homologies, Protein Sequence,Protein Sequence Homolog,Protein Sequence Homologies,Sequence Homolog, Protein,Sequence Homologies, Protein,Sequence Homologs, Protein