Confidence limits for homology in protein or gene sequences. The c-myc oncogene and adenovirus E1a protein. 1985

A D McLachlan, and D R Boswell

We describe new tests, of general application, for deciding whether two proteins or DNA sequences are significantly homologous, in cases where the relationship is neither evidently true nor evidently false. Ralston and Bishop's comparison of the c-myc oncogene with the adenovirus E1a protein is discussed as an example. When the comparison matrix test is used to establish a homology between two sequences it is necessary that the number of high scores exceeds the expected mean level for random sequences by a statistically significant margin. The mean level itself is found from the double matching probability distribution. In examples where the number of high scores is larger than expected, but the highest score is not in itself exceptional, the variance of the numbers of scores expected for unrelated sequences is an important factor. We have analysed these variances by several methods. A simple binomial distribution gives only a rather inaccurate and low first estimate, but we derive a more rigorous and accurate statistical treatment, to take account of the correlations between scores in different parts of the comparison matrix. The theory is exact for random DNA or protein sequences with fluctuating compositions, selected by random draws from an infinite pool. In the more realistic situation, where sequences of fixed composition are formed by random permutations of the original sets, the deviations are smaller, and have been analysed by computer simulation. We find that although the relationship proposed by Ralston & Bishop, between the c-myc oncogene and adenovirus E1a proteins, appears to be significant in the binomial approximation, it is not supported by the full analysis. We conclude that, in general, great care is needed to establish any weak homology on the basis of comparisons that include no truly exceptional high scores, but merely have an enhanced number of scores at the upper end of the expected distribution.
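The contrast between the binomial estimate and the fixed-composition simulation can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simple identity score summed over a diagonal window of the comparison matrix, and the names `count_high_scores`, `binomial_null`, `permutation_null`, `span` and `threshold` are hypothetical choices for illustration. The binomial estimate treats every window as independent, which is the simplification the abstract describes as giving too low a variance; the permutation estimate shuffles each sequence so that its composition stays fixed. The abstract's mean level from the double matching probability distribution is not reproduced here.

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance


def count_high_scores(a, b, span=5, threshold=4):
    """Count diagonal windows of length `span` whose identity score
    (number of matching positions) is at least `threshold`."""
    count = 0
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            score = sum(1 for k in range(span) if a[i + k] == b[j + k])
            if score >= threshold:
                count += 1
    return count


def binomial_null(a, b, span=5, threshold=4):
    """Naive estimate (assumption: all windows are independent).
    Returns (mean, variance) of the number of high-scoring windows."""
    fa, fb = Counter(a), Counter(b)
    # Chance that one residue of `a` matches one residue of `b`.
    p = sum(fa[x] * fb[x] for x in fa) / (len(a) * len(b))
    # Chance that a single window reaches the threshold.
    p_high = sum(
        math.comb(span, k) * p**k * (1 - p) ** (span - k)
        for k in range(threshold, span + 1)
    )
    n_windows = (len(a) - span + 1) * (len(b) - span + 1)
    return n_windows * p_high, n_windows * p_high * (1 - p_high)


def shuffled(seq, rng):
    """Random permutation of `seq`; composition is preserved."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def permutation_null(a, b, trials=200, span=5, threshold=4, seed=0):
    """Simulated (mean, variance) of the high-score count for
    fixed-composition random permutations of both sequences."""
    rng = random.Random(seed)
    counts = [
        count_high_scores(shuffled(a, rng), shuffled(b, rng), span, threshold)
        for _ in range(trials)
    ]
    return mean(counts), pvariance(counts)
```

A typical use would be to compute `count_high_scores` for the real pair, then ask how many standard deviations it lies above the simulated mean. Because overlapping windows share residues, their scores are correlated, so the simulated variance need not agree with the binomial one.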

UI | MeSH Term | Description | Entries
D009857 | Oncogenes | Genes whose gain-of-function alterations lead to NEOPLASTIC CELL TRANSFORMATION. They include, for example, genes for activators or stimulators of CELL PROLIFERATION such as growth factors, growth factor receptors, protein kinases, signal transducers, nuclear phosphoproteins, and transcription factors. A prefix of "v-" before oncogene symbols indicates oncogenes captured and transmitted by RETROVIRUSES; the prefix "c-" before the gene symbol of an oncogene indicates it is the cellular homolog (PROTO-ONCOGENES) of a v-oncogene. | Transforming Genes,Oncogene,Transforming Gene,Gene, Transforming,Genes, Transforming
D004279 | DNA, Viral | Deoxyribonucleic acid that makes up the genetic material of viruses. | Viral DNA
D000256 | Adenoviridae | A family of non-enveloped viruses infecting mammals (MASTADENOVIRUS) and birds (AVIADENOVIRUS) or both (ATADENOVIRUS). Infections may be asymptomatic or result in a variety of diseases. | Adenoviruses,Ichtadenovirus,Adenovirus,Ichtadenoviruses
D000595 | Amino Acid Sequence | The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION. | Protein Structure, Primary,Amino Acid Sequences,Sequence, Amino Acid,Sequences, Amino Acid,Primary Protein Structure,Primary Protein Structures,Protein Structures, Primary,Structure, Primary Protein,Structures, Primary Protein
D001483 | Base Sequence | The sequence of PURINES and PYRIMIDINES in nucleic acids and polynucleotides. It is also called nucleotide sequence. | DNA Sequence,Nucleotide Sequence,RNA Sequence,DNA Sequences,Base Sequences,Nucleotide Sequences,RNA Sequences,Sequence, Base,Sequence, DNA,Sequence, Nucleotide,Sequence, RNA,Sequences, Base,Sequences, DNA,Sequences, Nucleotide,Sequences, RNA
D013223 | Statistics as Topic | Works about the science and art of collecting, summarizing, and analyzing data that are subject to random variation. | Area Analysis,Estimation Technics,Estimation Techniques,Indirect Estimation Technics,Indirect Estimation Techniques,Multiple Classification Analysis,Service Statistics,Statistical Study,Statistics, Service,Tables and Charts as Topic,Analyses, Area,Analyses, Multiple Classification,Area Analyses,Classification Analyses, Multiple,Classification Analysis, Multiple,Estimation Technic, Indirect,Estimation Technics, Indirect,Estimation Technique,Estimation Technique, Indirect,Estimation Techniques, Indirect,Indirect Estimation Technic,Indirect Estimation Technique,Multiple Classification Analyses,Statistical Studies,Studies, Statistical,Study, Statistical,Technic, Indirect Estimation,Technics, Estimation,Technics, Indirect Estimation,Technique, Estimation,Technique, Indirect Estimation,Techniques, Estimation,Techniques, Indirect Estimation
D014764 | Viral Proteins | Proteins found in any species of virus. | Gene Products, Viral,Viral Gene Products,Viral Gene Proteins,Viral Protein,Protein, Viral,Proteins, Viral
