
Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. 1999

I B Rogozin, D D'Angelo, and L Milanesi

Istituto di Tecnologie Biomediche Avanzate CNR, via Fratelli Cervi 93, 20090 Segrate, Milan, Italy.

Identifying genes in a completely new sequence that has no good homology with known protein sequences can be a very complex task. To identify protein-coding regions in such cases, a new method, 'SYNCOD', based on the analysis of conservative evolutionary properties of coding regions, has been developed. The program identifies and uses coding-region homologies with non-annotated (unknown) protein-coding sequences already present in the nucleotide sequence databases, working from the alignments produced by BLASTN. For each open reading frame, the ratio of the number of mismatches resulting in synonymous codons to the number of mismatches resulting in non-synonymous codons is estimated. Monte Carlo simulations are then used to estimate the significance of the deviation of this ratio from random behavior. The SYNCOD program has been tested on generated random sequences and on different control sets. The approach showed high accuracy in predicting protein-coding regions (the correlation coefficient, CC, varies from 0.67 to 0.79) and high specificity (the proportion of wrong exons, WE, varies from 0.06 to 0.07). The SYNCOD program is hosted on the ITBA-CNR Web Server and can be used via the Internet (URL: www.itba.mi.cnr.it/webgene).
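The idea of comparing synonymous and non-synonymous mismatches in a reading frame, and then asking how unusual the result is under randomization, can be illustrated with a minimal Python sketch. This is not the SYNCOD implementation: the function names (`classify_mismatches`, `synonymous_fraction`, `monte_carlo_p_value`), the per-base way of classifying a mismatch, the use of the synonymous fraction as the test statistic, and the null model (the same number of mismatches placed at random positions) are all assumptions made for illustration. The input is assumed to be an ungapped pair of aligned nucleotide strings, such as one segment taken from a BLASTN alignment.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_mismatches(query, subject, frame=0):
    """Count (synonymous, non-synonymous) mismatches in one reading frame.

    Assumption: each mismatching base is judged on its own by substituting
    it into the query codon and checking whether the amino acid changes.
    """
    syn = nonsyn = 0
    usable = min(len(query), len(subject))
    for start in range(frame, usable - 2, 3):
        q = query[start:start + 3]
        s = subject[start:start + 3]
        if q not in CODON_TABLE or s not in CODON_TABLE:
            continue  # skip codons with ambiguous or non-ACGT symbols
        for i in range(3):
            if q[i] != s[i]:
                mutated = q[:i] + s[i] + q[i + 1:]
                if CODON_TABLE[mutated] == CODON_TABLE[q]:
                    syn += 1
                else:
                    nonsyn += 1
    return syn, nonsyn


def synonymous_fraction(syn, nonsyn):
    """Fraction of mismatches that are synonymous (0.0 if there are none)."""
    total = syn + nonsyn
    return syn / total if total else 0.0


def monte_carlo_p_value(query, subject, frame=0, trials=1000, seed=0):
    """Estimate how often random mismatches look at least as 'coding-like'.

    Assumed null model: the observed number of mismatches is scattered
    uniformly over the query, each replaced by a random different base.
    """
    rng = random.Random(seed)
    n = min(len(query), len(subject))
    query, subject = query[:n], subject[:n]
    observed = synonymous_fraction(*classify_mismatches(query, subject, frame))
    n_mismatches = sum(q != s for q, s in zip(query, subject))
    hits = 0
    for _ in range(trials):
        randomized = list(query)
        for pos in rng.sample(range(n), n_mismatches):
            randomized[pos] = rng.choice([b for b in BASES if b != query[pos]])
        simulated = synonymous_fraction(
            *classify_mismatches(query, "".join(randomized), frame)
        )
        if simulated >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    q = "GCTGCTGCTGCT"
    s = "GCCGCAGCGGCT"  # differs from q only at third codon positions
    for frame in range(3):
        print(frame, classify_mismatches(q, s, frame),
              monte_carlo_p_value(q, s, frame))
```

In this toy pair the mismatches are all synonymous in frame 0 and all non-synonymous in frame 1, which is the kind of contrast between frames that a test of this sort relies on. A real analysis would also need to handle gaps, both strands, and multiple database hits, none of which is covered here.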

UI	MeSH Term	Description	Entries
D008432	Mathematical Computing	Computer-assisted interpretation and analysis of various mathematical functions related to a particular problem.	Statistical Computing,Computing, Statistical,Mathematic Computing,Statistical Programs, Computer Based,Computing, Mathematic,Computing, Mathematical,Computings, Mathematic,Computings, Mathematical,Computings, Statistical,Mathematic Computings,Mathematical Computings,Statistical Computings
D008969	Molecular Sequence Data	Descriptions of specific amino acid, carbohydrate, or nucleotide sequences which have appeared in the published literature and/or are deposited in and maintained by databanks such as GENBANK, European Molecular Biology Laboratory (EMBL), National Biomedical Research Foundation (NBRF), or other sequence repositories.	Sequence Data, Molecular,Molecular Sequencing Data,Data, Molecular Sequence,Data, Molecular Sequencing,Sequencing Data, Molecular
D009010	Monte Carlo Method	In statistics, a technique for numerically approximating the solution of a mathematical problem by studying the distribution of some random variable, often generated by a computer. The name alludes to the randomness characteristic of the games of chance played at the gambling casinos in Monte Carlo. (From Random House Unabridged Dictionary, 2d ed, 1993)	Method, Monte Carlo
D011506	Proteins	Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein.	Gene Products, Protein,Gene Proteins,Protein,Protein Gene Products,Proteins, Gene
D000465	Algorithms	A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task.	Algorithm
D000595	Amino Acid Sequence	The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION.	Protein Structure, Primary,Amino Acid Sequences,Sequence, Amino Acid,Sequences, Amino Acid,Primary Protein Structure,Primary Protein Structures,Protein Structures, Primary,Structure, Primary Protein,Structures, Primary Protein
D001483	Base Sequence	The sequence of PURINES and PYRIMIDINES in nucleic acids and polynucleotides. It is also called nucleotide sequence.	DNA Sequence,Nucleotide Sequence,RNA Sequence,DNA Sequences,Base Sequences,Nucleotide Sequences,RNA Sequences,Sequence, Base,Sequence, DNA,Sequence, Nucleotide,Sequence, RNA,Sequences, Base,Sequences, DNA,Sequences, Nucleotide,Sequences, RNA
D012984	Software	Sequential operating programs and data which instruct the functioning of a digital computer.	Computer Programs,Computer Software,Open Source Software,Software Engineering,Software Tools,Computer Applications Software,Computer Programs and Programming,Computer Software Applications,Application, Computer Software,Applications Software, Computer,Applications Softwares, Computer,Applications, Computer Software,Computer Applications Softwares,Computer Program,Computer Software Application,Engineering, Software,Open Source Softwares,Program, Computer,Programs, Computer,Software Application, Computer,Software Applications, Computer,Software Tool,Software, Computer,Software, Computer Applications,Software, Open Source,Softwares, Computer Applications,Softwares, Open Source,Source Software, Open,Source Softwares, Open,Tool, Software,Tools, Software
D016208	Databases, Factual	Extensive collections, reputedly complete, of facts and data garnered from material of a specialized subject area and made available for analysis and application. The collection can be automated by various contemporary methods for retrieval. The concept should be differentiated from DATABASES, BIBLIOGRAPHIC which is restricted to collections of bibliographic references.	Databanks, Factual,Data Banks, Factual,Data Bases, Factual,Data Bank, Factual,Data Base, Factual,Databank, Factual,Database, Factual,Factual Data Bank,Factual Data Banks,Factual Data Base,Factual Data Bases,Factual Databank,Factual Databanks,Factual Database,Factual Databases
D016415	Sequence Alignment	The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms.	Sequence Homology Determination,Determination, Sequence Homology,Alignment, Sequence,Alignments, Sequence,Determinations, Sequence Homology,Sequence Alignments,Sequence Homology Determinations