Efficient generative modeling of protein sequences using simple autoregressive models. 2021

Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, and Martin Weigt
Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France.

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
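
To make the abstract's three operations concrete (sequence generation, sequence probability, and entropy-based sizing of sequence space), here is a minimal sketch of a generic autoregressive sequence model. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the form of the conditionals, so the softmax parameterization with site fields `h` and couplings `J`, the 21-letter alphabet (20 amino acids plus a gap symbol), the sequence length of 112, and all function names are hypothetical choices made for this example.

```python
import math
import numpy as np

def conditional_probs(params, prefix, i):
    """P(a_i | a_1..a_{i-1}) as a softmax over a site field plus couplings
    to the already-chosen residues (assumed parameterization)."""
    h, J = params                      # h: (L, q), J: (L, L, q, q)
    logits = h[i].copy()
    for j, a_j in enumerate(prefix):   # only earlier positions j < i contribute
        logits += J[i, j, :, a_j]
    logits -= logits.max()             # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_prob(params, seq):
    """Exact log P(seq) = sum_i log P(a_i | a_<i); no partition function needed."""
    return sum(math.log(conditional_probs(params, seq[:i], i)[a])
               for i, a in enumerate(seq))

def sample(params, length, rng):
    """Draw one sequence position by position (ancestral sampling)."""
    seq = []
    for i in range(length):
        p = conditional_probs(params, seq, i)
        seq.append(int(rng.choice(len(p), p=p)))
    return seq

def entropy_estimate(params, length, rng, n_samples):
    """Monte Carlo estimate of the entropy S = -E[log P(seq)], in nats."""
    return -float(np.mean([log_prob(params, sample(params, length, rng))
                           for _ in range(n_samples)]))

def log10_fraction(entropy_nats, length, q):
    """log10 of (effective number of sequences exp(S)) / (q ** length)."""
    return entropy_nats / math.log(10) - length * math.log10(q)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, q = 10, 21                      # toy length; q = 21 is an assumption
    params = (rng.normal(size=(L, q)), 0.1 * rng.normal(size=(L, L, q, q)))
    s = sample(params, L, rng)
    print("sample:", s, "log P:", log_prob(params, s))
    print("entropy (nats):", entropy_estimate(params, L, rng, 200))
    # Arithmetic behind the abstract's figures, assuming length 112 and q = 21:
    print("log10 fraction:", log10_fraction(68 * math.log(10), 112, 21))
```

The design point this sketch tries to capture is that each conditional is normalized on its own, so the full sequence probability is an exact product of conditionals and sampling needs a single pass over positions, with no Markov chain Monte Carlo. The last line checks the abstract's arithmetic under the assumed length and alphabet: an entropy corresponding to about 10^68 sequences, divided by 21^112 (about 10^148), gives a fraction on the order of 10^-80.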

UI | MeSH Term | Description | Entries
D009154 | Mutation | Any detectable and heritable change in the genetic material that causes a change in the GENOTYPE and which is transmitted to daughter cells and to succeeding generations. | Mutations
D011506 | Proteins | Linear POLYPEPTIDES that are synthesized on RIBOSOMES and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. The specific sequence of AMINO ACIDS determines the shape the polypeptide will take, during PROTEIN FOLDING, and the function of the protein. | Gene Products, Protein,Gene Proteins,Protein,Protein Gene Products,Proteins, Gene
D004843 | Epistasis, Genetic | A form of gene interaction whereby the expression of one gene interferes with or masks the expression of a different gene or genes. Genes whose expression interferes with or masks the effects of other genes are said to be epistatic to the effected genes. Genes whose expression is affected (blocked or masked) are hypostatic to the interfering genes. | Deviation, Epistatic,Epistatic Deviation,Genes, Epistatic,Genes, Hypostatic,Epistases, Genetic,Gene-Gene Interaction, Epistatic,Gene-Gene Interactions, Epistatic,Genetic Epistases,Genetic Epistasis,Interaction Deviation,Non-Allelic Gene Interactions,Epistatic Gene,Epistatic Gene-Gene Interaction,Epistatic Gene-Gene Interactions,Epistatic Genes,Gene Gene Interaction, Epistatic,Gene Gene Interactions, Epistatic,Gene Interaction, Non-Allelic,Gene Interactions, Non-Allelic,Gene, Epistatic,Gene, Hypostatic,Hypostatic Gene,Hypostatic Genes,Interaction, Epistatic Gene-Gene,Interaction, Non-Allelic Gene,Interactions, Epistatic Gene-Gene,Interactions, Non-Allelic Gene,Non Allelic Gene Interactions,Non-Allelic Gene Interaction
D000069550 | Machine Learning | A type of ARTIFICIAL INTELLIGENCE that enable COMPUTERS to independently initiate and execute LEARNING when exposed to new data. | Transfer Learning,Learning, Machine,Learning, Transfer
D000595 | Amino Acid Sequence | The order of amino acids as they occur in a polypeptide chain. This is referred to as the primary structure of proteins. It is of fundamental importance in determining PROTEIN CONFORMATION. | Protein Structure, Primary,Amino Acid Sequences,Sequence, Amino Acid,Sequences, Amino Acid,Primary Protein Structure,Primary Protein Structures,Protein Structures, Primary,Structure, Primary Protein,Structures, Primary Protein
D015233 | Models, Statistical | Statistical formulations or analyses which, when applied to data and found to fit the data, are then used to verify the assumptions and parameters used in the analysis. Examples of statistical models are the linear model, binomial model, polynomial model, two-parameter model, etc. | Probabilistic Models,Statistical Models,Two-Parameter Models,Model, Statistical,Models, Binomial,Models, Polynomial,Statistical Model,Binomial Model,Binomial Models,Model, Binomial,Model, Polynomial,Model, Probabilistic,Model, Two-Parameter,Models, Probabilistic,Models, Two-Parameter,Polynomial Model,Polynomial Models,Probabilistic Model,Two Parameter Models,Two-Parameter Model
D016415 | Sequence Alignment | The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms. | Sequence Homology Determination,Determination, Sequence Homology,Alignment, Sequence,Alignments, Sequence,Determinations, Sequence Homology,Sequence Alignments,Sequence Homology Determinations
D019143 | Evolution, Molecular | The process of cumulative change at the level of DNA; RNA; and PROTEINS, over successive generations. | Molecular Evolution,Genetic Evolution,Evolution, Genetic
D019295 | Computational Biology | A field of biology concerned with the development of techniques for the collection and manipulation of biological data, and the use of such data to make biological discoveries or predictions. This field encompasses all computational methods and theories for solving biological problems including manipulation of models and datasets. | Bioinformatics,Molecular Biology, Computational,Bio-Informatics,Biology, Computational,Computational Molecular Biology,Bio Informatics,Bio-Informatic,Bioinformatic,Biologies, Computational Molecular,Biology, Computational Molecular,Computational Molecular Biologies,Molecular Biologies, Computational
D030562 | Databases, Protein | Databases containing information about PROTEINS such as AMINO ACID SEQUENCE; PROTEIN CONFORMATION; and other properties. | Amino Acid Sequence Databases,Databases, Amino Acid Sequence,Protein Databases,Protein Sequence Databases,SWISS-PROT,Protein Structure Databases,SwissProt,Database, Protein,Database, Protein Sequence,Database, Protein Structure,Databases, Protein Sequence,Databases, Protein Structure,Protein Database,Protein Sequence Database,Protein Structure Database,SWISS PROT,Sequence Database, Protein,Sequence Databases, Protein,Structure Database, Protein,Structure Databases, Protein
