Group sparse canonical correlation analysis for genomic data integration. 2013

Dongdong Lin, Jigang Zhang, Jingyao Li, Vince D Calhoun, Hong-Wen Deng, and Yu-Ping Wang
Biomedical Engineering Department, Tulane University, New Orleans, LA, USA.

BACKGROUND The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNPs), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. Exploring the relationships between these different types of genomic datasets remains challenging. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. Conventional CCA does not work effectively when the number of samples is significantly smaller than the number of biomarkers, which is typical for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, these methods overlook structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group).
RESULTS We propose a new group sparse CCA method (CCA-sparse group), along with an effective numerical algorithm, to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that includes the existing sCCA models. We apply the model to feature/variable selection from two datasets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human glioma data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristics of the selected features. Pathway analysis is further performed for biological interpretation of those features.
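The abstract does not spell out the numerical algorithm, so the following is only a minimal sketch of how a sparse group penalty could be combined with CCA. It assumes (these are assumptions, not the authors' stated method) that the within-set covariance matrices are approximated by the identity, that the canonical vectors are updated by alternating thresholded power iterations on the cross-product matrix, and that the penalty is applied through the standard sparse group lasso operator (elementwise soft-thresholding followed by group-wise shrinkage). All function names, parameters, and default values are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding (l-1 shrinkage)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_group_threshold(a, groups, lam_l1, lam_group):
    """Sparse group operator: l-1 shrinkage, then shrink each group's l-2 norm.

    `groups` is an integer label per feature (hypothetical encoding).
    """
    z = soft_threshold(np.asarray(a, dtype=float), lam_l1)
    out = np.zeros_like(z)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(z[idx])
        if norm > lam_group:  # otherwise the whole group is dropped
            out[idx] = (1.0 - lam_group / norm) * z[idx]
    return out

def normalize(w):
    """Scale to unit l-2 norm; leave an all-zero vector unchanged."""
    n = np.linalg.norm(w)
    return w / n if n > 0 else w

def sparse_group_cca(X, Y, groups_x, groups_y,
                     lam_l1=0.05, lam_group=0.05, n_iter=100, tol=1e-6):
    """First pair of sparse canonical vectors (sketch, identity-covariance assumption)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    K = X.T @ Y                                  # cross-product matrix
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    if s[0] > 0:
        K = K / s[0]                             # rescale so thresholds are comparable
    u, v = U[:, 0], Vt[0]                        # start from leading singular vectors
    for _ in range(n_iter):
        u_new = normalize(sparse_group_threshold(K @ v, groups_x, lam_l1, lam_group))
        v_new = normalize(sparse_group_threshold(K.T @ u_new, groups_y, lam_l1, lam_group))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v
```

In this sketch, setting `lam_group=0` leaves only the l-1 shrinkage and setting `lam_l1=0` leaves only the group-wise shrinkage, which loosely mirrors the abstract's remark that a more general formulation can include the existing sCCA models.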
CONCLUSIONS The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying more true positives among the correlated features while keeping total discordance lower on the simulated data, even when no group effect exists or when irrelevant features are grouped with truly correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
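The comparison above is stated in terms of true positives and total discordance on simulated data. The abstract does not define total discordance; the sketch below assumes it is the number of false positives plus the number of false negatives, a convention used elsewhere in the sparse CCA literature. The function name and arguments are hypothetical.

```python
import numpy as np

def selection_metrics(weights, true_support, tol=0.0):
    """Compare selected features (nonzero weights) with a known true support.

    Assumption: total discordance = false positives + false negatives.
    """
    selected = np.abs(np.asarray(weights, dtype=float)) > tol
    truth = np.zeros(selected.shape, dtype=bool)
    truth[list(true_support)] = True
    tp = int(np.sum(selected & truth))    # correctly selected features
    fp = int(np.sum(selected & ~truth))   # selected but not truly correlated
    fn = int(np.sum(~selected & truth))   # truly correlated but missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "total_discordance": fp + fn,
    }

# Example: features 0, 1, 2 are truly correlated; the estimate picks 0, 2, 4.
metrics = selection_metrics([0.4, 0.0, -0.2, 0.0, 0.1], true_support=[0, 1, 2])
```

Under this definition, a method that selects too few features is penalized through false negatives and one that selects redundant features is penalized through false positives, so both failure modes named in the conclusions raise the same score.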

UI | MeSH Term | Description | Entries
D005819 | Genetic Markers | A phenotypically recognizable genetic trait which can be used to identify a genetic locus, a linkage group, or a recombination event. | Chromosome Markers,DNA Markers,Markers, DNA,Markers, Genetic,Genetic Marker,Marker, Genetic,Chromosome Marker,DNA Marker,Marker, Chromosome,Marker, DNA,Markers, Chromosome
D006801 | Humans | Members of the species Homo sapiens. | Homo sapiens,Man (Taxonomy),Human,Man, Modern,Modern Man
D000465 | Algorithms | A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task. | Algorithm
D015894 | Genome, Human | The complete genetic complement contained in the DNA of a set of CHROMOSOMES in a HUMAN. The length of the human genome is about 3 billion base pairs. | Human Genome,Genomes, Human,Human Genomes
D055106 | Genome-Wide Association Study | An analysis comparing the allele frequencies of all available (or a whole GENOME representative set of) polymorphic markers to identify gene candidates or quantitative trait loci associated with a specific organism trait or specific disease or condition. | Genome Wide Association Analysis,Genome Wide Association Study,GWA Study,Genome Wide Association Scan,Genome Wide Association Studies,Whole Genome Association Analysis,Whole Genome Association Study,Association Studies, Genome-Wide,Association Study, Genome-Wide,GWA Studies,Genome-Wide Association Studies,Studies, GWA,Studies, Genome-Wide Association,Study, GWA,Study, Genome-Wide Association
D056915 | DNA Copy Number Variations | Stretches of genomic DNA that exist in different multiples between individuals. Many copy number variations have been associated with susceptibility or resistance to disease. | Copy Number Polymorphism,DNA Copy Number Variant,Copy Number Changes, DNA,Copy Number Polymorphisms,Copy Number Variants, DNA,Copy Number Variation, DNA,DNA Copy Number Change,DNA Copy Number Changes,DNA Copy Number Polymorphism,DNA Copy Number Polymorphisms,DNA Copy Number Variants,DNA Copy Number Variation,Polymorphism, Copy Number,Polymorphisms, Copy Number
D018511 | Systems Integration | The procedures involved in combining separately developed modules, components, or subsystems so that they work together as a complete system. (From McGraw-Hill Dictionary of Scientific and Technical Terms, 4th ed) | Integration, Systems,Integrations, Systems,Systems Integrations
D020641 | Polymorphism, Single Nucleotide | A single nucleotide variation in a genetic sequence that occurs at appreciable frequency in the population. | SNPs,Single Nucleotide Polymorphism,Nucleotide Polymorphism, Single,Nucleotide Polymorphisms, Single,Polymorphisms, Single Nucleotide,Single Nucleotide Polymorphisms
D023281 | Genomics | The systematic study of the complete DNA sequences (GENOME) of organisms. Included is construction of complete genetic, physical, and transcript maps, and the analysis of this structural genomic information on a global scale such as in GENOME WIDE ASSOCIATION STUDIES. | Functional Genomics,Structural Genomics,Comparative Genomics,Genomics, Comparative,Genomics, Functional,Genomics, Structural
