The “Analysis Commons”, which relies on a new team-science model for genetic epidemiology, integrates multi-omic data and rich phenotypic and clinical information from diverse population studies into a single shared analytic platform that leverages the resources of a cloud-computing environment and allows for distributed access. The number of whole-genome sequencing (WGS) studies with large sample sizes is rapidly expanding. Projects such as the NHLBI TOPMed Program, the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, and the Centers for Common Disease Genomics (CCDG), among others, have already conducted WGS in more than 100,000 individuals, and the Personalized Medicine Initiative promises WGS in over a million samples in the near future. These programs span a diverse set of studies and institutions, many of which lack the computational infrastructure to store and compute on data at this scale. Genomic, epigenomic, metabolic and proteomic data derived from expensive assays often do not exist in large numbers in any single study, but represent a powerful discovery resource when they are combined across studies and integrated with phenotypic data.
Altogether, many population-based studies have now collected data on tens of thousands of variables over several decades, and the addition of WGS data to cohorts with long-term prospective follow-up provides a powerful resource for immediate discovery. Analysis of WGS data for large samples presents formidable computational and administrative challenges. Evaluation of rare genetic variants in WGS data requires manipulation of data sets that are tens to hundreds of terabytes in size and are prohibitively large for exchange between analysis sites. In contrast, pooled data sets, which include genotype and phenotype data from all participants in the contributing individual studies, provide for practical and efficient WGS analysis. Creation of such large pooled data sets containing harmonized multi-omic, phenotype and clinical data with appropriate meta-data (for example, parent-study information and use permissions) is difficult and time-consuming.
The cloud-based Analysis Commons [see attached article] brings together genotype and phenotype data from multiple studies in a setting that is accessible by multiple investigators. This framework addresses many of the challenges of multi-center WGS analyses, including data-sharing mechanisms, phenotype harmonization, integrated multi-omics analyses, annotation, and computational flexibility. In this setting, the computational pipeline facilitates a sequence-to-discovery analysis workflow illustrated [see attached] by an analysis of plasma fibrinogen levels in almost 4,000 individuals from the National Heart, Lung, and Blood Institute (NHLBI)’s Trans-Omics for Precision Medicine (TOPMed) WGS program. The Analysis Commons represents a novel model for translating WGS resources from a massive quantity of phenotypic and genomic data into knowledge of the determinants of health and disease risk in diverse human populations.
Nat Genet Nov 2017; 49: 1560–1563