Analysis Commons, a team approach to discovery in a big-data environment for genetic epidemiology

The “Analysis Commons,” which relies on a new team-science model for genetic epidemiology, integrates multi-omic data and rich phenotypic and clinical information from diverse population studies into a single shared analytic platform that leverages the resources of a cloud-computing environment and allows for distributed access. The number of whole-genome sequencing (WGS) studies with large sample sizes is rapidly expanding. Projects such as the NHLBI TOPMed Program, the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, and the Centers for Common Disease Genomics (CCDG), among others, have already conducted WGS in more than 100,000 individuals, and the Precision Medicine Initiative promises WGS in over a million samples soon. These programs span a diverse set of studies and institutions, many of which lack the computational infrastructure to store and compute on data of this scale. Genomic, epigenomic, metabolic and proteomic data derived from expensive assays often do not exist in large numbers in any single study, but they represent a powerful discovery resource when combined across studies and integrated with phenotypic data.

Altogether, many population-based studies have now collected data on tens of thousands of variables over several decades, and the addition of WGS data to cohorts with long-term prospective follow-up provides a powerful resource for immediate discovery. Analysis of WGS data for large samples presents formidable computational and administrative challenges. Evaluation of rare genetic variants in WGS data requires manipulation of data sets that are tens to hundreds of terabytes in size and are prohibitively large for exchange between analysis sites. In contrast, pooled data sets, which include genotype and phenotype data from all participants in the contributing individual studies, provide for practical and efficient WGS analysis. Creation of such large pooled data sets containing harmonized multi-omic, phenotype and clinical data with appropriate meta-data (for example, parent-study information and use permissions) is difficult and time-consuming.

The cloud-based Analysis Commons [see attached article] brings together genotype and phenotype data from multiple studies in a setting that is accessible by multiple investigators. This framework addresses many of the challenges of multi-center WGS analyses, including data-sharing mechanisms, phenotype harmonization, integrated multi-omics analyses, annotation, and computational flexibility. In this setting, the computational pipeline facilitates a sequence-to-discovery analysis workflow, illustrated [see attached] by an analysis of plasma fibrinogen levels in almost 4,000 individuals from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) WGS program. The Analysis Commons represents a novel model for translating WGS resources from a massive quantity of phenotypic and genomic data into knowledge of the determinants of health and disease risk in diverse human populations.

Nat Genet Nov 2017; 49: 1560–1563
