Making the Most of Clumping and Thresholding for Polygenic Risk Scores

Fine-tuning polygenic risk scores (PRS) falls within the GEITP theme of gene-environment interactions. The ability to predict disease risk accurately (for complex diseases that involve contributions from hundreds or thousands of genes) is one of the goals of “precision medicine”. As more population-scale genetic datasets become available, PRS are expected to become more accurate and more clinically relevant. The most commonly used method for computing PRS is clumping and thresholding (C+T; also known as pruning and thresholding, P+T). The C+T polygenic score is defined as the sum of allele counts (i.e. genotypes), weighted by effect sizes estimated from genome-wide association studies (GWAS), after two filtering steps have been applied.
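To make the basic formula concrete, here is a toy sketch of the underlying score (before any filtering): each individual's PRS is the sum, over variants, of the allele count multiplied by the GWAS-estimated effect size. All data below is synthetic and the variable names are my own.

```python
import numpy as np

# Toy illustration of the polygenic-score formula:
# PRS_i = sum_j beta_j * G_ij, where G_ij is the allele count (0, 1, or 2)
# of variant j in individual i, and beta_j is its GWAS-estimated effect size.
rng = np.random.default_rng(0)
n_individuals, n_variants = 5, 8
genotypes = rng.integers(0, 3, size=(n_individuals, n_variants))  # allele counts
betas = rng.normal(0.0, 0.1, size=n_variants)                     # effect sizes

prs = genotypes @ betas  # one polygenic score per individual
```

In the C+T method, the two filtering steps described next simply restrict which variants enter this sum.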

More specifically, the variants are first clumped (C), so that only variants that are weakly correlated with one another are retained. Clumping selects the most significant variant iteratively, computes the correlation between this index variant and nearby variants within some window of size wc, and removes all nearby variants whose correlation with the index variant exceeds a particular value, r2c. Thresholding (T) consists of removing variants that have a P-value larger than a chosen level of significance (P > PT). Both steps, clumping and thresholding, represent statistical compromises between signal and noise. The clumping step prunes redundant correlated effects caused by linkage disequilibrium (LD; recall that ‘LD’ is the non-random association of alleles at different loci, along any chromosome, in any given population) between variants. However, this procedure may also remove independently predictive variants that happen to be in LD.
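The greedy clumping procedure described above can be sketched as follows. This is a simplified, hypothetical implementation (the function name, arguments, and defaults are my own, not the authors'): keep the most significant remaining variant, drop its high-LD neighbors within the window, and repeat.

```python
import numpy as np

def clump(pvalues, positions, corr, window_kb=250, r2_threshold=0.1):
    """Simplified greedy clumping sketch.

    pvalues:   GWAS P-value per variant
    positions: base-pair position per variant
    corr:      pairwise correlation matrix between variants
    """
    order = np.argsort(pvalues)               # most significant first
    removed = np.zeros(len(pvalues), dtype=bool)
    kept = []
    for idx in order:
        if removed[idx]:
            continue
        kept.append(idx)
        # Nearby variants within the clumping window wc ...
        near = np.abs(positions - positions[idx]) <= window_kb * 1_000
        # ... that are too correlated (squared correlation above r2c)
        high_ld = corr[idx] ** 2 > r2_threshold
        removed |= near & high_ld
        removed[idx] = False  # never remove the index variant itself
    return sorted(kept)
```

For example, given four variants forming two LD pairs, the function keeps only the most significant variant of each pair.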

Similarly, the thresholding step must balance between including truly predictive variants and reducing noise in the score by excluding variants with null effects. When applying C+T, one must select three hyper-parameters: the squared-correlation threshold (r2c) and the window size (wc) of clumping, along with the P-value threshold (PT). To compute the PRS, target sample genotypes are usually imputed (i.e. genotypes at untyped variants are statistically inferred from nearby genotyped variants, using a reference panel) to some degree of precision, in order to match the variants of the summary statistics. Inclusion of imputed variants with relatively low imputation quality is common (under the assumption that using more variants in the model yields better prediction). Authors [see attached article] explored the validity of this approach and suggested an additional threshold (INFOT) on the quality of imputation (often called the INFO score) as a fourth parameter of the C+T method.
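A minimal sketch of how many C+T scores arise from a grid over the per-variant thresholds is shown below. It assumes a prior clumping step has already produced the set of retained variants; the function name and arguments are illustrative, not the authors' implementation.

```python
import itertools
import numpy as np

# Hypothetical sketch: one C+T score per combination of P-value threshold
# (PT) and imputation-quality threshold (INFOT). `kept` holds the indices
# of variants assumed to have survived a prior clumping step.
def ct_scores_grid(genotypes, betas, pvalues, info, kept,
                   p_thresholds, info_thresholds):
    scores = {}
    for pt, it in itertools.product(p_thresholds, info_thresholds):
        mask = np.zeros(len(betas), dtype=bool)
        mask[kept] = True                       # survived clumping
        mask &= (pvalues <= pt) & (info >= it)  # thresholding filters
        scores[(pt, it)] = genotypes[:, mask] @ betas[mask]
    return scores
```

Extending the grid with the clumping hyper-parameters (r2c, wc) multiplies the number of scores further, which is how the authors arrive at thousands of candidate C+T scores.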

Authors [see article] demonstrate an efficient way to derive thousands of different C+T scores — corresponding to a grid over these four hyper-parameters. For example, it takes several hours to derive 123,000 different C+T scores for 300,000 individuals and 1 million variants, using 16 physical cores. Authors found that optimizing — over these four hyper-parameters — improves the predictive performance of C+T in both simulations and real-data applications, as compared to tuning only the P-value threshold. Authors further proposed stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores, by using an efficient penalized regression. Authors applied SCT to eight different case-control diseases in the UK Biobank data, and they found that SCT substantially improves prediction accuracy — with an average area-under-the-curve (AUC) increase of 0.035 over the standard C+T method. In my opinion, the ability to accurately predict disease risk from PRS still has a lo-o-o-o-n-n-n-ng way to go. ☹
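To make the stacking idea concrete, here is a small synthetic sketch, with plain ridge regression standing in for the authors' efficient penalized regression (which is designed for biobank-scale data). All data and parameter choices below are illustrative only.

```python
import numpy as np

# Sketch of the stacking idea behind SCT: instead of picking the single
# best C+T score, learn a weighted combination of all of them.
rng = np.random.default_rng(1)
n, n_scores = 200, 10
ct_matrix = rng.normal(size=(n, n_scores))   # columns = C+T scores on a grid
# Synthetic case-control phenotype driven by the first two scores:
y = (ct_matrix[:, 0] + 0.5 * ct_matrix[:, 1]
     + rng.normal(scale=0.5, size=n) > 0).astype(float)

lam = 1.0  # ridge penalty strength (arbitrary choice for this sketch)
weights = np.linalg.solve(ct_matrix.T @ ct_matrix + lam * np.eye(n_scores),
                          ct_matrix.T @ (y - y.mean()))
sct = ct_matrix @ weights  # SCT score: learned combination of all C+T scores
```

The penalty shrinks the weights of uninformative C+T scores toward zero, so the stacked score leans on the threshold combinations that actually predict the phenotype in the training set.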


Am J Hum Genet 5 Dec 2019; 105: 1213–1221
